Setting a Hard threshold for modified base aggregation

nanoporetech / megalodon

Megalodon is a research command line tool to extract high accuracy modified base and sequence variant calls from raw nanopore reads by anchoring the information rich basecalling neural network output to a reference genome/transriptome.

Other

193 stars 30 forks source link

Setting a Hard threshold for modified base aggregation #47

Closed AAnnan closed 4 years ago

AAnnan commented 4 years ago

Hello,

I am trying to use of the --mod-binary-threshold option in Megalodon's mod base aggregation. It is described as (probability of modified/canonical base) and has a default value of 0.75. My main question is: how is it calculated exactly?

I would like to use other Megalodon outputs to calculate the threshold that fits best my data and my analysis, however I haven't succeeded yet.

My process was:

For 2 libraries (fully methylated and negative control) processed by Megalodon from _per_read_modified_basecalls.txt calculate per_read_scores = mod_log_probs - can_log_prob. Set a range of thresholds, for each one calculate the methylation proportion (#methylated sites/#non-methylated sites)in each sample. Get TP: True Positives (Proportion of Methylated Sites in Methylated library) Get FP: False Positives (Proportion of Methylated Sites in Negative Control) Pick the threshold that maximize TP-FP. Aggregate real data with this threshold.

I tried this and per_read_scores = exp(mod_log_probs - can_log_prob). For the log ratio I got a threshold of 0.89, for the exp(log ratio) I got a threshold of 2.42.

I found 0.89 gave me too many FP still, while 2.42 didn't count any site as methylated.

marcus1487 commented 4 years ago

As noted in the help text (from megalodon --help-long) the --mod-binary-threshold value is "Threshold for modified base aggregation (probability of modified/canonical base).". The default value of 0.8 means that to be counted as a modified base the probability of a particular modified base must be greater than 0.8. For a read to be counted as canonical the canonical base probability must be greater than 0.8. For any position where no probability is greater than 0.8 neither canonical not modified bases get a count.

This was done for a number of reasons. First is to accommodate multiple alternative bases to a canonical base (such as the 5mC/5hmC model). A log likelihood threshold could add a count to both 5mC and 5hmC for a single read which could make downstream analysis confusing. Second was to remove non-confident calls thus producing more accurate fraction modified estimates. Note that this is the same strategy used by nanopolish (though the threshold is set in log likelihood space which is valid as there are no models in nanopolish with multiple alternative/modified bases).

The fraction of uncertain calls for each model from calibration data can be found in the plots in the Megalodon repository under megalodon/megalodon/model_data/*/megalodon_mod_calibration.pdf (for example see 5mC CpG minion R9.4.1 Rerio model here). Note also that these plots can be created for your own data with the megalodon_extras calibrate modified_bases command and you can set multiple thresholds to test with the --pdf-prob-thresholds argument. See more docs on model calibration here.

AAnnan commented 4 years ago

Thanks for your help. I'm new to Nanopore data and I'm still trying to figure out how this all works.

I launched megalodon_extras calibrate generate_modified_base_stats using my fully methylated control data as --control-megalodon-results-dir and the data I want to call as input. It's followed by megalodon_extras calibrate modified_bases using the stats just created as --ground-truth-llrs. I'll then relaunch Megalodon on my data using the newly created calibration file as --mod-calibration-filename, as well as observe the calibration plots to choose an aggregation threshold.

I had a look at the mod calibration plot for the model I'm using (Rerio's all context). It doesn't look like this model will be very good at separating Zs from Cs. I'm trying to call methylated Cytosins in 2 contexts, GC and CG, which makes for 3 non-overlapping contexts, HCG GCH and GCG. That's my input for Megalodon's --mod-motif. Could you possibly confirm that res_dna_r941_min_modbases-all-context_v001.cfg is the model I want to use?

A last question concerning aggregation and downstream analysis: The BedMethyl file doesn't not contain information about the motif the mod base was found in. Therefore, in my case, I cannot merge information from the two strands: In case of CpG calls, forward and reverse strand information can be merged by subtracting 1 to the positions of basecalls on the reverse strand. In the same logic GpC calls would require to add 1 to the positions of basecalls on the reverse strand, and GCG calls would need to be left as is. Does this mean that in order to merge BedMethyl information from both strand and from all motifs, I'd need to run Megalodon 3 times, once for each motif ?

marcus1487 commented 4 years ago

For the first question, res_dna_r941_min_modbases-all-context_v001.cfg is likely the best model for all context 5mC calling, though the res_dna_r941_min_modbases_5mC_CpG_v001.cfg model does show some ability to detect 5mC outside of CpG contexts. I would test this on your data to see what works best. As a model to mark up your own data for model training I would doubt that the difference between these two models would make much difference. A new all-context 5mC model should be coming to Rerio soon and then that one should be used in place of both of these models for 5mC detection.

For the second set of questions, this is actually a situation I've run into recently. Given that the main megalodon command is more computationally expensive (has to run the basecaller) I would recommend running megalodon only once, but if you have ample compute running megalodon multiple times is a solution here. I would instead recommend using third party tools to separate the bedmethyl files. A bed file could be created with the locations of the 3 motifs, then bedtools intersect could be used to extract these locations from the main megalodon output file.

Given that all of this is quite complicated, in the next release (early next week), a new megalodon_extras command has been added which will split a modified bases database by reference sequence motif. This will involve re-running megalodon since the modified base database schema has also changed in this forthcoming release. Once this next release is out this command would be the recommended solution for this situation, though it still might be slower than the bed files with motif locations solution.

AAnnan commented 4 years ago

Thanks for your input. I look forward to try out the new model from Rerio and Megalodon's next release.

AAnnan commented 4 years ago

Hello Marcus,

On the same topic of modified base aggregation, I picked a position at random on chrV and looked at the aggregate data for 2 different thresholds (k), then I looked at the same position in the per_read_modified_base_calls file and checked if they matched.

Following your above explanation of the binary threshold, I expected: for k=0.6: 9 reads with a mod_prob > 0.6 , 0 reads with a can_prob > 0.6 for k=0.85: 5 reads with a mod_prob > 0.85 , 0 reads with a can_prob > 0.85

However, this is not what I got. From the table below, only 6 reads have a mod_prob > 0.6 and 4 > 0.85.

Am I looking at this the wrong way?

BedMethyl k=0.6
chrV    12110094    12110095    .   9   +   12110094    12110095    0,0,0   9   100
BedMethyl k=0.85
chrV    12110094    12110095    .   5   +   12110094    12110095    0,0,0   5   100

per_read_modified_base_calls unlogged
read_id|chr|strand|pos|mod_prob|can_prob
947e2cf6-8490-4b2c-b4d0-78120e3de6ea|chrV|1|12110094|0.189376|0.216537
98a4a197-ebec-436a-bc8b-53f0ca118ffa|chrV|1|12110094|0.990151|3.53499e-06
0dd16e4f-deda-4175-8adb-0c377fba0405|chrV|1|12110094|0.735863|0.00828431
ac4630f5-2e85-46bb-bf0d-495bdcb9136b|chrV|1|12110094|0.371444|0.0889038
62ca5063-c25e-471d-abcd-42e34217aa3e|chrV|1|12110094|0.959327|9.45096e-05
45c0808a-9bd3-44d1-8ae6-6d06fc88eabb|chrV|1|12110094|0.966536|6.00149e-05
c4a3e37f-29fc-4a06-9021-49bf72378c60|chrV|1|12110094|0.680048|0.013503
7061399b-96ad-4ad0-b33b-27120d580863|chrV|1|12110094|0.37811|0.0859778
f167c21e-bb29-4b45-9e54-e96fb0d0fa0d|chrV|1|12110094|0.332723|0.107702
34d91e65-c976-4a65-bc60-dc780c907bfa|chrV|1|12110094|0.960644|8.75288e-05

marcus1487 commented 4 years ago

It is odd that the mod probability and canonical probabilities do not sum to 1. Does the model used here have more than 1 modified base (e.g. 5mC and 5hmC model)? Would it be possible to provide the raw output of the per_read_modified_base_calls file for this position?

AAnnan commented 4 years ago

The model used is the All-Context 5mC by Rerio: res_dna_r941_min_modbases-all-context_v001.cfg

This is the raw output of per_read_modified_base_calls for this position:

947e2cf6-8490-4b2c-b4d0-78120e3de6ea chrV 1 12110094 -0.7226739977359252 -0.6644672502055191 Z GCG:1
98a4a197-ebec-436a-bc8b-53f0ca118ffa chrV 1 12110094 -0.004298609357603844 -5.451612248558269 Z GCG:1
0dd16e4f-deda-4175-8adb-0c377fba0405 chrV 1 12110094 -0.13320291900470727 -2.0817438830471344 Z GCG:1
ac4630f5-2e85-46bb-bf0d-495bdcb9136b chrV 1 12110094 -0.4301063976071987 -1.0510797315094773 Z GCG:1
62ca5063-c25e-471d-abcd-42e34217aa3e chrV 1 12110094 -0.018033556997778533 -4.024524207193623 Z GCG:1
45c0808a-9bd3-44d1-8ae6-6d06fc88eabb chrV 1 12110094 -0.014781790828441999 -4.221740996351207 Z GCG:1
c4a3e37f-29fc-4a06-9021-49bf72378c60 chrV 1 12110094 -0.1674604053933888 -1.8695703591983042 Z GCG:1
7061399b-96ad-4ad0-b33b-27120d580863 chrV 1 12110094 -0.4223819447293963 -1.0656136803882856 Z GCG:1
f167c21e-bb29-4b45-9e54-e96fb0d0fa0d chrV 1 12110094 -0.47791781166605524 -0.9677765649738068 Z GCG:1
34d91e65-c976-4a65-bc60-dc780c907bfa chrV 1 12110094 -0.0174373077511909 -4.057849229236101 Z GCG:1

This is the Megalodon command for this output:

megalodon ./final_fast5s/${lib} --guppy-server-path ${GUPPY_DIR}/guppy_basecall_server \
        --guppy-params "-d ./rerio/basecall_models/" \
        --guppy-config res_dna_r941_min_modbases-all-context_v001.cfg \
        --outputs basecalls mappings mods per_read_mods mod_mappings \
        --output-directory ./${lib}_${mod_motifs[i]} \
        --reference $genomeFile \
        --mod-motif Z GCG 1 \
        --write-mods-text \
        --mod-aggregate-method binary_threshold \
        --mod-binary-threshold 0.6 \
        --mod-output-formats bedmethyl wiggle \
        --mod-map-base-conv C T --mod-map-base-conv Z C \
        --devices 0 --processes 28

marcus1487 commented 4 years ago

It looks like the issue might be with the probability conversion. The values stored in the per-read calls text file (and the database) are the natural log and not log 10.

AAnnan commented 4 years ago

Alright, that makes sense, thank you.

franrodalg commented 3 years ago

Hi Marcus!

Sorry for resurfacing this, but is the fact that the per-read call files report natural log documented anywhere?

Cheers, Fran

marcus1487 commented 3 years ago

I don't know if it is specifically noted that the logarithm is the natural logarithm. I will try to remedy this in a future release.