nanoporetech / tombo

Tombo is a suite of tools primarily for the identification of modified nucleotides from raw nanopore sequencing data.
Other
225 stars 55 forks source link

Estimation of methylation is way off #207

Open Sebidoo964 opened 4 years ago

Sebidoo964 commented 4 years ago

Hi When performing the analysis of finding m6A in DNA using alternative model with the default values the methylation rates are too high, for the control ones which were pcr products causing misstrust in the methylated results. When counting the methylated bases I find a 90 - 98 % methylation across all the samples. I get this by counting the amount of positions presented in the wiggle files (dampened fraction) and dividing with the reference A positions. Why is this? Are the default likelihood settings too lenient? If so what should I change, since I should expect close to 0% methylation, give or take due to errors from the platform, in my pcr product. I tried to limit the amount of methylated positions by limiting the minimum % of allowed reads containing the methylated position since the default is 1 read. This however still does not represent the expected results.

marcus1487 commented 4 years ago

For the dampened fraction output, the locations for each location covered by a mapped read are reported. Thus each position output should not be considered modified. For each position within the dampened fraction output, the estimated fraction of modified bases at that position is reported. Thus even a position with an estimated 0% of modified bases would be included in the output file. In order to obtain a more accurate estimate for the fraction of modified bases from a sample, the fraction of modified bases could be combined across all sites (ideally including coverage information).

Sebidoo964 commented 4 years ago

Okay I must have misunderstood the output and it's meaning. I am however still a little unsure of how I should proceed to assess if the positions actually/ certainly are modified. Could you please describe a little more of how I should proceed ,since the only information i get from the dampened result wiggle files are the two columns and nothing more. Are there some added commands I should include in any of the steps (text_output or detect_modifications)?

marcus1487 commented 4 years ago

Here is a link to the docs for the text output commands. From there you can also find a link to the wiggle format here. If this does not provide the answers for your analysis goals, please post here.

Sebidoo964 commented 4 years ago

I have looked at both these page before and still have doubts of how to actually come to the conclusion of which sites are methylated or not. In your first response, do you mena i can specify more than one filetype simultaneously or should i test for several filetypes seperately and then kind of merge/ overlap that information to assess the methylated positions and how certain it is? I am quite new to tombo and nanopore so I would really appreciate it if you could give more specific explanations to what commands i should run to achieve the sites that are methylated.

marcus1487 commented 4 years ago

Defining methylated sites versus un-methylated sites is a bit complicated as methylation often occurs at a fractional level. This is why tombo outputs an estimated fraction of modified bases at each position. In fact most positions are not 100% modified or 0% modified at the ground truth level, though there are certainly biological examples of a wide range of methylation profiles. Generally the read coverage provides a measure of the confidence for a particular fraction methylated estimate, but tombo does not provide an explicit output to measure the confidence in a methylated fraction estimate. Tombo does provide per-read statistics (as an optional output from the detect_modifications command), that could be used to assess the methylation confidence on a per-read/per-site level.

At a higher level, defining "fully methylated" versus "fully un-methylated" sites is more a task left to the user. As an example reasonable values might be 5% and 95% methylation (potentially with a 10X coverage filter), where sites with methylation fraction between these values could be ignored. The tombo output is intended to provide the starting point for further analyses. Depending upon the goals of the analyses a different tombo result may be more or less useful. Please post if there is a particular analysis goal for which I could provide further assistance.

Sebidoo964 commented 4 years ago

Ok I still feel I haven't gotten an answer to how I should proceed to assess whether methylation has occurred/ exists or not. What it sounded like is that I can't detect that using tombo alone. I have created per-read statistics files but can't find how to incorporate the log likelihood in the output files. The reason I which to determine wether positions are methylated or not and to what percentage the methylation occurs in the sample is to verify the accuracy of Nanopore to detect m6A methylation and if it can be recreated for our samples. We wish to proceed to see the effects using m6A enzymes on cell-lines to see expressional changes and chromatin availability. But inorder to get there I need to verify if I in fact can verify if the enzyme is adding methylgroups on the appropriate bases. By doing PCR amplification of the control sample should strip away the methyl groups meaning 0% methylation should be recieved by the analysis. By presenting the percentage of methylgroups tombo decides are significant or has a higher probability of being methylated I can present the successful/ unsuccessful data to my supervisor. I still haven't gotten an answer on how I can achieve this using this analysis method.

marcus1487 commented 4 years ago

The fraction of methylated bases estimated by tombo at each reference position is found within the dampened_fraction wiggle file. As defined in the previously attached link, each line in a wiggle file contains the reference position and the score (here the fraction of methylated bases). These should be the two values required for your described analyses. Could you describe what data are missing in order to complete analyses you wish to complete?

Sebidoo964 commented 4 years ago

Well as I mentioned before, my control sample which has been stripped of methylation through PCR gets a 100% positions with fractions. As i limit the minimum allowed fraction using a perl script extracting the positions containing a higher or equal to fraction set (0.25, 0.5 and 0.75) I still get a very high percentage of positions with an allowed fraction. I gather this to be due to the default values that estimate the methylation to be too lenient. This is what explained in the first comment. Is it that only e.g. a fraction of 0.8 or above is considered a high probability of methylated site? is there some kind of considered limit that gives an answer to the question: Is this with a high likelyhood a methylated position or not.

I attached some files of how I limit and how o get the percentages. These are for the PCR products meaning if the default parameters that tombo has are too lenient more positions are percieved as methylated which in this case should be zero.

Screenshot 2019-09-09 at 20 42 34 Screenshot 2019-09-09 at 20 43 00
marcus1487 commented 4 years ago

These results certainly suggest a high error rate for 6mA detection. The all-context 6mA model generally has much higher error rate than the provided dam model. A more in depth option might be to train your own motif-specific model if this is applicable to your analysis.

It is possible that the log likelihood thresholds could be adjusted in order to improve your results. You can use the tombo plot per_read_roc command to visualize the distribution of log likelihood ratios (see example at the bottom of the page here). Unfortunately this function does not allow the control sample feature as in the tombo plot roc command, but should help determine if the log likelihood thresholds could be improved. This tombo plot roc command can also be used in order to compare the aggregated results to determine if there is good separation between your two samples given the higher false positive rate. Hopefully these tools will help to improve your results.

marcus1487 commented 4 years ago

Ah, I take back the non-existent feature for per-read roc curves. The command that may help you is the tombo plot sample_compare_per_read_roc command. Passing this command the per-read statistic files for your control and modified samples will produce a plot (on the last page of the resulting PDF) which shows the log likelihood distributions at a ground truth motif (which could just be an A; pass A:1:A to the tombo plot sample_compare_per_read_roc --motif-descriptions argument) for the two samples.

marcus1487 commented 4 years ago

The tombo plot sample_compare_per_read_roc command limits included statistics to only those positions found within a specific sequence motif. This is intended to be used in bacterial methylation comparisons between native and PCR'ed samples. If you would like to include all statistics computed at any A base for 6mA methylation detection, then the specified motif can be just a simple A motif as described.

Sebidoo964 commented 4 years ago

I tried using as you mentioned above [ --motif-description A:1:"met"\A:1:"control"\ ] but I get an error saying **** ERROR **** Invalid motif decriptions format. Format descriptions as: "motif:mod_pos:name[::motif2:mod_pos2:name2...]".

Maybe I missunderstood how to use it. I tried using the documentation as a guide, but apparently it doesn't work

marcus1487 commented 4 years ago

The option should look as follows: --motif-descriptions A:1:A

The --per-read-statistics-filenames and --per-read-control-statistics-filenames options define the control and sample statistics for this command. You can run any tombo command with the -h flag in order to see these and more detailed explanation of the parameters on the command line.

Sebidoo964 commented 4 years ago

So the AUC was 0.6515 and mean AP 0.7793. I am guessing these results are not that great.

Screenshot 2019-09-09 at 22 16 13 Screenshot 2019-09-09 at 22 16 20 Screenshot 2019-09-09 at 22 16 26
marcus1487 commented 4 years ago

I would agree with your analysis. This analysis pipeline has resulted in relatively poor results for 6mA detection. Changing the log likelihood thresholds may help to an extent, but this is likely not going to drastically change the results. This is likely due to the all-context model estimation procedure. If you can create a motif-specific ground truth data set, these results are likely to improve greatly.

For all-context 6mA detection, these are likely the best results currently, but we are working hard to improve these results with alternative methods. Specifically, improvements to the taiyaki and megalodon software should help address this specific application. I will try to post an update here once these improvements have been tested and released.

marcus1487 commented 4 years ago

Looking at the precision/recall curve once more, it does appear that a small fraction of sites ~5% do show strong separation from the control sample, while the remainder appear quite indistinguishable from the control sample (the linear portion of the precision/recall curve). In your "modified" sample, do you expect every reference A to be modified or only a subset. If it is the former, then this may be a matter of performing analysis only on those ground truth modified sites. Thus shifting the log likelihood thresholds may still hold some promise. You may try moving the threshold to say 4 and re-try your analysis (or use the tombo detect_modifications aggregate_per_read_stats command to re-apply thresholds to your per-read statistic files).

Sebidoo964 commented 4 years ago

The enzyme used should randomly add methylgroups to A and in theory in vitro up to 100%

marcus1487 commented 4 years ago

This is actually another special case that may be effecting your results. Modifying all of a particular base to another base has more global effects on both the signal normalization parameter estimation and the re-squiggle results. While these results are quite poor, the identification of 6mA within lower density modification reads should be better than this.

The problem is that modifying a particular base effects several positions of signal. Thus when most all of a particular base is replaced, essentially every expected signal levels across a read will change. This makes the re-squiggle task very hard (and introduces many errors) which effects all downstream results. Again we are working hard to improve this type of analysis, but this is a limitation of the current tombo implementation.

Sebidoo964 commented 4 years ago

I am trying changing the threshold but I do understand now that it is limited for the kind of verification I am looking for. For the aggregat _per_read_stats do I have to specify the --single-read-threshold as [4,0] or can I just write 4?

marcus1487 commented 4 years ago

Setting to 4 is equivalent to -4 4 which is probably a good threshold here. Here the first value indicates that any log likelihood ratio lower is considered modified while the second value indicates that any log likelihood ratio higher is considered canonical. I chose 4 by looking at your above per-read statistic plot.

marcus1487 commented 4 years ago

Was setting this value able to resolve this issue sufficiently?

Sebidoo964 commented 4 years ago

No real difference in significance could be observed. The model could only be used to differentiate the methylation between different samples at a global level, if one sample had more or less methylation, but due to the high error rate of the negative controls the estimation on a positional level could not be made. Since I used over methylated and in theory 0 methylation samples it seems to be difficult to even distinguish native methylation from the negative controls due to the high estimation of the negative controls.