tombo text_output browser_files for denovo modification

LeilyR commented 4 years ago

Hi,

I have used the statistic file type for de novo detected modifications. I saw in the documentation that you used it for level_sample_compare, but I could not find an example for a de novo statistics file. Am I right that the values reported in this wig file are the p-values of the statistical test used for de novo detection at each position? If so, does it make sense to filter out positions with p-values larger than, say, 0.05? For example, these ones:
5 0.2500
6 0.4000
7 0.4000
8 0.3333
9 0.3333
10 0.2500
11 0.5000
12 0.5000
Also, about the reverse strands: are they irrelevant for direct RNA?
Could you also please direct me to where I can find information about how the signal is computed for the signal file type in text_output browser_files? Thank you very much!

marcus1487 commented 4 years ago

1) The statistic output type (from the tombo text_output browser_files command) should not be allowed for a de novo statistics file. There is no p-value associated with an aggregated de novo statistics file. Statistics files generated from detection methods which produce per-read statistics are aggregated to produce fraction and coverage values. Level comparison methods on the other hand produce p-values and effect size statistics (depending on the type of test employed) which can be output via the statistics output.

Are you sure that the statistics file was from a tombo detect_modifications de_novo statistics file? Could you post the exact commands used to produce this statistics output file?

2) This depends on your sample of interest. Many biological samples produce valid direct RNA reads mapping to the reverse strand. If you determine that reverse-strand mapping reads are irrelevant for your research, then they can indeed be ignored.

3) The signal output is the mean over reads of the normalized signal values assigned to that base. See docs on this output here and the re-squiggle algorithm to assign signal to reference bases here.
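To illustrate the "mean over reads" description above, here is a minimal sketch in plain Python. It does not use Tombo's API: the function name `mean_signal_per_base` and the input layout (one dict per read, mapping reference position to a normalized signal level) are assumptions made only for this example.

```python
from collections import defaultdict
from statistics import mean


def mean_signal_per_base(read_levels):
    """Average normalized signal levels over reads, per reference position.

    read_levels: iterable of dicts, one per read, mapping a reference
    position to the normalized signal level assigned to that base.
    (Hypothetical input layout, not a Tombo data structure.)
    """
    by_pos = defaultdict(list)
    for levels in read_levels:
        for pos, level in levels.items():
            by_pos[pos].append(level)
    # One mean value per position, computed only over reads covering it.
    return {pos: mean(vals) for pos, vals in sorted(by_pos.items())}


# Two toy reads; the second read covers one extra position.
reads = [
    {10: 0.5, 11: -1.0},
    {10: 1.5, 11: -2.0, 12: 0.25},
]
result = mean_signal_per_base(reads)
print(result)
```

Positions covered by a single read simply report that read's value, so low-coverage positions are noisier than well-covered ones.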

LeilyR commented 4 years ago

Hi Marcus,

thanks a lot for the answers.

Yes, I am sure it was from de novo detection. Here is what I ran with Tombo version 1.5.1:

tombo detect_modifications de_novo --fast5-basedirs fast5 --statistics-file-basename denovo --rna --fishers-method-context 2 --minimum-test-reads 1 --per-read-statistics-basename perRead_denovo --num-most-significant-stored 20000 --processes 30

tombo text_output browser_files --fast5-basedirs fast5/ --statistics-filename denovo.tombo.stats --genome-fasta transcripts.fa --file-types statistic

and the output is the one I have already sent you in my previous comment.

So, if I understood correctly, there are no p-values to check for the significance of de novo or alternative models, right? What would be your suggestion? I am getting quite a lot of modified bases with fraction = 1. I have seen your recommendation to use dampened fractions, but I am not sure I understand how to filter them for the most significant positions. Do I just pick an arbitrary threshold and keep the bases with a dampened fraction above it? Thanks a lot again! Cheers, Leily

marcus1487 commented 4 years ago

Yes that is correct. There are no p-values to check for the per-reference site statistics files for de novo or alternative models. The p-values are computed on a per-read level, then these values are converted to binary modified or not values and a fraction is reported in the per-reference site statistics file.
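As a rough illustration of the aggregation described above, the sketch below converts per-read p-values at one site into binary calls and reports the fraction called modified. The function name and the single 0.05 cutoff are assumptions for the example; they are not Tombo's actual thresholds or code.

```python
def fraction_modified(per_read_pvalues, threshold=0.05):
    """Fraction of reads called modified at one reference site.

    Each per-read p-value is turned into a binary call (modified if it
    is below `threshold`); the site-level value is the mean of the calls.
    The threshold here is a hypothetical placeholder.
    """
    if not per_read_pvalues:
        raise ValueError("need at least one read at the site")
    calls = [p < threshold for p in per_read_pvalues]
    return sum(calls) / len(calls)


# Four reads at one site: two fall below the cutoff.
site_pvalues = [0.001, 0.2, 0.03, 0.6]
frac = fraction_modified(site_pvalues)

# A site covered by a single read can only yield 0.0 or 1.0.
single = fraction_modified([0.01])
print(frac, single)
```

The second call shows why very low coverage (for example with `--minimum-test-reads 1`, as in the command above) tends to produce many sites with a fraction of exactly 1.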

I took a look and there was a bug in the way the browser_files command checked the requested outputs. The statistic output should not have been allowed given a de novo statistics file. What was actually output is the dampened_fraction output. I have pushed a fix to the GitHub repo with this logic corrected.
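For intuition about what a dampened fraction looks like at low coverage, here is a small sketch. It assumes the dampening works by adding pseudo-counts to the unmodified and modified read counts, and it assumes default pseudo-counts of 2 unmodified and 0 modified reads; both should be checked against the Tombo documentation. The read counts below are hypothetical and are not taken from the data in this thread.

```python
def dampened_fraction(n_mod, n_unmod, unmod_pseudo=2, mod_pseudo=0):
    """Fraction modified after adding pseudo-counts to both read counts.

    The pseudo-count defaults (2 unmodified, 0 modified) are an
    assumption for this sketch, not a confirmed Tombo setting.
    """
    total = n_mod + n_unmod + unmod_pseudo + mod_pseudo
    if total == 0:
        raise ValueError("no reads and no pseudo-counts")
    return (n_mod + mod_pseudo) / total


# Hypothetical (modified, unmodified) read counts at low-coverage sites.
examples = [(1, 0), (2, 0), (1, 1), (2, 1)]
values = [round(dampened_fraction(m, u), 4) for m, u in examples]
print(values)
```

Under these assumptions a site with a raw fraction of 1 from one or two reads is pulled well below 1, which is why values such as 0.3333 or 0.5 are plausible for low-coverage sites and should not be read as p-values.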

For the more central issue of high false positives in RNA, this is a known issue. RNA signal is quite a bit trickier to model with a k-mer model, leading to a high false positive rate (lots of fraction modified = 1) for RNA de novo detection. The level_sample_compare method has shown good results for RNA modified base detection, but requires a control sample (e.g. IVT). We are working hard to provide alternative methods for modified RNA detection, but these will likely not be k-mer based due to these limitations.

LeilyR commented 4 years ago

Great, that is very helpful! Would you still recommend setting a threshold on the dampened fraction, or should I rather consider some integrative analysis (using other data or tools, such as motifs, to confirm the modification) to filter out some of those 1s from the fraction file?

marcus1487 commented 4 years ago

I would recommend some integrative analyses with a mind toward the potential high false positive rate. In particular many of the sites may be systematic bias due to the fact that the k-mer model does not adequately capture the variation in RNA signal. For example an enriched motif may be a modified base at that motif or a systematic error at that k-mer in a particular sample. There may be some value in using the de novo model for direct RNA modified base analysis, but these conclusions should certainly be verified by other means.
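One simple form such an integrative check could take is sketched below: read per-site values from a wiggle file and keep only sites that also appear in an independent set of candidate sites. This assumes the wig file uses the variableStep layout; the function names, the 0.4 cutoff, and the independent site list are all hypothetical, and passing this filter does not by itself validate a site.

```python
def parse_variable_step_wig(lines):
    """Parse variableStep wiggle lines into {(chrom, pos): value}."""
    sites = {}
    chrom = None
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("track", "browser", "#")):
            continue  # skip header and comment lines
        if line.startswith("fixedStep"):
            raise ValueError("this sketch only handles variableStep")
        if line.startswith("variableStep"):
            fields = dict(f.split("=", 1) for f in line.split()[1:])
            chrom = fields["chrom"]
            continue
        pos, value = line.split()
        sites[(chrom, int(pos))] = float(value)
    return sites


def supported_sites(sites, min_fraction, independent_sites):
    """Sites at or above the cutoff that also have independent support."""
    return sorted(
        key for key, value in sites.items()
        if value >= min_fraction and key in independent_sites
    )


wig_lines = [
    'track type=wiggle_0 name="demo"',
    "variableStep chrom=tx1 span=1",
    "5 0.2500",
    "6 0.4000",
    "11 0.5000",
]
sites = parse_variable_step_wig(wig_lines)
# Hypothetical sites supported by another assay or tool.
independent = {("tx1", 11), ("tx1", 7)}
kept = supported_sites(sites, 0.4, independent)
print(kept)
```

Because a systematic k-mer error can also produce an enriched motif, the independent evidence ideally comes from a different technology or a control sample rather than from the same signal data.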

LeilyR commented 4 years ago

Thanks a million!

nanoporetech / tombo

tombo text_output browser_files for denovo modification #256