nanoporetech / tombo

Tombo is a suite of tools primarily for the identification of modified nucleotides from raw nanopore sequencing data.
Other
230 stars 54 forks source link

How to extract per read stats into a text format? #151

Closed JiahuiWangGit closed 5 years ago

JiahuiWangGit commented 5 years ago

Hi,

Is there any way to extract the per read results (detect_modifications command via the --per-read-statistics-filename option) into a text file? I would like to check the methylation percentage per read.

Thanks, Jiahui

marcus1487 commented 5 years ago

The Tombo python API has functionality to access the per-read statistics files (see docs here). Using this template the stats could be output into a suitable text format.

Note that the per-read statistics are reported are statistical test values (depending on the test applied; see here) and should not be interpreted as percent methylation. The ground truth is that on a per-read basis each base is either modified or not (there is no real sense of per-read percent methylation; just confidence the site is methylated or not).

JiahuiWangGit commented 5 years ago

Hi, I was able to extract the per-read stats into a text format like this 9905155 7.315355351035061 b'fbbba58b-bda5-4b9d-8035-e2759a572206' using command sample_per_read_stats = tombo_stats.PerReadStats( 'mathylation_testing.5mC.tombo.per_read_stats'); reg_per_read_stats = sample_per_read_stats.get_region_per_read_stats( reg_data) using the output from tombo detect_modifications alternative_model --fast5-basedirs $path_to_fast5s --statistics-file-basename mathylation_testing --per-read-statistics-basename mathylation_testing --alternate-bases 5mC --processes 20, But when I compare the roc curve plotted using mathylation_testing.5mC.tombo.stats file tombo plot roc --statistics-filenames mathylation_testing.5mC.tombo.stats --modified-locations "Bisulfite Ground Truth":bisulfite.ENCFF835NTC.example.mod.bed --unmodified-locations bisulfite.ENCFF835NTC.example.unmod.bed --genome-fasta $genome_fasta and the roc curve plotted using the extracted per-read stats text file (using my own script, with cutoff=2.5), the AUC was quite different (0.75 vs 0.29). I am wondering what could be the problem here?

marcus1487 commented 5 years ago

There are many issues that may be contributing to this as I don't know exactly how your script is working, but I will cover a couple of possibilities here.

The first is that you are extracting per-read statistics, but running the tombo plot roc command on the aggregated statistics file. The per-read and aggregated (over reads) performance would be quite different. For ROC curve output using per-read statistics see the tombo plot per_read_roc command.

Given that the computed AUC is below 0.5 (random guessing) I suspect that a number of other issues may be at play here. First note that negative per-read scores indicate modified bases and positive scores indicate a canonical base. Second note that the default cutoff of 2.5 is a two-way threshold, so only scores below -2.5 or above 2.5 contribute to the aggregated fraction modified score.

For more details on the ROC computation see the roc computation function here.