Closed JiahuiWangGit closed 5 years ago
The Tombo python API has functionality to access the per-read statistics files (see docs here). Using this template the stats could be output into a suitable text format.
Note that the per-read statistics are reported are statistical test values (depending on the test applied; see here) and should not be interpreted as percent methylation. The ground truth is that on a per-read basis each base is either modified or not (there is no real sense of per-read percent methylation; just confidence the site is methylated or not).
Hi, I was able to extract the per-read stats into a text format like this 9905155 7.315355351035061 b'fbbba58b-bda5-4b9d-8035-e2759a572206'
using command
sample_per_read_stats = tombo_stats.PerReadStats( 'mathylation_testing.5mC.tombo.per_read_stats'); reg_per_read_stats = sample_per_read_stats.get_region_per_read_stats( reg_data)
using the output from
tombo detect_modifications alternative_model --fast5-basedirs $path_to_fast5s --statistics-file-basename mathylation_testing --per-read-statistics-basename mathylation_testing --alternate-bases 5mC --processes 20,
But when I compare the roc curve plotted using mathylation_testing.5mC.tombo.stats file
tombo plot roc --statistics-filenames mathylation_testing.5mC.tombo.stats --modified-locations "Bisulfite Ground Truth":bisulfite.ENCFF835NTC.example.mod.bed --unmodified-locations bisulfite.ENCFF835NTC.example.unmod.bed --genome-fasta $genome_fasta
and the roc curve plotted using the extracted per-read stats text file (using my own script, with cutoff=2.5), the AUC was quite different (0.75 vs 0.29).
I am wondering what could be the problem here?
There are many issues that may be contributing to this as I don't know exactly how your script is working, but I will cover a couple of possibilities here.
The first is that you are extracting per-read statistics, but running the tombo plot roc
command on the aggregated statistics file. The per-read and aggregated (over reads) performance would be quite different. For ROC curve output using per-read statistics see the tombo plot per_read_roc
command.
Given that the computed AUC is below 0.5 (random guessing) I suspect that a number of other issues may be at play here. First note that negative per-read scores indicate modified bases and positive scores indicate a canonical base. Second note that the default cutoff of 2.5 is a two-way threshold, so only scores below -2.5 or above 2.5 contribute to the aggregated fraction modified score.
For more details on the ROC computation see the roc computation function here.
Hi,
Is there any way to extract the per read results (detect_modifications command via the --per-read-statistics-filename option) into a text file? I would like to check the methylation percentage per read.
Thanks, Jiahui