nanoporetech / tombo

Tombo is a suite of tools primarily for the identification of modified nucleotides from raw nanopore sequencing data.

Levelstats statistics file raw data extract to csv #359

Open bhargava-morampalli opened 3 years ago

bhargava-morampalli commented 3 years ago

I ran the following command on my direct RNA sequencing data:

tombo detect_modifications level_sample_compare \
   --fast5-basedirs /native/singlefast5/ \
   --alternate-fast5-basedirs /ivt/singlefast5 \
   --statistics-file-basename level_testing_strain \
   --store-p-value \
   --statistic-type ks --processes 30

I want to extract the data from the resulting stats file, and used the Tombo API as follows:

from tombo import tombo_stats
import pandas as pd

# Load the statistics file written by level_sample_compare
sample_level_stats = tombo_stats.LevelStats('/data/level_testing_strain.tombo.stats')
# Fetch per-position statistics for the requested region of chrm on the + strand
reg_level_stats = sample_level_stats.get_reg_stats('chrm', '+', 1, 1525)
pd.DataFrame(reg_level_stats).to_csv("/results/tombotest.csv")

The resulting CSV looks like this:

,stat,pos,cov,control_cov
0,2.6928592163926735e-28,2,219,456
1,1.1185329170881968e-21,3,226,463
2,4.624989306606759e-18,4,261,529
3,1.7881359403179843e-25,5,306,533
4,2.540133370261695e-69,6,880,567
5,9.020681930756034e-76,7,1391,574
6,2.3636818898014833e-85,8,1754,578
7,1.1817788672225994e-58,9,2731,582
8,4.566511057754994e-49,10,3743,586

I assume the first column is just the index.

How do I interpret the statistic in the second column? Does a value closer to one indicate a modified position (my guess), or is it a p-value, where the most significant positions are those below 0.05?

The third column is the position of the nucleotide in the reference, and the fourth and fifth columns are the coverage for the sample and the control.

Are these the correct steps for extracting statistics from the output of the level_sample_compare command?

SycamoreLeaf commented 3 years ago

The first column is the DataFrame index, which the to_csv method writes by default. Pass index=False to omit it. The second column is the p-value from a Kolmogorov-Smirnov test comparing two populations of current levels, one from the sample and one from the control. The p-value is lower the more the sample and control distributions differ, so small values, not values close to one, point to candidate modified positions.
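
A minimal sketch of both points, runnable without Tombo installed. The array below is a stand-in for what get_reg_stats returned in the question: the field names and values are copied from the CSV above, while the dtypes, the ALPHA threshold, and the in-memory buffer (used here in place of a file path) are assumptions for illustration only.

```python
import io

import numpy as np
import pandas as pd

# Stand-in for the array returned by get_reg_stats(). Field names and values
# come from the CSV in the question; the dtypes are assumed, not taken from Tombo.
reg_level_stats = np.array(
    [
        (2.6928592163926735e-28, 2, 219, 456),
        (1.1185329170881968e-21, 3, 226, 463),
        (4.624989306606759e-18, 4, 261, 529),
    ],
    dtype=[("stat", "f8"), ("pos", "u4"), ("cov", "u4"), ("control_cov", "u4")],
)

df = pd.DataFrame(reg_level_stats)

# index=False omits the unnamed leading index column from the CSV.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Smaller KS p-values mean the sample and control level distributions differ more.
ALPHA = 0.05  # hypothetical cutoff, chosen only for this example
significant = df[df["stat"] < ALPHA]
```

With a real stats file, the same to_csv call would take the output path instead of the buffer, and the cutoff would need to be chosen to suit the analysis.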