novoalab / EpiNano

Detection of RNA modifications from Oxford Nanopore direct RNA sequencing reads (Liu*, Begik* et al., Nature Comm 2019)
GNU General Public License v2.0
109 stars 31 forks source link

Epinano predict all bases modify #83

Closed pabloacera closed 3 years ago

pabloacera commented 3 years ago

Hi, I am running Epinano in a dataset with only 1 sample and the results shows that all bases are modify which is strange (almost 1M bases). I followed the instructions and generated a file human_guppy3.1.5.MD.plus_strand.per.site.csv with the header:

#Ref,pos,base,strand,cov,q_mean,q_median,q_std,mis,ins,del
ENST00000054650.8|ENSG00000041988.15|OTTHUMG00000001440|OTTHUMT00000004203.1|THAP3-201|THAP3|1366|protein_coding|,126,A,+,1,31.00000,31.00000,0.00000,0.00000,0.00000,0.00000
ENST00000054650.8|ENSG00000041988.15|OTTHUMG00000001440|OTTHUMT00000004203.1|THAP3-201|THAP3|1366|protein_coding|,127,T,+,1,17.00000,17.00000,0.00000,0.00000,0.00000,0.00000
ENST00000054650.8|ENSG00000041988.15|OTTHUMG00000001440|OTTHUMT00000004203.1|THAP3-201|THAP3|1366|protein_coding|,128,C,+,1,3.00000,3.00000,0.00000,0.00000,0.00000,0.00000
ENST00000054650.8|ENSG00000041988.15|OTTHUMG00000001440|OTTHUMT00000004203.1|THAP3-201|THAP3|1366|protein_coding|,129,T,+,1,4.00000,4.00000,0.00000,0.00000,0.00000,0.00000
ENST00000054650.8|ENSG00000041988.15|OTTHUMG00000001440|OTTHUMT00000004203.1|THAP3-201|THAP3|1366|protein_coding|,130,T,+,1,6.00000,6.00000,0.00000,1.00000,0.00000,0.00000
ENST00000054650.8|ENSG00000041988.15|OTTHUMG00000001440|OTTHUMT00000004203.1|THAP3-201|THAP3|1366|protein_coding|,131,T,+,1,17.00000,17.00000,0.00000,0.00000,0.00000,0.00000
ENST00000054650.8|ENSG00000041988.15|OTTHUMG00000001440|OTTHUMT00000004203.1|THAP3-201|THAP3|1366|protein_coding|,132,G,+,1,19.00000,19.00000,0.00000,0.00000,0.00000,0.00000
ENST00000054650.8|ENSG00000041988.15|OTTHUMG00000001440|OTTHUMT00000004203.1|THAP3-201|THAP3|1366|protein_coding|,133,T,+,1,7.00000,7.00000,0.00000,0.00000,0.00000,0.00000
ENST00000054650.8|ENSG00000041988.15|OTTHUMG00000001440|OTTHUMT00000004203.1|THAP3-201|THAP3|1366|protein_coding|,134,T,+,1,7.00000,7.00000,0.00000,0.00000,0.00000,0.00000

Then I ran Epinano predict as I only have 1 replicate (without the KO) $ python Epinano_Predict.py --model ./models/rrach.q3.mis3.del3.linear.dump --predict data/human_guppy3.1.5.MD.plus_strand.per.site.csv --columns 6,9,11 --out_prefix data/human_mod_predictiiton

The header output is

#Ref,pos,base,strand,cov,q_mean,q_median,q_std,mis,ins,del,prediction,dist,ProbM,ProbU
ENST00000054650.8|ENSG00000041988.15|OTTHUMG00000001440|OTTHUMT00000004203.1|THAP3-201|THAP3|1366|protein_coding|,126,A,+,1,31.0,31.0,0.0,0.0,0.0,0.0,mod,-6.878639146768847,0.9999490349710862,5.096502891392184e-05
ENST00000054650.8|ENSG00000041988.15|OTTHUMG00000001440|OTTHUMT00000004203.1|THAP3-201|THAP3|1366|protein_coding|,127,T,+,1,17.0,17.0,0.0,0.0,0.0,0.0,mod,-6.072346132310454,0.9998352794856769,0.00016472051432329804
ENST00000054650.8|ENSG00000041988.15|OTTHUMG00000001440|OTTHUMT00000004203.1|THAP3-201|THAP3|1366|protein_coding|,128,C,+,1,3.0,3.0,0.0,0.0,0.0,0.0,mod,-5.266053117852066,0.9994676189336538,0.0005323810663461272
ENST00000054650.8|ENSG00000041988.15|OTTHUMG00000001440|OTTHUMT00000004203.1|THAP3-201|THAP3|1366|protein_coding|,129,T,+,1,4.0,4.0,0.0,0.0,0.0,0.0,mod,-5.323645476027666,0.9995104110674833,0.0004895889325167933
ENST00000054650.8|ENSG00000041988.15|OTTHUMG00000001440|OTTHUMT00000004203.1|THAP3-201|THAP3|1366|protein_coding|,130,T,+,1,6.0,6.0,0.0,1.0,0.0,0.0,mod,-5.126624389325061,0.9993478857937979,0.0006521142062020863
ENST00000054650.8|ENSG00000041988.15|OTTHUMG00000001440|OTTHUMT00000004203.1|THAP3-201|THAP3|1366|protein_coding|,131,T,+,1,17.0,17.0,0.0,0.0,0.0,0.0,mod,-6.072346132310454,0.9998352794856769,0.00016472051432329804
ENST00000054650.8|ENSG00000041988.15|OTTHUMG00000001440|OTTHUMT00000004203.1|THAP3-201|THAP3|1366|protein_coding|,132,G,+,1,19.0,19.0,0.0,0.0,0.0,0.0,mod,-6.187530848661655,0.9998606953874444,0.00013930461255558826
ENST00000054650.8|ENSG00000041988.15|OTTHUMG00000001440|OTTHUMT00000004203.1|THAP3-201|THAP3|1366|protein_coding|,133,T,+,1,7.0,7.0,0.0,0.0,0.0,0.0,mod,-5.496422550554464,0.9996192337186937,0.00038076628130626793
ENST00000054650.8|ENSG00000041988.15|OTTHUMG00000001440|OTTHUMT00000004203.1|THAP3-201|THAP3|1366|protein_coding|,134,T,+,1,7.0,7.0,0.0,0.0,0.0,0.0,mod,-5.496422550554464,0.9996192337186937,0.00038076628130626793

And it continues being all modify. Please, is there anything I am missing? Anything will be helpful. Cheers.

Huanle commented 3 years ago

Hi @pabloacera ,

From what I can see, you were making predictions with all sites. But the model was trained on m6A motifs. Another thing is that the coverages also seem to be quite low. Can you filter your data/results on coverage (e.g., >=30) and motifs (i.e., the As from RRACH) and see if the results look more reasonable? It would be nice to have replicates and WT-KO contrasts, in order to remove false positives.

Stakaitis commented 3 years ago

Hi @Huanle,

I'm having a similar problem, except all of the RRACH motifs are unmodified. However, I did not filter the 5mer.csv file generated with the Slide_Variants.py for coverage (e.g., >=30) since the coverage is listed for each nucleotide (e.g. 37:37:37:37:37). How can I filter it if the coverage is not presented as one number?

Thank You, Rytis

enovoa commented 3 years ago

Hi @pabloacera , as @Huanle mentioned, we do not recommend to run EpiNano on single samples, despite being possible, it will be full of false positives. Additionally, when running the SVM model, only RRACH k-mers should be considered, as those are the ones that the model was trained for, so you need to subset for those. So in short, when using EpiNano-SVM you should filter the predictions by those that are not present in the KO/condition2. In addition, you should only consider RRACH sites in your analysis. Thanks!

Huanle commented 3 years ago

Hi @Stakaitis , Can you let me know more details about what you did, esp. are you running with pre-trained models? IF so, which model did you use? What is the command?

Regarding filtering on coverage/depth, if there is no : in your reference IDs, you can try command: awk -F':' '$3>29' your_prediciton.csv > your_prediction.filt.csv You can also try making all consecutive positions with a coverage >=30: perl -ne '@a = split /,/,$_; @b = split /:/, $a[4]; print $_ if ($b[0]>29 && $b[1]>29 && $b[2]>29 && $b[3]>29 && $b[4]>29 || /Ref/);' your_predicitons.csv > your_prediction.filt.csv

Hope this helps. Let me know if you need further help.

Stakaitis commented 3 years ago

@Huanle,

Thank You for the help - now I'm getting both "unm" and "mod" predictions from the _EpinanoPredict.py. Yes, I'm running with a pre-trained model "rrach.q3.mis3.del3.linear.dump"

My goal is to distinguish m6A modifications in two different cell lines (both WT condition). I want to see both differences and similarities in modification occurrence. At this stage I'm interested whether I can extract m6A modifications or not. If I understand correctly EpiNano is the right tool for that, even though, there could be high error rate in the predictions as @enovoa pointed out.

Here is the file of the commands I've used (maybe it will be useful to others whom also don't have much experiance in command line) Epinano_modification_prediction_from_model.txt

Huanle commented 3 years ago

Hi @Stakaitis ,

thanks for sharing with me your commands. they look correct to me. In the test data folder, there are two run.sh files, which you can refer to in case you will have more, esp. unmodified contrasting samples, in the future.