novoalab / EpiNano

Detection of RNA modifications from Oxford Nanopore direct RNA sequencing reads (Liu*, Begik* et al., Nature Comm 2019)

GNU General Public License v2.0


explanation on output #51

Closed DelphIONe closed 4 years ago

DelphIONe commented 4 years ago

Hi,

Sorry, but I am trying to use EpiNano and I cannot understand what the .tsv.per.site.var.csv and .tsv.per.site.var.per_site_var.5mer.csv files are (the output of your TSV_to_Variants_Freq.py3 script). One is organized by k-mer and the other by position, but is it per read? How can I get a coverage greater than 1? Where are the complete results, the real "per site" results?

Can you explain, please? I have followed your wiki, but it's not enough. You advise using a control sample and a modified sample for your Epinano-Error pipeline, but how?

Thanks for your help,

Huanle commented 4 years ago

Hi @DelphIONe, The per.site.var.csv file contains the quality scores and the mismatch and indel frequencies for each reference position (site). The per_site_var.5mer.csv file simply reorganizes the information in the former file in k-mer style. We generate the *per_site_var.5mer.csv file (or one for any other k-mer size) because the nanopore accommodates 5 to 6 nucleotides at a time, which together affect the electrical signal during sequencing. In other words, these 5 bases are not independent of each other and are better handled together.

Since the output contains coverage information, you could use any threshold and filter it accordingly.
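The coverage filtering mentioned above could be sketched as follows. This is only an illustration, not part of EpiNano: the function name `filter_by_coverage`, the threshold of 30, and the example rows are made up, and the header is assumed to match the column names quoted later in this thread.

```python
import csv
import io

def filter_by_coverage(lines, min_cov, cov_column="cov"):
    """Yield rows (as dicts) whose coverage is at least min_cov.

    `lines` is any iterable of CSV lines, e.g. an open file handle.
    The column name "cov" is assumed from the header quoted in this thread.
    """
    reader = csv.DictReader(lines)
    for row in reader:
        if float(row[cov_column]) >= min_cov:
            yield row

# Made-up rows, only to show the mechanics.
example = io.StringIO(
    "Ref,pos,base,cov,q_mean,q_median,q_std,mis,ins,del\n"
    "chr1,10,A,3,12.0,12.0,1.0,0.0,0.0,0.0\n"
    "chr1,11,C,45,10.5,11.0,2.0,0.1,0.0,0.05\n"
    "chr1,12,G,30,9.0,9.0,1.5,0.2,0.1,0.0\n"
)
kept = list(filter_by_coverage(example, min_cov=30))
print([row["pos"] for row in kept])  # ['11', '12']
```

Passing an open file handle instead of the `io.StringIO` object would apply the same filter to a real per-site table, one row at a time.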

The Epinano-Error method will be released within a couple of days with Epinano1.2. There will be example data and commands to show its usage.

Thanks for using Epinano and I hope the above explanation is helpful. Please let me know if you think I can help more. Best, Huanle

snajder-r commented 4 years ago

I have a follow-up question: Is it correct to use the per.site.var.csv file when trying to do predictions with the model2.1-mis3.del3.q3.poly.dump model? It seems this model uses the mismatch frequency, deletion frequency, and mean quality of the center (the third) base of the five-mer. That's the same as the features in the per.site.var.csv file, right?

What is the correct way to provide the columns? SVM.py requires me to give the column numbers for the three features. In the csv file, the column names are:

Ref,pos,base,cov,q_mean,q_median,q_std,mis,ins,del

Since the model lists the features in the order "mis, del, q" would the correct way to call SVM.py be with "-c 8,10,5"?

I've tried using the model with the 5-mer features, but unfortunately the feature generating script for 5-mers doesn't work as it has an extremely high memory requirement (with only about 200k reads, 48GB of RAM is still not enough and it runs out of memory).

Huanle commented 4 years ago

Hi @DelphIONe,

You should organize your feature table in 5-mers if you want to use the pre-trained model. slide_per_site_var.py should do the job for you. What's the size of your input file? As for the -cl argument, you should choose the column numbers (1-based) in the slid feature table that correspond to the features used for training.

I would not expect any RAM-associated problem with this specific operation, given that the script stores and processes only k (user-defined) lines at any time. If you do not mind, can you send me your input file so that I can test it?
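To illustrate why keeping only k lines at a time bounds memory, here is a minimal sliding-window sketch. It is not the actual slide_per_site_var.py; the function name `slide_kmers`, the `(ref, pos, base)` row layout, and the rule of resetting the window at gaps are assumptions made for the example.

```python
from collections import deque

def slide_kmers(rows, k=5):
    """Yield (kmer, center_position) for each run of k consecutive sites.

    `rows` is an iterable of (ref, pos, base) tuples in reference order.
    Only the last k rows are ever held in memory.
    """
    window = deque(maxlen=k)
    for ref, pos, base in rows:
        # Assumption: restart the window when the reference changes
        # or a position is missing, so k-mers never span a gap.
        if window and (window[-1][0] != ref or window[-1][1] + 1 != pos):
            window.clear()
        window.append((ref, pos, base))
        if len(window) == k:
            kmer = "".join(b for _, _, b in window)
            center = window[k // 2][1]
            yield kmer, center

# Made-up sites: positions 1-6 are contiguous, position 8 follows a gap.
sites = [
    ("chr1", 1, "A"), ("chr1", 2, "C"), ("chr1", 3, "G"),
    ("chr1", 4, "T"), ("chr1", 5, "A"), ("chr1", 6, "C"),
    ("chr1", 8, "G"),
]
kmers = list(slide_kmers(sites))
print(kmers)  # [('ACGTA', 3), ('CGTAC', 4)]
```

Because the generator consumes its input lazily, memory use depends on k rather than on the number of sites in the file.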

Huanle commented 4 years ago

Hi @DelphIONe,

It is OK to use -cl 5,8,10 if you use the non-slid feature table as input. Sorry, I mistook your purpose and thought you would like to use features from consecutive sites.
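The 1-based column numbers discussed in this thread can be checked against the header quoted in the question. The helper `one_based_columns` below is hypothetical and only does the index lookup; the thread does not settle whether the order of the numbers passed to SVM.py matters, so the sketch says nothing about that.

```python
# Header of the per-site table, as quoted earlier in this thread.
header = "Ref,pos,base,cov,q_mean,q_median,q_std,mis,ins,del"

def one_based_columns(header_line, names):
    """Return the 1-based column number of each requested column name."""
    columns = header_line.strip().split(",")
    return [columns.index(name) + 1 for name in names]

indices = one_based_columns(header, ["q_mean", "mis", "del"])
print(",".join(str(i) for i in indices))  # 5,8,10
```

The same lookup on a slid 5-mer table would give different numbers, since that table has its own header.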