novoalab / EpiNano

Detection of RNA modifications from Oxford Nanopore direct RNA sequencing reads (Liu*, Begik* et al., Nature Comm 2019)
GNU General Public License v2.0
108 stars 31 forks source link

The accuracy of Epinano_Predict.py is 50% #58

Closed emanlee closed 3 years ago

emanlee commented 3 years ago

Hi, there

There might be something wrong with our steps. We appreciate your help!

We got a file named [combined.per_site_raw_feature.5mer-v3.csv.txt] using the demo data from EpiNano 1.2. Please see the attached file.

combined.per_site_raw_feature.5mer-v3.csv.txt contains the following 36 columns:

Kmer | Window | Ref | Strand | Coverage | q1 | q2 | q3 | q4 | q5 | mis1 | mis2 | mis3 | mis4 | mis5 | ins1 | ins2 | ins3 | ins4 | ins5 | del1 | del2 | del3 | del4 | del5 | I1 | I2 | I3 | I4 | I5 | D1 | D2 | D3 | D4 | D5 | sample

Then, we ran the following command:

python3.8 /EpiNano-master/Epinano_Predict.py \ --train combined.per_site_raw_feature.5mer-v3.csv.txt \ --predict combined.per_site_raw_feature.5mer-v3.csv.txt \ --accuracy_estimation --out_prefix train_and_test \ --columns 8,13,23 --modification_status_column 36

We got train_and_test.q3.mis3.del3.MODEL.best-kernel.poly.accuracy

more train_and_test.q3.mis3.del3.MODEL.best-kernel.poly.accuracy

Best accuracy 50.324675324675326 % obtained with kernel = poly 49.61629279811098 % accuracy obtained with kernel = linear 49.79338842975206 % accuracy obtained with kernel = rbf 49.85242030696576 % accuracy obtained with kernel = sigmoid

combined.per_site_raw_feature.5mer-v3.csv.txt

Huanle commented 3 years ago

Hi @emanlee , Sorry for my late reply.

But i need more details.

Which version of epinano are you using?
What dataset did you use? Your own data or the example test data?
If you are using your own data. Are you completely sure about the modification status (prior knowledge required for traiing/testing) in your data?
How did you get combined.per_site_raw_feature.5mer-v3.csv.txt? 
Did you follow the commands in **train_models/train_test.sh**?
Can you show me your commands?
Have you tried training and testing with other features? 

Best, Huanle

emanlee commented 3 years ago

Thanks a lot for your response!

Which version of epinano are you using? -- EpiNano 1.2

What dataset did you use? Your own data or the example test data? -- demo datasets from EpiNano 1.2

Can you show me your commands? -- We used demo datasets provided by EpiNano 1.2. Our steps followed Readme and Run.sh. All steps are as follows:

python3.8 /home/aimin/NanoRNAedit/EpiNano/EpiNano-master/Epinano_Variants.py -t 6 -R /home/aimin/NanoRNAedit/EpiNano/EpiNano-master/test_data/make_predictions/ref.fa -b /home/aimin/NanoRNAedit/EpiNano/EpiNano-master/test_data/make_predictions/wt.bam -s /home/aimin/NanoRNAedit/EpiNano/EpiNano-master/misc/sam2tsv.jar --type t

python3.8 /home/aimin/NanoRNAedit/EpiNano/EpiNano-master/Epinano_Variants.py -t 6 -R /home/aimin/NanoRNAedit/EpiNano/EpiNano-master/test_data/make_predictions/ref.fa -b /home/aimin/NanoRNAedit/EpiNano/EpiNano-master/test_data/make_predictions/ko.bam -s /home/aimin/NanoRNAedit/EpiNano/EpiNano-master/misc/sam2tsv.jar --type t

/home/aimin/NanoRNAedit/EpiNano/EpiNano-master/Epinano_Current.sh -b /home/aimin/NanoRNAedit/EpiNano/EpiNano-master/test_data/make_predictions/wt.bam -r /home/aimin/NanoRNAedit/EpiNano/EpiNano-master/test_data/make_predictions/wt.fastq -f /home/aimin/NanoRNAedit/EpiNano/EpiNano-master/test_data/make_predictions/ref.fa -t 6 -m t -d /home/aimin/NanoRNAedit/EpiNano/EpiNano-master/test_data/make_predictions/wt_fast5/

/home/aimin/NanoRNAedit/EpiNano/EpiNano-master/Epinano_Current.sh -b /home/aimin/NanoRNAedit/EpiNano/EpiNano-master/test_data/make_predictions/ko.bam -r /home/aimin/NanoRNAedit/EpiNano/EpiNano-master/test_data/make_predictions/ko.fastq -f /home/aimin/NanoRNAedit/EpiNano/EpiNano-master/test_data/make_predictions/ref.fa -t 6 -m t -d /home/aimin/NanoRNAedit/EpiNano/EpiNano-master/test_data/make_predictions/ko_fast5/

python3.8 ../../misc/Join_variants_currents.py --variants wt.plus_strand.per_site.5mer.csv --intensity wt.eventalign.tsv.gz.forward_events.collapsed/Intensity.collapsed.tsv.5mer.csv --outfile wt.err.and.intensity.5mer.csv

python3.8 ../../misc/Join_variants_currents.py --variants ko.plus_strand.per_site.5mer.csv --intensity ko.eventalign.tsv.gz.forward_events.collapsed/Intensity.collapsed.tsv.5mer.csv --outfile ko.err.and.intensity.5mer.csv

wc -l ko.err.and.intensity.5mer.csv head -3389 ko.err.and.intensity.5mer.csv > ko.err.and.intensity.5mer-v2.csv

wc -l wt.err.and.intensity.5mer.csv head -3389 wt.err.and.intensity.5mer.csv > wt.err.and.intensity.5mer-v2.csv

sh ../../misc/Epinano_LabelSamples.sh -m ko.err.and.intensity.5mer-v2.csv -u wt.err.and.intensity.5mer-v2.csv -o combined.per_site_raw_feature.5mer-v2.csv

sed 's/mod/1/g' combined.per_site_raw_feature.5mer-v2.csv | sed 's/unm/0/g' > combined.per_site_raw_feature.5mer-v3.csv

python3.8 -m pdb /home/aimin/NanoRNAedit/EpiNano/EpiNano-master/Epinano_Predict.py \ --train combined.per_site_raw_feature.5mer-v3.csv \ --predict combined.per_site_raw_feature.5mer-v3.csv \ --accuracy_estimation --out_prefix train_and_test \ --columns 8,13,23 --modification_status_column 36

Huanle commented 3 years ago

Hi @emanlee,

There is nothing wrong with your commands. But before you proceeded to the training and testing commands, you labeled the modification stautus wrong!!

As it says in the readme file, this example data has only 2 pseudo U modifications at sites 2826 and 2880, while you labeled all sites from the wild type sample as being modified. This will definitely fool the algorithm.

This small data set, in terms of both total number of sites and previously known modified sites, is not appropritate for training models.

I included it in the make_predictions folder simply for the users to quickly run commands in run.sh through. Another thing i want to demenstrate is that a model trained with delta- feature(s) and sum-err feature(s) tend to be more generall applicable. That is to say, even though you trained your models with m6A modifications. You can still apply them to predict alternative types of modifications (e.g., psU) and very likely (i did not test it intensively with many enough types of modifications) acurate enough. I have metioned this in the Wiki.

If you are interested in training models and assesing models' accuracies, please use data in train_models folder. It is also a small dataset, however, it has all A sites modified in the wild type sample.

Please let me know if this clears up your concern.

emanlee commented 3 years ago

Hi, Dr. Huanle Liu

Thanks a lot for your kind reply!

Another concern is about this file combined.per_site_raw_feature.5mer-v3.csv

The first column is k-mer, such as GTTTG, TTTGA, TTGAC, TGACC, CAAAT, etc. Some 5-mers do not have 'A' in the third place, i.e. GTTTG. And some 5-mers have several 'A's, i.e. CAAAT. If we use these 5-mers as training samples, "--columns 8,13,23" as features, we might not get what we want.

Huanle commented 3 years ago

Hi, Dr. Huanle Liu

Thanks a lot for your kind reply!

Another concern is about this file combined.per_site_raw_feature.5mer-v3.csv

The first column is k-mer, such as GTTTG, TTTGA, TTGAC, TGACC, CAAAT, etc. Some 5-mers do not have 'A' in the third place, i.e. GTTTG. And some 5-mers have several 'A's, i.e. CAAAT. If we use these 5-mers as training samples, "--columns 8,13,23" as features, we might not get what we want.

Hi @emanlee , Feel free to to filter out the non-RRACH kerms before or after making predictions, if you are using a model trained with RRACH motifs. -columns should change according to what features you used to train the model.
Just give the numebrs coressponding to the columns containing the relevant features.