novoalab / EpiNano

Detection of RNA modifications from Oxford Nanopore direct RNA sequencing reads (Liu*, Begik* et al., Nature Comm 2019)
GNU General Public License v2.0

Regarding Epinano command issue #48

Closed aman21392 closed 4 years ago

aman21392 commented 4 years ago

I used Guppy for basecalling to get the fast5 and fastq files, and I am using the new EpiNano 1.1.1 release, so I don't understand why it gives the error below. Please suggest how to solve this problem.

Command:
SVM.py -a -t infected.tsv.per.site.var.per_site_var.5mer.csv -p infected.tsv.per.site.var.per_site_var.5mer.csv -cl 1-5 -mc 11 -o infect_test

Colunms-used: 1-5
output: infect_test.#Kmer.Window.Ref.Coverage.q1.SVM
Traceback (most recent call last):
  File "/home/aclab/apps/EpiNano-epinano1.1.1/scripts/SVM.py", line 153, in <module>
    model_fit = model.fit (X_train, y_train)
  File "/usr/lib/python3/dist-packages/sklearn/svm/base.py", line 149, in fit
    X, y = check_X_y(X, y, dtype=np.float64, order='C', accept_sparse='csr')
  File "/usr/lib/python3/dist-packages/sklearn/utils/validation.py", line 573, in check_X_y
    ensure_min_features, warn_on_dtype, estimator)
  File "/usr/lib/python3/dist-packages/sklearn/utils/validation.py", line 433, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: could not convert string to float: '1.0:1.0:1.0:1.0:1.0'

Thank you in advance.

Huanle commented 4 years ago

Hi @aman21392 ,

What are you trying to do? Training models? Since you switched on '-t', I guess so. -cl tells the program which columns contain the features to use for training (here, columns 1-5), and -mc 11 tells the program that column 11 contains prior knowledge of the modification status. Can you please check whether this is what you want to do, and whether your input file contains the information you have specified?
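To make the column question concrete, here is a minimal sketch of how a 1-based column range maps onto the 5-mer CSV. It assumes the 24-column layout reported elsewhere in this thread (Kmer, Window, Ref, Coverage, q1-q5, mis1-mis5, ins1-ins5, del1-del5) and that -cl and -mc both count columns from 1, which the output name "Kmer.Window.Ref.Coverage.q1" for -cl 1-5 suggests. The helpers `select_columns` and `non_numeric` are hypothetical and are not part of EpiNano; the sample row is only illustrative.

```python
# Hypothetical helpers for inspecting a column range; not EpiNano code.
# Assumed layout of the per-site 5-mer CSV (24 columns).
HEADER = ["Kmer", "Window", "Ref", "Coverage"] + [
    f"{prefix}{i}" for prefix in ("q", "mis", "ins", "del") for i in range(1, 6)
]

# One illustrative row of that shape.
ROW = [
    "AGGCA", "197:198:199:200:201", "ENST00000493034", "1.0:1.0:1.0:1.0:1.0",
    "36.0", "30.0", "26.0", "22.0", "26.0",
] + ["0.0"] * 15


def select_columns(header, spec):
    """Return the column names covered by a 1-based inclusive range like '5-9'."""
    start, end = (int(part) for part in spec.split("-"))
    return header[start - 1:end]


def non_numeric(header, row, names):
    """Return the selected columns whose value in this row is not a float."""
    bad = []
    for name in names:
        try:
            float(row[header.index(name)])
        except ValueError:
            bad.append(name)
    return bad


for spec in ("1-5", "5-9"):
    names = select_columns(HEADER, spec)
    print(spec, "->", names, "| non-numeric:", non_numeric(HEADER, ROW, names))

# Under the same 1-based assumption, column 11 is an error feature, not a label.
print("column 11 ->", HEADER[11 - 1])
```

Under these assumptions, the range 1-5 pulls in Kmer, Window, Ref and Coverage, none of which can be converted to a float, which is consistent with the ValueError on '1.0:1.0:1.0:1.0:1.0'. The range 5-9 selects only q1-q5.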

aman21392 commented 4 years ago

Hi, I have my own data, so I followed the instructions on the wiki page to start the analysis. I used fastq files basecalled with Guppy. Here are all the commands I have used so far:

1- samtools faidx homotranscript.fa
2- java -jar picard.jar CreateSequenceDictionary R= homotranscript.fa
3- minimap2 -ax map-ont -t 40 homotranscript.fa combined.fastq | samtools view -@40 -hSb - | samtools sort -@ 40 -o combined.bam
4- samtools index combined.bam
5- samtools view -h -F 3844 combined.bam | java -jar sam2tsv.jar -r homotranscript.fa > combined.tsv
6- python TSV_to_Variants_Freq.py3 -f combined.tsv -t 40
(This generates two CSV files: combined.tsv.per.site.var.per_site_var.5mer.csv and combined.tsv.per.site.var.csv. The file combined.tsv.per.site.var.per_site_var.5mer.csv contains the following columns: Kmer,Window,Ref,Coverage,q1,q2,q3,q4,q5,mis1,mis2,mis3,mis4,mis5,ins1,ins2,ins3,ins4,ins5,del1,del2,del3,del4,del5. So I used -cl 5-9, and I did not use -mc because I have no prior knowledge of the modification status.)
7- SVM.py -p combined.tsv.per.site.var.per_site_var.5mer.csv -cl 5-9 -o infect_test

Colunms-used: 5-9
output: combined_test.q1.q2.q3.q4.q5.SVM
Traceback (most recent call last):
  File "/home/aclab/apps/EpiNano-epinano1.1.1/scripts/SVM.py", line 118, in <module>
    X_train, _, y_train, _, indices_train, _ = train_test_split(X,Y.values.ravel(), indices, test_size=0, random_state= 100)
  File "/usr/lib/python3/dist-packages/sklearn/model_selection/_split.py", line 2056, in train_test_split
    train, test = next(cv.split(X=arrays[0], y=stratify))
  File "/usr/lib/python3/dist-packages/sklearn/model_selection/_split.py", line 1204, in split
    for train, test in self._iter_indices(X, y, groups):
  File "/usr/lib/python3/dist-packages/sklearn/model_selection/_split.py", line 1304, in _iter_indices
    self.train_size)
  File "/usr/lib/python3/dist-packages/sklearn/model_selection/_split.py", line 1680, in _validate_shuffle_split
    'samples %d' % (test_size, n_samples))
ValueError: test_size=0 should be smaller than the number of samples 0

So please tell me what I did wrong this time.

Huanle commented 4 years ago

Hi @aman21392, I still do not know exactly what you are trying to do. Given what you have, it seems you can only make predictions. But your command SVM.py -p combined.tsv.per.site.var.per_site_var.5mer.csv -cl 5-9 -o infect_test does not tell the program which model to use. You should provide the program with a model that was trained with 'q1,q2,q3,q4,q5'.
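The flag combinations discussed so far can be summarised as a small pre-flight check. This is only a sketch of the logic as it reads from this thread (-t for training, -mc for the label column, -p for the file to predict on, -M for a trained model); `check_invocation` is a hypothetical helper and does not reflect how SVM.py validates its arguments internally.

```python
# Hypothetical pre-flight check; it mirrors the flags mentioned in this thread
# and is not SVM.py's own argument handling.
def check_invocation(train_file=None, predict_file=None,
                     model_file=None, label_column=None):
    """Return a list of problems with a planned SVM.py call (empty if none)."""
    problems = []
    if train_file is None and predict_file is None:
        problems.append(
            "nothing to do: give a training file (-t) and/or a prediction file (-p)"
        )
    if train_file is not None and label_column is None:
        problems.append(
            "training (-t) needs a label column (-mc) with known modification status"
        )
    if predict_file is not None and train_file is None and model_file is None:
        problems.append(
            "prediction (-p) needs a model: pass a trained model (-M) "
            "or train one in the same run (-t with -mc)"
        )
    return problems


# The command from the previous comment: only -p, -cl and -o were given.
print(check_invocation(predict_file="combined.tsv.per.site.var.per_site_var.5mer.csv"))
```

With only a prediction file, the check reports that no model is available, which matches the empty training set behind "number of samples 0" in the traceback.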

aman21392 commented 4 years ago

Hi Huanle, I trained a model for my data (combined.tsv.per.site.var.per_site_var.5mer.csv) using your sample1.csv, which is in the example files of your EpiNano program, with the following command:

python3 SVM.py -a -t epinano/test_sample/sample1.csv -p infected.tsv.per.site.var.per_site_var.5mer.csv -cl 5-9 -mc 11 -o infect_test

This produced 4 model.dump files and 1 CSV (sigmoid) file. With the trained model, I then ran the prediction command to get trained.prediction:

python3 SVM.py -a -M infect_test.q1.q2.q3.q4.q5.SVM.sigmoid.model.dump -p infect_test.q1.q2.q3.q4.q5.SVM.kernel.sigmoid.csv -cl 5-9 -mc 11 -o trained.prediction

I got one output file from this command, and I am wondering whether what I got is correct. Its columns are:

Kmer,Window,Ref,Coverage,q1,q2,q3,q4,q5,mis1,mis2,mis3,mis4,mis5,ins1,ins2,ins3,ins4,ins5,del1,del2,del3,del4,del5,prediction,dist,ProbM,ProbU,prediction,dist,ProbM,ProbU

As you can see, the prediction,dist,ProbM,ProbU columns are repeated, and both sets hold the same values, as the last 8 columns below show:

Kmer,Window,Ref,Coverage,q1,q2,q3,q4,q5,mis1,mis2,mis3,mis4,mis5,ins1,ins2,ins3,ins4,ins5,del1,del2,del3,del4,del5,prediction,dist,ProbM,ProbU,prediction,dist,ProbM,ProbU

AGGCA,197:198:199:200:201,ENST00000493034,1.0:1.0:1.0:1.0:1.0,36.0,30.0,26.0,22.0,26.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,unm,0.8570472598075867,0.5054877137094359,0.4945122862905641,unm,0.8570472598075867,0.5054877137094359,0.4945122862905641
TATCG,1072:1073:1074:1075:1076,ENST00000373719,1.0:1.0:1.0:1.0:1.0,15.0,13.0,20.0,13.0,11.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,unm,0.8570472655337213,0.5054877137546756,0.4945122862453243,unm,0.8570472655337213,0.5054877137546756,0.49451228624532434
CCAAT,1990:1991:1992:1993:1994,ENST00000569510,1.0:1.0:1.0:1.0:1.0,29.0,26.0,28.0,30.0,17.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,unm,0.8570472598075867,0.5054877137094359,0.4945122862905641,unm,0.8570472598075867,0.5054877137094359,0.4945122862905641
GCCCC,445:446:447:448:449,ENST00000311549,1.0:1.0:1.0:1.0:1.0,20.0,0.0,24.0,23.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,unm,0.8570472598148626,0.5054877137094933,0.4945122862905064,unm,0.8570472598148626,0.5054877137094933,0.4945122862905064
TGCAT,1721:1722:1723:1724:1725,ENST00000479279,1.0:1.0:1.0:1.0:1.0,11.0,19.0,5.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,unm,0.8570510284788138,0.5054877434840569,0.4945122565159431,unm,0.8570510284788138,0.5054877434840569,0.4945122565159431
GGAGA,2065:2066:2067:2068:2069,ENST00000535968,1.0:1.0:1.0:1.0:1.0,0.0,15.0,27.0,24.0,23.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,mod,-2255.055870198594,3.00000089999998e-14,0.99999999999997,mod,-2255.055870198594,3.00000089999998e-14,0.9999999999999699
TTCTT,7333:7334:7335:7336:7337,ENST00000552994,1.0:1.0:1.0:1.0:1.0,0.0,0.0,0.0,16.0,23.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,mod,-2567.778238882477,3.00000089999998e-14,0.99999999999997,mod,-2567.778238882477,3.00000089999998e-14,0.9999999999999699
TTCCG,300:301:302:303:304,ENST00000324106,1.0:1.0:1.0:1.0:1.0,0.0,19.0,13.0,10.0,11.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,mod,-2825.1720076680713,3.00000089999998e-14,0.99999999999997,mod,-2825.172007668071,3.00000089999998e-14,0.9999999999999699

I just want to know whether my result seems correct or whether something is wrong. I ask because I trained the model with my data, my fastq comes from Guppy basecalling, and I saw this line in your README: "If you are using Guppy base-called fast5/fastq, you can still use EpiNano to extract features (i.e. 'errors'), but the SVM predictions (ProbM) will not be accurate." So does that mean we cannot say my result is accurate? Thanks.
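The repeated columns can be checked mechanically. One plausible reading, not confirmed in this thread, is that the file passed to -p in the second command was itself the prediction output of the first run, so it already carried prediction,dist,ProbM,ProbU and the second run appended the same set again. The sketch below uses only the Python standard library and two rows taken from the output above; the helper names are hypothetical.

```python
import csv
import io
import math
from collections import defaultdict

# Rebuild the header and two rows from the posted output.
BASE = ["Kmer", "Window", "Ref", "Coverage"] + [
    f"{prefix}{i}" for prefix in ("q", "mis", "ins", "del") for i in range(1, 6)
]
PRED = ["prediction", "dist", "ProbM", "ProbU"]
ROW1 = (["TATCG", "1072:1073:1074:1075:1076", "ENST00000373719",
         "1.0:1.0:1.0:1.0:1.0", "15.0", "13.0", "20.0", "13.0", "11.0"]
        + ["0.0"] * 15
        + ["unm", "0.8570472655337213", "0.5054877137546756", "0.4945122862453243",
           "unm", "0.8570472655337213", "0.5054877137546756", "0.49451228624532434"])
ROW2 = (["TTCCG", "300:301:302:303:304", "ENST00000324106",
         "1.0:1.0:1.0:1.0:1.0", "0.0", "19.0", "13.0", "10.0", "11.0"]
        + ["0.0"] * 10 + ["1.0"] + ["0.0"] * 4
        + ["mod", "-2825.1720076680713", "3.00000089999998e-14", "0.99999999999997",
           "mod", "-2825.172007668071", "3.00000089999998e-14", "0.9999999999999699"])
CSV_TEXT = "\n".join(",".join(r) for r in (BASE + PRED + PRED, ROW1, ROW2))


def duplicate_columns(header):
    """Map each repeated column name to the list of positions where it occurs."""
    positions = defaultdict(list)
    for index, name in enumerate(header):
        positions[name].append(index)
    return {name: idx for name, idx in positions.items() if len(idx) > 1}


def same_value(a, b):
    """Compare two cells numerically when possible, otherwise as strings."""
    try:
        return math.isclose(float(a), float(b), rel_tol=1e-9)
    except ValueError:
        return a == b


def duplicates_agree(rows, dupes):
    """True if every repeated column holds the same value as its first copy."""
    return all(
        same_value(row[idx[0]], row[other])
        for row in rows for idx in dupes.values() for other in idx[1:]
    )


def drop_repeats(header, rows, dupes):
    """Keep only the first occurrence of each repeated column."""
    extra = {i for idx in dupes.values() for i in idx[1:]}
    keep = [i for i in range(len(header)) if i not in extra]
    return [header[i] for i in keep], [[row[i] for i in keep] for row in rows]


# csv.reader is used instead of csv.DictReader because DictReader would
# silently collapse columns that share a name.
reader = csv.reader(io.StringIO(CSV_TEXT))
header = next(reader)
rows = list(reader)

dupes = duplicate_columns(header)
print("repeated columns:", dupes)
print("copies agree:", duplicates_agree(rows, dupes))
clean_header, clean_rows = drop_repeats(header, rows, dupes)
print("columns after cleanup:", len(clean_header))
```

The numeric comparison uses a tolerance because the two copies differ only in the last printed digit (for example 0.4945122862453243 and 0.49451228624532434). Whether the predictions themselves are meaningful is a separate question that this check does not address.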

aman21392 commented 4 years ago

I think I should not have trained on your sample1.csv file for my data. Is that correct? I ask because when I used the prediction command, ProbM and ProbU were the same for all k-mers. Is there any other way to train a model for my data? Thanks.

Huanle commented 4 years ago

Hi @aman21392, if you do not have known modified and unmodified data, there is no way you can train your own models. sample[12].csv are toy files to play with, not for training. That said, you can use our pretrained models in the models folder. If you are keen on trying out the training commands, you can download our published curlcakes data and go ahead by following the wiki instructions. Hope this helps, and I look forward to helping more.

Huanle commented 4 years ago

"I think I should not have trained on your sample1.csv file for my data. Is that correct? I ask because when I used the prediction command, ProbM and ProbU were the same for all k-mers. Is there any other way to train a model for my data? Thanks."

You can do training and prediction simultaneously, in a single command.
