Closed aman21392 closed 4 years ago
Hi @aman21392 ,
What are you trying to do? Train models? Since you switched on '-t', I guess so. '-cl 1-5' tells the program to use the features contained in columns 1-5 for training, and '-mc 11' tells it that column 11 contains prior knowledge of modification status. Can you please check whether this is really what you want to do, and whether your input file contains the information you have specified?
Hi,
I have my own data, so I just followed the steps on your wiki page and started the analysis. My fastq files were basecalled with Guppy. Here are all the commands I have used so far:
1- samtools faidx homotranscript.fa
2- java -jar picard.jar CreateSequenceDictionary R= homotranscript.fa
3- minimap2 -ax map-ont -t 40 homotranscript.fa combined.fastq | samtools view -@40 -hSb - |
samtools sort -@ 40 -o combined.bam
4- samtools index combined.bam
5- samtools view -h -F 3844 combined.bam | java -jar sam2tsv.jar -r homotranscript.fa > combined.tsv
6- python TSV_to_Variants_Freq.py3 -f combined.tsv -t 40
(Two types of csv file are generated: combined.tsv.per.site.var.per_site_var.5mer.csv and combined.tsv.per.site.var.csv.)
combined.tsv.per.site.var.per_site_var.5mer.csv contains the following columns: Kmer,Window,Ref,Coverage,q1,q2,q3,q4,q5,mis1,mis2,mis3,mis4,mis5,ins1,ins2,ins3,ins4,ins5,del1,del2,del3,del4,del5 (so I used -cl 5-9, and I did not use -mc because there is no prior knowledge about modification status).
7- SVM.py -p combined.tsv.per.site.var.per_site_var.5mer.csv -cl 5-9 -o infect_test
Colunms-used: 5-9 output: combined_test.q1.q2.q3.q4.q5.SVM
Traceback (most recent call last):
File "/home/aclab/apps/EpiNano-epinano1.1.1/scripts/SVM.py", line 118, in
So please tell me what I did wrong this time.
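The -cl 5-9 choice above can be checked against the header before running anything. A minimal sketch, assuming -cl takes a 1-based, inclusive column range (which is what the "Colunms-used" output lines in this thread suggest); `columns_for` is a hypothetical helper and not part of EpiNano:

```python
# Header of the per_site_var.5mer.csv file, as quoted in the post above.
HEADER = ("Kmer,Window,Ref,Coverage,q1,q2,q3,q4,q5,"
          "mis1,mis2,mis3,mis4,mis5,ins1,ins2,ins3,ins4,ins5,"
          "del1,del2,del3,del4,del5").split(",")


def columns_for(spec, header):
    """Return the header names selected by a spec such as '5-9' or '11'.

    Hypothetical helper: it assumes 1-based, inclusive numbering and does
    not call or reproduce any EpiNano code.
    """
    first, _, last = spec.partition("-")
    start = int(first)
    stop = int(last) if last else start
    if not 1 <= start <= stop <= len(header):
        raise ValueError(f"column spec {spec!r} is outside 1-{len(header)}")
    return header[start - 1:stop]


for spec in ("5-9", "1-5", "11"):
    print(spec, "->", ",".join(columns_for(spec, HEADER)))
```

Under that assumption, 5-9 selects the five quality features q1 to q5, 1-5 starts at the Kmer column, and 11 points at mis2 in this file rather than at a modification-status column.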
Hi @aman21392 ,
I still do not know exactly what you are trying to do.
Given what you have, it seems you can only make predictions.
But your command
SVM.py -p combined.tsv.per.site.var.per_site_var.5mer.csv -cl 5-9 -o infect_test
does not tell the program which model to use.
You should provide the program with a model that is trained with 'q1,q2,q3,q4,q5'.
Hi Huanle,
I trained my data (combined.tsv.per.site.var.per_site_var.5mer.csv) with your sample1.csv, which is in the example files of your EpiNano program, using the following command:
python3 SVM.py -a -t epinano/test_sample/sample1.csv -p infected.tsv.per.site.var.per_site_var.5mer.csv -cl 5-9 -mc 11 -o infect_test
This produced 4 model.dump files and 1 csv (sigmoid) file. With the trained model, I then ran the prediction command to get trained.prediction:
python3 SVM.py -a -M infect_test.q1.q2.q3.q4.q5.SVM.sigmoid.model.dump -p infect_test.q1.q2.q3.q4.q5.SVM.kernel.sigmoid.csv -cl 5-9 -mc 11 -o trained.prediction
I get one output file from this command, and I am wondering whether what I get is correct. Its header is:
Kmer,Window,Ref,Coverage,q1,q2,q3,q4,q5,mis1,mis2,mis3,mis4,mis5,ins1,ins2,ins3,ins4,ins5,del1,del2,del3,del4,del5,prediction,dist,ProbM,ProbU,prediction,dist,ProbM,ProbU
As you can see, the prediction,dist,ProbM,ProbU columns are repeated, and both copies hold the same values, as the last 8 columns below show:
AGGCA,197:198:199:200:201,ENST00000493034,1.0:1.0:1.0:1.0:1.0,36.0,30.0,26.0,22.0,26.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,unm,0.8570472598075867,0.5054877137094359,0.4945122862905641,unm,0.8570472598075867,0.5054877137094359,0.4945122862905641
TATCG,1072:1073:1074:1075:1076,ENST00000373719,1.0:1.0:1.0:1.0:1.0,15.0,13.0,20.0,13.0,11.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,unm,0.8570472655337213,0.5054877137546756,0.4945122862453243,unm,0.8570472655337213,0.5054877137546756,0.49451228624532434
CCAAT,1990:1991:1992:1993:1994,ENST00000569510,1.0:1.0:1.0:1.0:1.0,29.0,26.0,28.0,30.0,17.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,unm,0.8570472598075867,0.5054877137094359,0.4945122862905641,unm,0.8570472598075867,0.5054877137094359,0.4945122862905641
GCCCC,445:446:447:448:449,ENST00000311549,1.0:1.0:1.0:1.0:1.0,20.0,0.0,24.0,23.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,unm,0.8570472598148626,0.5054877137094933,0.4945122862905064,unm,0.8570472598148626,0.5054877137094933,0.4945122862905064
TGCAT,1721:1722:1723:1724:1725,ENST00000479279,1.0:1.0:1.0:1.0:1.0,11.0,19.0,5.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,unm,0.8570510284788138,0.5054877434840569,0.4945122565159431,unm,0.8570510284788138,0.5054877434840569,0.4945122565159431
GGAGA,2065:2066:2067:2068:2069,ENST00000535968,1.0:1.0:1.0:1.0:1.0,0.0,15.0,27.0,24.0,23.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,mod,-2255.055870198594,3.00000089999998e-14,0.99999999999997,mod,-2255.055870198594,3.00000089999998e-14,0.9999999999999699
TTCTT,7333:7334:7335:7336:7337,ENST00000552994,1.0:1.0:1.0:1.0:1.0,0.0,0.0,0.0,16.0,23.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,mod,-2567.778238882477,3.00000089999998e-14,0.99999999999997,mod,-2567.778238882477,3.00000089999998e-14,0.9999999999999699
TTCCG,300:301:302:303:304,ENST00000324106,1.0:1.0:1.0:1.0:1.0,0.0,19.0,13.0,10.0,11.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,mod,-2825.1720076680713,3.00000089999998e-14,0.99999999999997,mod,-2825.172007668071,3.00000089999998e-14,0.9999999999999699
I just want to know whether my result looks correct or whether something is wrong. I ask because I trained the model with my own data, my fastq comes from Guppy basecalling, and I saw this line in your README file: "If you are using Guppy base-called fast5/fastq, you can still use EpiNano to extract features (i.e. 'errors'), but the SVM predictions (ProbM) will not be accurate." So does this mean we cannot say my result is accurate? Thanks.
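The repeated prediction,dist,ProbM,ProbU block can be checked mechanically. A minimal sketch using only the Python standard library: it finds header names that occur more than once and tests whether the copies agree, treating numbers as equal up to floating-point rounding. The helper names are hypothetical, and the sample rows are three rows from the output above, trimmed to Kmer plus the last eight columns. One possible reading, not confirmed in this thread, is that the file passed to -p already contained a prediction block, so the second run appended another one.

```python
import csv
import io
import math
from collections import defaultdict

# Three rows from the output above, trimmed to Kmer plus the last eight columns.
SAMPLE = """\
Kmer,prediction,dist,ProbM,ProbU,prediction,dist,ProbM,ProbU
AGGCA,unm,0.8570472598075867,0.5054877137094359,0.4945122862905641,unm,0.8570472598075867,0.5054877137094359,0.4945122862905641
TATCG,unm,0.8570472655337213,0.5054877137546756,0.4945122862453243,unm,0.8570472655337213,0.5054877137546756,0.49451228624532434
TTCCG,mod,-2825.1720076680713,3.00000089999998e-14,0.99999999999997,mod,-2825.172007668071,3.00000089999998e-14,0.9999999999999699
"""


def duplicated_columns(header):
    """Map each repeated column name to the 0-based positions where it occurs."""
    positions = defaultdict(list)
    for index, name in enumerate(header):
        positions[name].append(index)
    return {name: idx for name, idx in positions.items() if len(idx) > 1}


def same_value(a, b):
    """Compare two cells: numerically if both parse as floats, else as text."""
    try:
        return math.isclose(float(a), float(b), rel_tol=1e-9)
    except ValueError:
        return a == b


def copies_agree(rows, duplicates):
    """True if every repeated column matches its first copy in every row."""
    return all(
        same_value(row[idx[0]], row[other])
        for row in rows
        for idx in duplicates.values()
        for other in idx[1:]
    )


reader = csv.reader(io.StringIO(SAMPLE))
header = next(reader)
rows = list(reader)
duplicates = duplicated_columns(header)
print("repeated columns:", sorted(duplicates))
print("copies agree:", copies_agree(rows, duplicates))
```

The tolerance matters here because some copies differ only in the last printed digit, for example 0.4945122862453243 and 0.49451228624532434.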
I think I should not have trained on my data with your sample1.csv file. Is that correct? I ask because when I ran the prediction command, ProbM and ProbU were the same for all k-mers. Is there any other way to train on my data? Thanks
Hi @aman21392, if you do not have known modified and unmodified data, there is no way you can train your own models. sample[12].csv are toy files to play with, not for training. That said, you can use our pretrained models in the models folder. If you are keen on trying out the training commands, you can download our published curlcakes data and go ahead by following the wiki instructions. Hope this helps, and I look forward to helping more.
You can do training and prediction simultaneously.
I used the Guppy basecalling software to get my fast5 and fastq files, and I am using the new EpiNano release, version 1.1.1, so I do not understand why it gives this error. Please suggest how to solve this problem. Command:
SVM.py -a -t infected.tsv.per.site.var.per_site_var.5mer.csv -p infected.tsv.per.site.var.per_site_var.5mer.csv -cl 1-5 -mc 11 -o infect_test
Colunms-used: 1-5 output: infect_test.#Kmer.Window.Ref.Coverage.q1.SVM
Traceback (most recent call last):
File "/home/aclab/apps/EpiNano-epinano1.1.1/scripts/SVM.py", line 153, in
model_fit = model.fit (X_train, y_train)
File "/usr/lib/python3/dist-packages/sklearn/svm/base.py", line 149, in fit
X, y = check_X_y(X, y, dtype=np.float64, order='C', accept_sparse='csr')
File "/usr/lib/python3/dist-packages/sklearn/utils/validation.py", line 573, in check_X_y
ensure_min_features, warn_on_dtype, estimator)
File "/usr/lib/python3/dist-packages/sklearn/utils/validation.py", line 433, in check_array
array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: could not convert string to float: '1.0:1.0:1.0:1.0:1.0'
Thank you in advance.
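The ValueError above is what Python's float conversion raises on a value such as '1.0:1.0:1.0:1.0:1.0', which is how the Coverage field looks in this file. A minimal sketch that reports which selected columns would fail that conversion, assuming -cl columns are 1-based and inclusive (as the "Colunms-used" line suggests). `non_numeric_columns` is a hypothetical helper, not part of EpiNano or scikit-learn, and the row holds the first nine fields of a data row quoted elsewhere in this thread:

```python
# First nine header names and the matching fields of one data row from this thread.
HEADER = ["Kmer", "Window", "Ref", "Coverage", "q1", "q2", "q3", "q4", "q5"]
ROW = ["AGGCA", "197:198:199:200:201", "ENST00000493034",
       "1.0:1.0:1.0:1.0:1.0", "36.0", "30.0", "26.0", "22.0", "26.0"]


def is_float(text):
    """True if float(text) succeeds, i.e. the cell can be used as a numeric feature."""
    try:
        float(text)
    except ValueError:
        return False
    return True


def non_numeric_columns(first, last, header, row):
    """Names of the columns in the 1-based inclusive range whose value is not a float."""
    return [header[i] for i in range(first - 1, last) if not is_float(row[i])]


print("-cl 1-5:", non_numeric_columns(1, 5, HEADER, ROW))
print("-cl 5-9:", non_numeric_columns(5, 9, HEADER, ROW))
```

For this row, the range 1-5 includes four text columns (Kmer, Window, Ref, Coverage), while 5-9 contains only numeric values.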