novoalab / EpiNano

Detection of RNA modifications from Oxford Nanopore direct RNA sequencing reads (Liu*, Begik* et al., Nature Comm 2019)
GNU General Public License v2.0

How to get my own train file #7

Closed (iaunicorn closed this issue 5 years ago)

iaunicorn commented 5 years ago

Hi, I used your scripts on my ONT-Met dataset a few weeks ago. Fortunately, I got some results, such as a file named per_site.var.current.csv. What confuses me is that I have no idea how to build my own training file from per_site.var.current.csv. Or perhaps I have misunderstood the SVM.py script. Thanks, I really need your help.

Huanle commented 5 years ago

Hi @iaunicorn,

Thanks for using EpiNano.

To train your own model(s), I assume that you have a dataset with prior knowledge of modification status, i.e., rows labelled 'mod' and 'unm', just like the example input files.

With the provided example files, running the following commands will generate trained model files using a single feature, i.e., the quality score at the 3rd position of a 5-mer.

This command trains with the sample1 data and tests/makes predictions with the sample2 data. '-a' reports the prediction accuracy of the trained model:

python scripts/main/SVM.py -t examples/svm_input/sample1.csv -p examples/svm_input/sample2.csv -cl 3 -mc 11 -a

This command trains with sample2 and tests with sample1:

python scripts/main/SVM.py -t examples/svm_input/sample2.csv -p examples/svm_input/sample1.csv -cl 3 -mc 11 -a

This command trains and tests with the same sample (when you have 'enough' data):

python scripts/main/SVM.py -t examples/svm_input/sample1.csv -p examples/svm_input/sample1.csv -cl 3 -mc 11 -a

Remember to play with the '-cl' option, which lets you use a different feature, or a combination of features, when training with your own data.
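As a rough illustration of that advice, the sketch below builds one SVM.py command line per candidate feature column so the runs can be compared. It uses only the flags that appear in the commands above (-t, -p, -cl, -mc, -a). The helper name build_svm_command and the candidate column numbers are hypothetical, the value 11 for '-mc' is simply copied from the example commands, and the syntax for combining several columns in one '-cl' value is not shown in this thread, so check the script's own help text before trying combinations.

```python
import shlex

SCRIPT = "scripts/main/SVM.py"  # path as used in the example commands


def build_svm_command(train_csv, predict_csv, feature_col, mc=11):
    """Return the argument list for one SVM.py run on a single feature column.

    mc=11 is copied from the example commands; change it if your own CSV
    needs a different '-mc' value (assumption, not documented in this thread).
    """
    return [
        "python", SCRIPT,
        "-t", train_csv,
        "-p", predict_csv,
        "-cl", str(feature_col),
        "-mc", str(mc),
        "-a",
    ]


if __name__ == "__main__":
    # Hypothetical candidate columns; which feature each one holds depends on your CSV.
    for col in (3, 4, 5):
        cmd = build_svm_command(
            "examples/svm_input/sample1.csv",
            "examples/svm_input/sample2.csv",
            col,
        )
        # Printed for inspection; pass cmd to subprocess.run(cmd, check=True) to execute.
        print(shlex.join(cmd))
```

Printing the commands first makes it easy to check them by eye before running anything on a large CSV.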

I hope this helps and look forward to any further questions you might have.

iaunicorn commented 5 years ago

Hi, thanks. I think I have 'enough' data, in the sense that my CSV file is big. But I don't know how to use my data for training if I don't have a dataset with prior knowledge of modification status. Could I build my own labelled dataset by going through my CSV file and manually setting the last column, named "sample", to "unmod" for every row whose k-mer in the first column contains no "A" base?

Huanle commented 5 years ago

@iaunicorn You cannot arbitrarily label your samples and use them for training. You can label motifs containing no 'A' base as 'unm' if you are sure that the non-A bases, i.e., G/C/T, are not modified at all, but you still lack knowledge of the modification status of motifs containing 'A'. So you should not train with your files in this scenario.
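To make the problem concrete, here is a minimal sketch of what the proposed labelling rule produces. The k-mers and the helper name label_by_kmer are made up for illustration and are not taken from EpiNano. Even granting the assumption that non-A bases are never modified, the rule yields only 'unm' labels and unknowns, never a 'mod' label, so a two-class model has no examples of the 'mod' class to learn from.

```python
def label_by_kmer(kmer):
    """Apply the proposed rule: k-mers without 'A' are 'unm', the rest stay unknown."""
    return "unm" if "A" not in kmer.upper() else None


# Hypothetical 5-mers standing in for the first column of a feature CSV.
kmers = ["GGCTT", "CCGTC", "GGACT", "TTAGC", "CTTGG"]

labels = {k: label_by_kmer(k) for k in kmers}
known_classes = {lab for lab in labels.values() if lab is not None}
unlabelled = [k for k, lab in labels.items() if lab is None]

print("classes available for training:", sorted(known_classes))
print("k-mers with unknown status:", unlabelled)

# Training a mod/unm classifier needs labelled examples of both classes.
can_train_binary = {"mod", "unm"} <= known_classes
print("can train a mod/unm model:", can_train_binary)
```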

iaunicorn commented 5 years ago

@Huanle Thanks again. I think the command "python2 assign_current_to_per_read_kmer.py output.event.tbl.features.readposition.adj.csv > per_read.var.current.csv" may be incomplete, because another input file seems to be missing if I run it as written. Could you tell me exactly what it should be?

iaunicorn commented 5 years ago

@Huanle Hi, in the section "To extract features from basecalled FASTQ files", if we have already changed 'U' to 'T' in the second step, why do you map mod.h5t3.fastq rather than mod.U2T.fastq with minimap2?

Huanle commented 5 years ago

Hi @iaunicorn, sorry for the confusion. You are right: you should map the converted reads to the reference sequences.
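For readers following along, a minimal sketch of the U-to-T conversion step is given here. It is not the EpiNano script itself: the function name u_to_t_fastq is hypothetical, and the code assumes a plain FASTQ with exactly four lines per record. Only the sequence line of each record is changed, because a 'U' on the quality line can be a legitimate quality character rather than a base. The file names in the usage comment are the ones from the question.

```python
import io


def u_to_t_fastq(src, dst):
    """Copy FASTQ records from src to dst, replacing U/u with T/t in sequence lines only.

    Assumes exactly four lines per record (header, sequence, '+', quality).
    """
    table = str.maketrans("Uu", "Tt")
    for i, line in enumerate(src):
        if i % 4 == 1:  # the second line of each record is the sequence
            line = line.translate(table)
        dst.write(line)


if __name__ == "__main__":
    # Tiny in-memory demo; the quality string deliberately contains a 'U'.
    demo = io.StringIO("@read1\nACGUUGCA\n+\nIIUIIIII\n")
    out = io.StringIO()
    u_to_t_fastq(demo, out)
    print(out.getvalue(), end="")
    # With real files (names taken from the question):
    # with open("mod.h5t3.fastq") as src, open("mod.U2T.fastq", "w") as dst:
    #     u_to_t_fastq(src, dst)
```

The converted file (mod.U2T.fastq in the question) is then the one to map against the reference with minimap2.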

iaunicorn commented 5 years ago

@Huanle Hi, what is the answer to my previous question about the command "python2 assign_current_to_per_read_kmer.py output.event.tbl.features.readposition.adj.csv > per_read.var.current.csv"?

Huanle commented 5 years ago

@iaunicorn, what is missing from that command is the slided per-read variants data. BTW, I'd like to remind you to be wary of using current intensity data in your analyses, because this metric is quite variable.