novoalab / EpiNano

Detection of RNA modifications from Oxford Nanopore direct RNA sequencing reads (Liu*, Begik* et al., Nature Comm 2019)
GNU General Public License v2.0

Some problems about RRACH #100

Closed JeremyQuo closed 2 years ago

JeremyQuo commented 2 years ago

In the paper, I see that you used RRACH 5-mers for much of the statistical analysis, but I don't see any RRACH-specific operations in your code. Here are my questions about it.

In test_data/make_predictions/run.sh

  1. How did you get the rrach.q3.mis3.del3.linear.dump used in line 46? I think something may be missing. What was your training data, and did you extract all the RRACH data? If so, how many RRACH 5-mers did you use, or what was the ratio of RRACH 5-mers to all 5-mers?

  2. I want to know whether the non-RRACH 5-mers need to be dropped before running Epinano_Predict.py, because your commands and code do not seem to do that, or I missed it somewhere. Could you point me to the part that deals with RRACH?

Huanle commented 2 years ago

Hi @JeremyQuo ,

I am awfully sorry for the late reply.

The RRACH motif can be easily selected with a regular expression: /[AG][AG]AC[ACT]/ on the forward strand or /[TGA]GT[TC][TC]/ on the reverse strand.
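As a quick sanity check of the two patterns above, the following minimal Python sketch enumerates all 1024 possible 5-mers, counts how many match the forward-strand pattern, and confirms that the reverse-strand pattern matches exactly their reverse complements. The helper name `revcomp` and the variable names are illustrative only and are not taken from the EpiNano code.

```python
import re
from itertools import product

# Patterns quoted in the reply above.
RRACH_FWD = re.compile(r"[AG][AG]AC[ACT]")  # forward strand
RRACH_REV = re.compile(r"[TGA]GT[TC][TC]")  # reverse strand


def revcomp(seq):
    """Return the reverse complement of a DNA sequence (illustrative helper)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


# Every possible 5-mer over the DNA alphabet.
all_5mers = ["".join(p) for p in product("ACGT", repeat=5)]

fwd = [k for k in all_5mers if RRACH_FWD.fullmatch(k)]
rev = [k for k in all_5mers if RRACH_REV.fullmatch(k)]

print(len(fwd), len(all_5mers))                        # 12 1024
print(sorted(revcomp(k) for k in fwd) == sorted(rev))  # True
```

So 12 of the 1024 possible 5-mers are RRACH. This is a statement about the k-mer space only; the fraction of RRACH rows in an actual feature table depends on the reference sequences and coverage.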

1. The rrach.q3.mis3.del3.linear.dump model can be trained using the features denoted by its name, i.e. the quality score, mismatch frequency, and deletion frequency of the 3rd (middle) base in the 5-mer. You can follow the steps in the train_models folder to train the model.

2. You need to drop non-RRACH results if the model you used was trained with data containing only RRACH motifs. You can filter them out either before or after making predictions; it does not matter.
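The filtering step described above could look roughly like the sketch below. It assumes a CSV-like table with one column holding the 5-mer sequence; the column name `kmer`, the function name `keep_rrach`, and the example values are made up for illustration and do not reflect the actual EpiNano output header, so adjust `kmer_column` to match your file.

```python
import csv
import io
import re

RRACH = re.compile(r"[AG][AG]AC[ACT]")  # forward-strand pattern from the reply


def keep_rrach(rows, kmer_column="kmer"):
    """Yield only the rows whose k-mer column is an RRACH 5-mer.

    `kmer_column` is a hypothetical column name; change it to whatever
    your feature or prediction table actually uses.
    """
    for row in rows:
        if RRACH.fullmatch(row[kmer_column].upper()):
            yield row


# Made-up example table standing in for a 5-mer feature file.
example = io.StringIO(
    "kmer,q3,mis3,del3\n"
    "GGACT,9.1,0.20,0.05\n"
    "TTTTT,12.3,0.01,0.00\n"
    "AAACA,10.4,0.15,0.02\n"
)

kept = list(keep_rrach(csv.DictReader(example)))
print([row["kmer"] for row in kept])  # ['GGACT', 'AAACA']
```

Because the filter only looks at the k-mer column, the same function can be applied to the feature table before prediction or to the prediction output afterwards, as long as that column is present in both.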

Hope this helps and please let me know if you need further help or clarification.

JeremyQuo commented 2 years ago

Dear Prof.,

Thanks for your reply.

I get it.

Did you use rep1 and rep2 from the repo to train your model?

I filtered for the RRACH motif, but the number of rows left is not that large, so I would like to know the number of rows you used to train your model.

Best regards,
Zhihao


Huanle commented 2 years ago

Hi @JeremyQuo ,

The models included in this repo were trained with published data (doi: 10.1038/s41467-019-11713-9). The example data for users to play with is only a small subset of the published data. You can download the whole dataset provided in the manuscript and perform training on it. Please let me know if you need more help. All the best!

JeremyQuo commented 2 years ago

Many thanks for your answer.

JeremyQuo commented 2 years ago

Actually, I trained a new algorithm and want to test it on your data. It works well on the example data, so I would like to get all the RRACH data. I tried to rerun run.sh to generate 5mer.csv from your raw data (https://trace.ncbi.nlm.nih.gov/Traces/sra/?study=SRP174366), SAMN10640338/SAMN10640337, and ran into some new issues.

Here are my commands:

    _bin/guppy_basecaller -c rna_r9.4.1_70bps_hac.cfg --compress_fastq -i ./fast5/ -r -s ./mod_fastq/ --fast5_out -x 'auto'
    cat /.fast.gz > mod_fastq.gz
    gzip -d mod.fastq.gz
    minimap2 --MD -t 6 -ax map-ont cc.fasta mod.fastq | samtools view -hbS -F 3844 - | samtools sort -@ 6 -o mod.bam
    samtools index mod.bam
    python ../../Epinano_Variants.py -R cc.fasta -b mod.bam -n 6 -T t -s ../../misc/sam2tsv.jar
    python ../../misc/Slide_Variants.py ko.plusstrand.per.site.csv 5
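For readers checking how many reads survive the alignment step, the `-F 3844` value passed to `samtools view` is a bitmask of SAM flags to exclude. The small sketch below decodes it using the flag bits defined in the SAM specification; the dictionary and function names are illustrative and not part of samtools or EpiNano.

```python
# SAM flag bits as defined in the SAM format specification.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper pair",
    0x4: "unmapped",
    0x8: "mate unmapped",
    0x10: "reverse strand",
    0x20: "mate reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "secondary alignment",
    0x200: "failed QC",
    0x400: "duplicate",
    0x800: "supplementary alignment",
}


def decode_flag(value):
    """List the names of all SAM flag bits set in `value`, lowest bit first."""
    return [name for bit, name in sorted(SAM_FLAGS.items()) if value & bit]


print(decode_flag(3844))
# ['unmapped', 'secondary alignment', 'failed QC', 'duplicate', 'supplementary alignment']
```

In other words, this filter keeps only primary mapped alignments that passed QC and are not marked as duplicates; reads on the reverse strand (0x10) are not excluded by it.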

The result has no more than 10k rows, which is fewer than your example data. I would like to know what is wrong with my commands.

Besides, could you send me your entire 5mer.csv for unmod/mod, or tell me the exact number of rows for your RRACH 5-mers?

Many thanks.

Huanle commented 2 years ago

Hi @JeremyQuo , did you combine both mod and unmod data after you got the features organized in 5mer format? I am out of the office so I cannot get the data you are asking for.

JeremyQuo commented 2 years ago

Nope, but I think it will be 20k rows after combining, which is the same as the sample data in the repo. However, you told me that the sample data in the repo is a subset, so I would like to obtain more data to train and test. Or does that mean these are all the 5-mer features from this raw data?

Huanle commented 2 years ago

Hi @JeremyQuo, you need to have both modified (mod) and unmodified (unm) data to do training.
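Combining the two feature tables before training could be sketched as below. This is only an illustration under stated assumptions: the column names, the label column `sample`, the label strings, and the function `label_and_combine` are hypothetical, the example values are made up, and the actual layout expected by the training scripts should be taken from the train_models folder.

```python
import csv
import io


def label_and_combine(mod_rows, unm_rows, label_column="sample"):
    """Tag each row with its origin and return one combined list.

    `label_column` and the labels "mod"/"unm" are assumptions made for
    this sketch, not names confirmed by the EpiNano code.
    """
    combined = []
    for label, rows in (("mod", mod_rows), ("unm", unm_rows)):
        for row in rows:
            labelled = dict(row)           # copy so the input rows stay untouched
            labelled[label_column] = label
            combined.append(labelled)
    return combined


# Made-up stand-ins for the modified and unmodified 5-mer feature files.
mod_csv = io.StringIO("kmer,q3,mis3,del3\nGGACT,8.2,0.31,0.07\n")
unm_csv = io.StringIO("kmer,q3,mis3,del3\nGGACT,12.9,0.03,0.01\nAAACA,13.4,0.02,0.00\n")

table = label_and_combine(csv.DictReader(mod_csv), csv.DictReader(unm_csv))
print(len(table))                        # 3
print([row["sample"] for row in table])  # ['mod', 'unm', 'unm']
```

A classifier needs examples of both classes, so a table built from only the mod run or only the unm run cannot be used for training on its own.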