Closed BioinfoHub-PeiQinNg closed 2 years ago
Hi @BioinfoHub-PeiQinNg ,
Thanks for using EpiNano. Regarding training your own models, there is a tutorial here for your reference.
Regarding organizing feature data (question 2.1): you have total flexibility in how you arrange your data, as long as you tell the program which column(s) hold the feature(s) you want to use for training or prediction. For instance, if you want to use a single per-site feature rather than a combination of that feature across consecutive genomic/transcriptomic positions, there is no need to slide the per-site variant feature table into k-mer mode. As for question 2.2: it looks like you have supplied the wrong column index (number). Can you double-check that the column numbers you chose actually contain the features you are interested in? Can you also show the head of your input file so that I can help debug it?
Regarding your last question: question 2.2 does appear to come down to the wrong choice of column numbers. The column numbers should depend on the model you use to make predictions. For instance, if you are using a model trained on base quality, you have to choose the column that contains this feature.
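One quick way to check the column numbers is to print each column name next to its position. The snippet below is only a minimal sketch: it assumes column numbers are counted from 1, it uses a header like the one in the per-site CSV shown further down this thread, and `column_numbers` is a hypothetical helper, not part of EpiNano.

```python
import csv
import io

# Header line of a per-site CSV, as shown in this thread.
HEADER_LINE = "#Ref,pos,base,strand,cov,q_mean,q_median,q_std,mis,ins,del"


def column_numbers(header_line):
    """Map each column name to its 1-based column number."""
    names = next(csv.reader(io.StringIO(header_line)))
    # Assumption: columns are counted from 1, not from 0.
    return {name.lstrip("#"): i for i, name in enumerate(names, start=1)}


for name, number in column_numbers(HEADER_LINE).items():
    print(number, name)
```

With this header, `q_mean` comes out as column 6, `mis` as column 9 and `del` as column 11.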
I hope this clears up your doubts. Let me know if you need further help.
Best,
Hi Huan,
Thank you so much for getting back to me.
Instead of training my own models, is it possible for me to use the models in EpiNano for my dataset directly?
Please find the head of my minus_strand.per.site.csv input file below:
#Ref,pos,base,strand,cov,q_mean,q_median,q_std,mis,ins,del
2L,9499,T,-,1,17.00000,17.00000,0.00000,0.00000,0.00000,0.00000
2L,9500,A,-,1,20.00000,20.00000,0.00000,0.00000,0.00000,0.00000
2L,9501,G,-,1,17.00000,17.00000,0.00000,0.00000,0.00000,0.00000
2L,9502,G,-,1,17.00000,17.00000,0.00000,0.00000,0.00000,0.00000
2L,9503,G,-,1,22.00000,22.00000,0.00000,0.00000,0.00000,0.00000
2L,9504,A,-,1,11.00000,11.00000,0.00000,0.00000,0.00000,0.00000
2L,9505,C,-,1,16.00000,16.00000,0.00000,0.00000,0.00000,0.00000
2L,9506,G,-,1,19.00000,19.00000,0.00000,0.00000,0.00000,0.00000
2L,9507,T,-,1,15.00000,15.00000,0.00000,0.00000,0.00000,0.00000
Which columns should I use if I am interested in using the EpiNano models to predict m6A modifications?
Thank you. Looking forward to hearing from you soon.
I have also tried using my own dataset (the one shown above) for both training and prediction.
This is the command that I used:
python ~/localenv/EpiNano/Epinano_Predict.py -t sample.minus_strand.per.site.csv -p sample.minus_strand.per.site.csv -o test_predict -cl 6
Here, column 6 is the mean base quality.
The following error came out instead:
ValueError: Location based indexing can only have [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array] types
I am not sure what the issue is this time. I will wait for your reply. Thank you for helping me debug this.
Hi @BioinfoHub-PeiQinNg ,
You can try rrach.q3.mis3.del3.linear.dump, i.e. a model that uses the middle base quality, mismatch, and deletion frequencies of the RRACH motif to predict modifications. Remember to filter for RRACH motifs only afterwards; that is why data organized in 5-mer mode will be helpful for you.
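The motif filtering mentioned above could look roughly like this. It is a sketch that works on plain 5-mer strings only; how the 5-mer is stored in your table is not assumed here, and `is_rrach` and `filter_rrach` are hypothetical helpers rather than EpiNano functions. It relies on the IUPAC codes R = A/G and H = A/C/T.

```python
import re

# IUPAC codes: R = A or G, H = A, C or T (U in RNA).
RRACH = re.compile(r"[AG][AG]AC[ACT]")


def is_rrach(kmer):
    """Return True if a 5-mer matches the RRACH motif."""
    # Treat RNA-style U as T so both alphabets are accepted.
    seq = kmer.upper().replace("U", "T")
    return RRACH.fullmatch(seq) is not None


def filter_rrach(kmers):
    """Keep only the 5-mers that match RRACH."""
    return [k for k in kmers if is_rrach(k)]


print(filter_rrach(["GGACT", "AAACA", "TGACT", "GGATC"]))
# prints ['GGACT', 'AAACA']
```

In practice the same check would be applied row by row to whichever column holds the 5-mer sequence in the prediction output.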
ValueError: Location based indexing can only have [integer, integer slice (
Can you tell me which line of the code gave you this error? Are you using Python 3? What is your pandas version (the error seems to be related to a pandas DataFrame)?
Hi Huan,
Thank you so much again for assisting me with this.
Traceback (most recent call last):
File "/home/ng/localenv/EpiNano/Epinano_Predict.py", line 90, in
Hope that helps with the debugging. Looking forward to hearing from you soon.
Hi @BioinfoHub-PeiQinNg, you have not labeled your samples with modification status when trying to train models. The script raised an error because it cannot find this column. Can you please follow the tutorial and try again?
Moreover, I also suggest installing the recommended version of Python and the required packages, as listed on the main README page.
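On the labeling point above, here is a rough sketch of appending a modification-status column to a feature table. The column name `sample` and the label value `unm` are placeholders chosen for illustration, and `add_label_column` is a hypothetical helper; the tutorial defines the actual format the training script expects.

```python
import csv
import io


def add_label_column(csv_text, label, column_name="sample"):
    """Append a modification-status column to every row of a feature table.

    Returns the new CSV text and the 1-based number of the added column.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header + [column_name])
    for row in body:
        writer.writerow(row + [label])
    return out.getvalue(), len(header) + 1


table = "#Ref,pos,base,q_mean\n2L,9499,T,17.00000\n2L,9500,A,20.00000\n"
labeled, label_col = add_label_column(table, "unm")
print(labeled)
print("label column number:", label_col)
```

The returned column number is the value you would then give to the script as the modification status column.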
Hi there,
I am trying to train my current dataset using rrach.q3.mis3.del3.linear.dump, as I only have one sample to test. Some questions regarding using Epinano_Predict.py:
1) Do I have to generate the 5mer.csv file using Slide_variant.py prior to the Epinano_Predict.py step? It does not seem to be very clear to me from the vignette.
2) I have tried running the following command on my samples:
/home/ng/localenv/EpiNano/Epinano_Predict.py -m /home/ng/localenv/EpiNano/models/rrach.q3.mis3.del3.linear.dump -p H10_Heads_R1_RNAD.minus_strand.per_site_var.5mer.csv -cl 8,13,23 -o H10_Heads_R1_RNAD.minus.predict
and it returns the following error message:
Traceback (most recent call last):
  File "/home/ng/localenv/EpiNano/Epinano_Predict.py", line 73, in <module>
    mod_col = int (args['modification_status_column']) - 1 if args['modification_status_column'] else None
ValueError: invalid literal for int() with base 10: '/home/ng/localenv/EpiNano/models/rrach.q3.mis3.del3.linear.dump'
I wonder what the issue is here?
3) Also, I do not seem to have columns 13 and 23 in my CSV files (there are only 11 columns in the per-site CSV and 14 columns in the 5mer.csv). May I know which columns I should use for my analysis, or am I missing something in my data?
Thank you for assisting me with this. Looking forward to hearing from you soon.
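A note on the traceback in question 2: it shows the model path ending up in `args['modification_status_column']`. One possible explanation, offered as a guess rather than a confirmed diagnosis, is that Python's `argparse` accepts unambiguous prefixes of option names, so a lowercase `-m` can be matched to a longer option that starts with `-m`. The sketch below reproduces that behavior with a toy parser; the flag names `-M` and `-mc` are assumptions made for illustration, so check the script's help output for the real ones.

```python
import argparse


def build_parser():
    """Toy parser; the flag names are hypothetical, not copied from EpiNano."""
    parser = argparse.ArgumentParser(prog="predict_sketch")
    parser.add_argument("-M", "--model")
    parser.add_argument("-mc", "--modification_status_column")
    parser.add_argument("-p", "--predict")
    return parser


# "-m" is not defined, but it is an unambiguous prefix of "-mc",
# so argparse assigns the value to modification_status_column.
args = vars(build_parser().parse_args(["-m", "model.dump", "-p", "features.csv"]))
print(args)

try:
    int(args["modification_status_column"])
except ValueError as err:
    # Same kind of failure as in the traceback above.
    print("ValueError:", err)
```

If that is what happened, passing the model with the exact flag shown in the help output should avoid the `int()` conversion error.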