Issues with using Epinano_predict.py - unable to use models to train datasets

novoalab / EpiNano

Detection of RNA modifications from Oxford Nanopore direct RNA sequencing reads (Liu*, Begik* et al., Nature Comm 2019)

GNU General Public License v2.0

Issues with using Epinano_predict.py - unable to use models to train datasets #114

Closed BioinfoHub-PeiQinNg closed 2 years ago

BioinfoHub-PeiQinNg commented 2 years ago

Hi there,

I am trying to train my current dataset using rrach.q3.mis3.del3.linear.dump, as I only have one sample to test.

I have some questions about using Epinano_Predict.py:

1) Do I have to generate the 5mer.csv file with Slide_variant.py before the Epinano_Predict.py step? This is not clear to me from the vignette.

2) I have tried running the following command on my samples:

`/home/ng/localenv/EpiNano/Epinano_Predict.py -m /home/ng/localenv/EpiNano/models/rrach.q3.mis3.del3.linear.dump -p H10_Heads_R1_RNAD.minus_strand.per_site_var.5mer.csv -cl 8,13,23 -o H10_Heads_R1_RNAD.minus.predict`

It returns the following error message:

Traceback (most recent call last):
  File "/home/ng/localenv/EpiNano/Epinano_Predict.py", line 73, in <module>
    mod_col = int (args['modification_status_column']) - 1 if args['modification_status_column'] else None
ValueError: invalid literal for int() with base 10: '/home/ng/localenv/EpiNano/models/rrach.q3.mis3.del3.linear.dump'

What is the issue here?

3) Also, I do not seem to have columns 13 and 23 in my csv files (there are only 11 columns in the csv and 14 columns in the 5mer.csv). Which columns should I use for my analysis, or am I missing something in my data?
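The traceback in question 2 shows the model path arriving in `args['modification_status_column']`, which suggests that `-m` was not parsed as the model option. The following is a minimal sketch of how that can happen with Python's argparse, assuming the script defines a multi-character short flag such as `-mc` for the status column and no exact `-m` flag. That is a guess, not something confirmed in this thread; the parser below is hypothetical and only mimics that situation. The script's `-h` output would show the real flag names.

```python
import argparse

# Hypothetical parser: the flag names are assumptions for illustration,
# not copied from Epinano_Predict.py.
parser = argparse.ArgumentParser()
parser.add_argument("-M", "--model")
parser.add_argument("-mc", "--modification_status_column")
parser.add_argument("-p", "--predict")

# argparse accepts unambiguous prefixes of option strings by default,
# so "-m" is resolved to "-mc" rather than to "-M".
args = vars(parser.parse_args(["-m", "model.dump", "-p", "sites.csv"]))
print(args["model"])                       # None
print(args["modification_status_column"])  # model.dump

# The same conversion as on line 73 of the traceback then fails on the path.
try:
    mod_col = int(args["modification_status_column"]) - 1
except ValueError as err:
    print(err)
```

If this is what happened, spelling out the long option name for the model (whatever the help text lists) would avoid the ambiguity.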

Thank you for assisting me with this. Looking forward to hearing from you soon.

Huanle commented 2 years ago

Hi @BioinfoHub-PeiQinNg ,

Thanks for using epinano. Regarding training your own models, there is a tutorial here for your reference.

Regarding organizing feature data (aka question 2.1): you have total flexibility in how you arrange your data, as long as you tell the program which columns hold the feature(s) you want to use for training or prediction. For instance, if you want to use a single per-site feature rather than a combination of that feature across consecutive genomic/transcriptomic positions, there is no need to slide the single-site variant feature table into kmer mode. As for your question 2.2: it seems to me you have entered the wrong column index (number). Can you double-check whether the column numbers you chose actually contain the features you are interested in? Can you also show the head of your input file so that I can help debug it?
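To illustrate what "kmer mode" means here, the following is a rough sketch of sliding a per-site table into 5-mer windows. It is not the Slide_variant.py implementation and does not reproduce its output columns; the function name and record layout are made up for illustration, and strand orientation is ignored. The values come from the per-site table posted in this thread.

```python
def slide_windows(rows, k=5):
    """Return one record per run of k consecutive sites on the same reference."""
    windows = []
    for i in range(len(rows) - k + 1):
        win = rows[i:i + k]
        same_ref = all(r["ref"] == win[0]["ref"] for r in win)
        consecutive = all(b["pos"] - a["pos"] == 1 for a, b in zip(win, win[1:]))
        if same_ref and consecutive:
            windows.append({
                "ref": win[0]["ref"],
                "center": win[k // 2]["pos"],             # middle position of the window
                "kmer": "".join(r["base"] for r in win),  # bases in table order
                "q_mean": [r["q_mean"] for r in win],     # one value per position
            })
    return windows

# (pos, base, q_mean) for reference 2L, taken from the posted table.
sites = [
    (9499, "T", 17.0), (9500, "A", 20.0), (9501, "G", 17.0),
    (9502, "G", 17.0), (9503, "G", 22.0), (9504, "A", 11.0),
    (9505, "C", 16.0), (9506, "G", 19.0), (9507, "T", 15.0),
]
rows = [{"ref": "2L", "pos": p, "base": b, "q_mean": q} for p, b, q in sites]

for w in slide_windows(rows):
    print(w["center"], w["kmer"], w["q_mean"])
```

Each window carries one feature value per position, which is why a k-mer table has more feature columns than the per-site table it was built from.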

Based on your last question: it seems question 2.2 really does come down to the wrong choice of column numbers. The choice of column numbers should depend on the model that you use to make predictions. For instance, if you are using a model trained on base quality, then you have to choose the column that contains this feature.
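A small sketch for looking up column numbers by feature name, so they do not have to be counted by hand. It uses the per-site header posted in this thread and assumes the column numbers are 1-based, which matches the `- 1` on line 73 of the traceback. The helper name is hypothetical and not part of EpiNano.

```python
header = "#Ref,pos,base,strand,cov,q_mean,q_median,q_std,mis,ins,del"

def column_numbers(header_line, features):
    """Map feature names to 1-based column numbers in a comma-separated header."""
    names = header_line.lstrip("#").split(",")
    return {name: names.index(name) + 1 for name in features}

print(column_numbers(header, ["q_mean", "mis", "del"]))
# {'q_mean': 6, 'mis': 9, 'del': 11}
```

The same lookup can be run against the header of a 5mer.csv file to see which columns hold the features a given model was trained on.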

I hope this clears up your doubts. Let me know if you need further help.

Best,

BioinfoHub-PeiQinNg commented 2 years ago

Hi Huan,

Thank you so much for getting back to me.

Instead of training my own models, is it possible for me to use the models in EpiNano for my dataset directly?
Please find the head of my minus_strand.per.site.csv input file below:

#Ref,pos,base,strand,cov,q_mean,q_median,q_std,mis,ins,del
2L,9499,T,-,1,17.00000,17.00000,0.00000,0.00000,0.00000,0.00000
2L,9500,A,-,1,20.00000,20.00000,0.00000,0.00000,0.00000,0.00000
2L,9501,G,-,1,17.00000,17.00000,0.00000,0.00000,0.00000,0.00000
2L,9502,G,-,1,17.00000,17.00000,0.00000,0.00000,0.00000,0.00000
2L,9503,G,-,1,22.00000,22.00000,0.00000,0.00000,0.00000,0.00000
2L,9504,A,-,1,11.00000,11.00000,0.00000,0.00000,0.00000,0.00000
2L,9505,C,-,1,16.00000,16.00000,0.00000,0.00000,0.00000,0.00000
2L,9506,G,-,1,19.00000,19.00000,0.00000,0.00000,0.00000,0.00000
2L,9507,T,-,1,15.00000,15.00000,0.00000,0.00000,0.00000,0.00000

Which columns should I use if I am interested in using the EpiNano models to predict m6A modifications?

Thank you. Looking forward to hearing from you soon.

BioinfoHub-PeiQinNg commented 2 years ago

I have also tried using my own dataset (the one shown above) for both training and prediction.

This is the command I used, where column 6 is the mean base quality:

`python ~/localenv/EpiNano/Epinano_Predict.py -t sample.minus_strand.per.site.csv -p sample.minus_strand.per.site.csv -o test_predict -cl 6`

The following error came out instead:

ValueError: Location based indexing can only have [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array] types

I am not sure what the issue is this time. I will wait for your reply. Thank you for helping me debug this.

Huanle commented 2 years ago

Hi @BioinfoHub-PeiQinNg ,

You can try rrach.q3.mis3.del3.linear.dump, i.e., the model that uses the base quality, mismatch frequency, and deletion frequency at the middle base of the RRACH motif to predict modifications. Remember to filter for RRACH motifs afterwards; that is why organizing your data in 5mer mode will be helpful.
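The RRACH filtering step can be sketched as below, using the IUPAC codes R = A/G and H = A/C/T. This assumes the 5-mers are already written in transcript orientation, which may need checking for minus-strand records. The function name is made up for illustration.

```python
import re

# IUPAC codes: R = A or G, H = A, C or T (U in RNA).
RRACH = re.compile(r"[AG][AG]AC[ACTU]")

def is_rrach(kmer):
    """Return True if the 5-mer matches the RRACH motif."""
    return RRACH.fullmatch(kmer.upper()) is not None

kmers = ["GGACT", "AAACA", "TGACT", "GGACG", "GGGCT"]
print([k for k in kmers if is_rrach(k)])  # ['GGACT', 'AAACA']
```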

Huanle commented 2 years ago

ValueError: Location based indexing can only have [integer, integer slice (

Can you tell me which line of the code gave you this error? Are you using python3? What is your pandas version (it seems to be related to pandas dataframe)?

BioinfoHub-PeiQinNg commented 2 years ago

Hi Huan,

Thank you so much again for assisting me with this.

In terms of the line number, here is the full error message for more detail:

`The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ng/localenv/EpiNano/Epinano_Predict.py", line 90, in <module>
    Y = df_tmp.iloc[:,mod_col]
  File "/home/ng/localenv/miniconda3/envs/Epinano1.2/lib/python3.9/site-packages/pandas/core/indexing.py", line 961, in __getitem__
    return self._getitem_tuple(key)
  File "/home/ng/localenv/miniconda3/envs/Epinano1.2/lib/python3.9/site-packages/pandas/core/indexing.py", line 1458, in _getitem_tuple
    tup = self._validate_tuple_indexer(tup)
  File "/home/ng/localenv/miniconda3/envs/Epinano1.2/lib/python3.9/site-packages/pandas/core/indexing.py", line 771, in _validate_tuple_indexer
    raise ValueError(
ValueError: Location based indexing can only have [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array] types`

These are the versions of Python and pandas that I am using to run EpiNano1.2: Python 3.9.7 and pandas 1.4.1.

Hope that helps with the debugging. Looking forward to hearing from you soon.

Huanle commented 2 years ago

Hi @BioinfoHub-PeiQinNg, you have not labeled your samples with modification status when trying to train models. The script yielded errors because it cannot find this column. Can you please follow the tutorial and try again?
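For context, line 73 in the first traceback of this thread sets `mod_col` to `None` when no status column is given, and line 90 then passes it to `df_tmp.iloc[:, mod_col]`, which pandas rejects. A hypothetical pre-flight check along these lines (not part of EpiNano; the function name and messages are made up) would surface the missing label column earlier:

```python
def resolve_mod_col(raw_value, n_columns, training):
    """Turn a 1-based column number into a 0-based index, failing early."""
    if not raw_value:
        if training:
            raise ValueError(
                "training requires a modification-status column, but none was given"
            )
        return None  # prediction only: no label column is needed
    index = int(raw_value) - 1
    if not 0 <= index < n_columns:
        raise ValueError(f"column {raw_value} is outside the table (1..{n_columns})")
    return index

print(resolve_mod_col("12", 12, training=True))   # 11
print(resolve_mod_col(None, 11, training=False))  # None

try:
    resolve_mod_col(None, 11, training=True)
except ValueError as err:
    print(err)
```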

I also suggest installing the recommended version of Python and the required packages listed on the main README page.