novoalab / EpiNano

Detection of RNA modifications from Oxford Nanopore direct RNA sequencing reads (Liu*, Begik* et al., Nature Comm 2019)
GNU General Public License v2.0

SVM.py index out of bounds #46

Closed istolarek closed 4 years ago

istolarek commented 4 years ago

Dear all,

When I try to use EpiNano (all versions) with the pre-built model on the test sample, the code returns this error:

Commad:  SVM.py -a -M M6A.mis3.del3.q3.poly.dump -p sample1.csv -cl 7,12,22 -mc 28 -o pretrained.prediction
Traceback (most recent call last):
  File "SVM.py", line 95, in <module>
    names = list (predict_df.columns[cols])
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/base.py", line 3940, in __getitem__
    result = getitem(key)
IndexError: index 11 is out of bounds for axis 0 with size 11

Kind regards

Huanle commented 4 years ago

@istolarek, thanks for using EpiNano. Would you mind sharing your input file with me?

istolarek commented 4 years ago

The error happened when using the example files provided with the package (any combination, the same error with this command).

Huanle commented 4 years ago

Hi @istolarek, sorry for the late reply. The error arose because the input feature table has only 11 columns, while the command you used told the program to find features in columns 7, 12, and 22 and the modification status in column 28.
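The mismatch described above can be checked before running SVM.py. The following is a minimal sketch, not part of EpiNano: `out_of_range_columns` is a hypothetical helper that takes the number of columns in the table and the 1-based column numbers passed to `-cl` and `-mc`, and reports the ones that do not exist.

```python
def out_of_range_columns(n_columns, requested):
    """Return the 1-based column numbers that do not exist in a table
    with n_columns columns (hypothetical helper, not part of EpiNano)."""
    return [c for c in requested if not 1 <= c <= n_columns]


# An 11-column table queried with -cl 7,12,22 and -mc 28, as in the report.
missing = out_of_range_columns(11, [7, 12, 22, 28])
print(missing)  # [12, 22, 28]
```

Column 12 corresponds to zero-based index 11, which matches the `index 11 is out of bounds` message, assuming SVM.py converts the 1-based numbers to 0-based positions.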

The example sample[12].csv files can be used to play with SVM.py to train a model and make predictions. For instance:

python3.6 SVM.py -a -p sample1.csv -t sample2.csv -cl 3,8 -mc 11 -o train_and_predict

This command trains (-t) a model with sample2.csv and then makes predictions (-p) on sample1.csv. Since the modification status is already known in sample1.csv, you can also estimate the prediction accuracy (-a). The features used for training come from columns 3 and 8, which are q3 and mis3.

Of course, you can make predictions directly with already trained models (-M), but you have to make sure the corresponding features are present in your input file.

Hope this helps. I look forward to helping more if possible.

istolarek commented 4 years ago

Great!

I'm most interested in applying the already trained model on my sample (see attached). Could you point me to the columns from that file, that I need to pass in the command?

Also, how well will this approach work, given that the file I sent you is from viral RNA sequencing? Or what are the alternatives?

Kmer,Window,Ref,Coverage,q1,q2,q3,q4,q5,mis1,mis2,mis3,mis4,mis5,ins1,ins2,ins3,ins4,ins5,del1,del2,del3,del4,del5
TAAAG,9837:9838:9839:9840:9841,NC_045512.2,197.0:197.0:197.0:197.0:197.0,14.45596,17.72340,19.35052,21.71429,21.13706,0.015228426395939087,0.01015228426395939,0.005076142131979695,0.01015228426395939,0.01015228426395939,0.08629441624365482,0.04568527918781726,0.015228426395939087,0.02030456852791878,0.04568527918781726,0.02030456852791878,0.04568527918781726,0.015228426395939087,0.005076142131979695,0.0
CAGTT,5901:5902:5903:5904:5905,NC_045512.2,451.0:451.0:451.0:451.0:451.0,17.62971,15.37054,24.81111,22.24306,18.89116,0.037694013303769404,0.11086474501108648,0.0066518847006651885,0.03325942350332594,0.019955654101995565,0.04656319290465632,0.03325942350332594,0.024390243902439025,0.015521064301552107,0.008869179600886918,0.0,0.0066518847006651885,0.0022172949002217295,0.04212860310421286,0.022172949002217297
AATGT,7088:7089:7090:7091:7092,NC_045512.2,276.0:276.0:277.0:277.0:278.0,21.34074,21.20364,22.79710,22.02166,14.71805,0.007246376811594203,0.014492753623188406,0.0036101083032490976,0.0,0.014388489208633094,0.036231884057971016,0.0036231884057971015,0.010830324909747292,0.11913357400722022,0.11151079136690648,0.021739130434782608,0.0036231884057971015,0.0036101083032490976,0.0,0.04316546762589928
CCATT,19226:19227:19228:19229:19230,NC_045512.2,1147.0:1147.0:1147.0:1148.0:1149.0,20.89151,17.89520,16.68318,17.40478,24.97805,0.0034873583260680036,0.0017436791630340018,0.004359197907585004,0.016550522648083623,0.005221932114882507,0.008718395815170008,0.015693112467306015,0.02005231037489102,0.016550522648083623,0.01392515230635335,0.0034873583260680036,0.0017436791630340018,0.06713164777680906,0.05313588850174216,0.008703220191470844
AGGTG,25060:25061:25062:25063:25064,NC_045512.2,6103.0:6104.0:6105.0:6106.0:6106.0,20.43233,26.01695,27.54410,23.18556,26.73913,0.013436015074553498,0.00901048492791612,0.0019656019656019656,0.002947920078611202,0.0022928267278087126,0.049320006554153695,0.02621231979030144,0.011957411957411958,0.01621356043236161,0.022764493940386505,0.0024578076355890547,0.023591087811271297,0.000819000819000819,0.0018015067147068458,0.0018015067147068458
CGGCA,3241:3242:3243:3244:3245,NC_045512.2,541.0:541.0:541.0:541.0:541.0,19.19778,20.41418,19.95336,20.74627,15.91760,0.005545286506469501,0.0,0.031423290203327174,0.022181146025878003,0.08687615526802218,0.0036968576709796672,0.0166358595194085,0.0166358595194085,0.031423290203327174,0.036968576709796676,0.0,0.009242144177449169,0.009242144177449169,0.009242144177449169,0.012939001848428836
ACGTC,391:392:393:394:395,NC_045512.2,843.0:843.0:843.0:844.0:846.0,14.77091,15.26196,18.84597,17.73454,14.94430,0.010676156583629894,0.0071174377224199285,0.004744958481613286,0.011848341232227487,0.06028368794326241,0.009489916963226572,0.0071174377224199285,0.03202846975088968,0.037914691943127965,0.00591016548463357,0.07829181494661921,0.00830367734282325,0.014234875444839857,0.08056872037914692,0.06619385342789598
TGTTA,9816:9817:9818:9819:9820,NC_045512.2,198.0:198.0:198.0:198.0:198.0,22.51515,22.10101,17.77778,19.33854,18.24118,0.015151515151515152,0.025252525252525252,0.025252525252525252,0.020202020202020204,0.020202020202020204,0.0,0.005050505050505051,0.015151515151515152,0.045454545454545456,0.050505050505050504,0.0,0.0,0.045454545454545456,0.030303030303030304,0.1414141414141414
ATGAC,29866:29867:29868:29869:29870,NC_045512.2,47472.0:47200.0:46266.0:23150.0:22003.0,15.51054,15.48072,18.09957,9.94369,8.50521,0.003981294236602629,0.0055720338983050845,0.0008645657718410929,0.004578833693304535,0.02745080216334136,0.003159757330637007,0.003072033898305085,0.003501491375956426,0.0019438444924406047,0.0009998636549561424,0.0007583417593528817,0.001016949152542373,0.0007564950503609562,0.004362850971922246,0.0005453801754306231
TATTG,6232:6233:6234:6235:6236,NC_045512.2,446.0:446.0:447.0:449.0:449.0,16.05801,15.94837,13.51622,22.08145,20.10000,0.01569506726457399,0.026905829596412557,0.05592841163310962,0.015590200445434299,0.0066815144766146995,0.04708520179372197,0.03139013452914798,0.015659955257270694,0.026726057906458798,0.051224944320712694,0.18834080717488788,0.17488789237668162,0.17225950782997762,0.015590200445434299,0.0200445434298441
AGAAT,15946:15947:15948:15949:15950,NC_045512.2,566.0:566.0:566.0:566.0:566.0,22.60601,23.56714,25.48825,18.76855,11.11723,0.00530035335689046,0.0,0.0035335689045936395,0.0017667844522968198,0.0035335689045936395,0.019434628975265017,0.04770318021201413,0.05123674911660778,0.08480565371024736,0.08657243816254417,0.0,0.0,0.022968197879858657,0.0,0.00530035335689046
AATCA,29683:29684:29685:29686:29687,NC_045512.2,52497.0:52671.0:52798.0:52918.0:53517.0,19.54042,17.83650,17.17247,15.44411,15.06646,0.01636283978132084,0.013233088416775835,0.032955793780067424,0.04437053554556106,0.04665807126707401,0.03384955330780807,0.02701676444343187,0.029110951172392895,0.030537813220454287,0.03268120410336902,0.04097376992971027,0.0072715536063488444,0.006269176862759953,0.055123020522317545,0.010557392977932246
TTACA,11508:11509:11510:11511:11512,NC_045512.2,244.0:244.0:244.0:244.0:244.0,17.93277,14.42975,11.49770,16.24034,16.79018,0.02459016393442623,0.004098360655737705,0.06557377049180328,0.03278688524590164,0.04918032786885246,0.10245901639344263,0.045081967213114756,0.02459016393442623,0.012295081967213115,0.028688524590163935,0.02459016393442623,0.00819672131147541,0.11065573770491803,0.045081967213114756,0.08196721311475409

istolarek commented 4 years ago

Something is still going wrong.

I used example files 1 and 2, ran the command

python3.6 SVM.py -a -p sample1.csv -t sample2.csv -cl 3,8 -mc 11 -o train_and_predict

and got

Commad: SVM.py -a -p sample1.csv -t sample2.csv -cl 3,8 -mc 11 -o train_and_predict
Colunms-used: 3,8
output: train_and_predict.mis3.q3.SVM
Traceback (most recent call last):
  File "SVM.py", line 118, in <module>
    X_train, _, y_train, _, indices_train, _ = train_test_split(X,Y.values.ravel(), indices, test_size=0, random_state= 100)
  File "/usr/local/lib/python3.6/dist-packages/sklearn/model_selection/_split.py", line 2100, in train_test_split
    default_test_size=0.25)
  File "/usr/local/lib/python3.6/dist-packages/sklearn/model_selection/_split.py", line 1734, in _validate_shuffle_split
    '(0, 1) range'.format(test_size, n_samples))
ValueError: test_size=0 should be either positive and smaller than the number of samples 19817 or a float in the (0, 1) range

Huanle commented 4 years ago

Hi @istolarek, regarding your first question: given the input file you showed here, and assuming you will use the model that was already trained on `q3,mis3,del3`, the columns should be specified as `-cl 7,12,22`. If you use a different model trained with different features, just change the column numbers accordingly. For your convenience, you can easily determine the column numbers for certain features in a Pythonic way, as shown below:

>>> h = '#Kmer,Window,Ref,Coverage,q1,q2,q3,q4,q5,mis1,mis2,mis3,mis4,mis5,ins1,ins2,ins3,ins4,ins5,del1,del2,del3,del4,del5'.split(',')
>>> h.index('q3')+1
7
>>> h.index('mis3')+1
12
>>> h.index('del3')+1
22
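The same lookup can be wrapped in a small function that also reports features missing from the header. This is a sketch under assumptions: `feature_columns` is a hypothetical helper, not part of EpiNano, and it expects a comma-separated header with an optional leading `#`.

```python
def feature_columns(header_line, features):
    """Map feature names to 1-based column numbers in a comma-separated
    header (hypothetical helper, not part of EpiNano)."""
    names = header_line.lstrip("#").strip().split(",")
    positions = {name: i for i, name in enumerate(names, start=1)}
    missing = [f for f in features if f not in positions]
    if missing:
        raise KeyError(f"features not found in header: {missing}")
    return [positions[f] for f in features]


header = ("#Kmer,Window,Ref,Coverage,q1,q2,q3,q4,q5,mis1,mis2,mis3,mis4,mis5,"
          "ins1,ins2,ins3,ins4,ins5,del1,del2,del3,del4,del5")
cols = feature_columns(header, ["q3", "mis3", "del3"])
cl_argument = ",".join(str(c) for c in cols)
print(cl_argument)  # 7,12,22
```

Raising an error for a missing feature makes it obvious when an input table does not contain the features a pre-trained model expects.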

As to whether you should use EpiNano for your specific type of data, it is really hard to know before examining your data in more detail. But if you have both modified and unmodified samples available, I am pretty sure you can give it a go by first generating the features with the EpiNano pipeline and plotting them to see whether they can be used to distinguish your samples.

As for your second question, it seems to be an issue associated with sklearn rather than SVM.py itself. May I know the versions of the packages you are using?
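For context, the `ValueError` in the traceback is raised because the installed scikit-learn rejects `test_size=0`. That call apparently serves to shuffle the rows while keeping all of them for training. A minimal standard-library sketch of shuffling without a split follows; `shuffle_together` is a hypothetical helper, the data are made-up toy values, and this is not the fix adopted in this thread. scikit-learn's `sklearn.utils.shuffle` offers comparable behaviour.

```python
import random


def shuffle_together(*arrays, seed=100):
    """Shuffle several equal-length sequences with one shared permutation,
    keeping every element (hypothetical helper, not part of SVM.py)."""
    if not arrays:
        return []
    n = len(arrays[0])
    if any(len(a) != n for a in arrays):
        raise ValueError("all sequences must have the same length")
    order = list(range(n))
    random.Random(seed).shuffle(order)  # seeded for reproducibility
    return [[a[i] for i in order] for a in arrays]


# Toy feature rows and labels, invented for illustration only.
X = [[0.1, 20.0], [0.3, 15.0], [0.0, 25.0], [0.2, 18.0]]
y = [1, 1, 0, 0]
indices = list(range(len(X)))
X_train, y_train, indices_train = shuffle_together(X, y, indices)
```

Because one permutation is applied to all sequences, each shuffled row stays aligned with its label and its original index.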

istolarek commented 4 years ago

The scikit-learn version is 0.21.3
The numpy version is 1.16.3
The pandas version is 1.0.3
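Version information like this can be gathered in one step. The sketch below uses `importlib.metadata` from the standard library, which requires Python 3.8 or later (newer than the python3.6 used in this thread); `installed_versions` is a hypothetical helper name.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(["scikit-learn", "numpy", "pandas"]).items():
    print(f"{name}: {version or 'not installed'}")
```

The printed versions can then be compared with those listed in the project's prerequisites.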

Huanle commented 4 years ago

Hi @istolarek, can you create a virtual environment and install the packages with the versions specified here (https://github.com/enovoa/EpiNano#getting-started-and-pre-requisites)? I have tried it out and did not run into any error.

istolarek commented 4 years ago

Thank you, this helped! The numpy version was the main issue.
