uci-cbcl / UFold

MIT License

got error when processing training data #24

Open sindax123 opened 1 year ago

sindax123 commented 1 year ago

Hi, dear developer. I got an error when processing the training data, using the TR0 data provided by MXfold2:

$ python process_data_newdataset.py TR0
Traceback (most recent call last):
  File "process_data_newdataset.py", line 69, in <module>
    pair_dict_all_list = [[int(item_tmp)-1,int(t2[1].split('\n')[index_tmp])-1] for index_tmp,item_tmp in enumerate(t1[1].split('\n')) if int(t2[1].split('\n')[index_tmp]) != 0]
  File "process_data_newdataset.py", line 69, in <listcomp>
    pair_dict_all_list = [[int(item_tmp)-1,int(t2[1].split('\n')[index_tmp])-1] for index_tmp,item_tmp in enumerate(t1[1].split('\n')) if int(t2[1].split('\n')[index_tmp]) != 0]
ValueError: invalid literal for int() with base 10: 'X'

I don't know exactly what the data is supposed to look like, so this error confuses me. Could you please tell me how to fix it? Thank you!
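The comprehension in the traceback calls int() on every entry of the position and partner columns, so a single non-numeric token aborts the run without saying which line caused it. Below is a minimal sketch of the same pairing with a more descriptive error. It assumes the input is bpseq-style text with three whitespace-separated columns (index, base, partner index, with 0 meaning unpaired). `parse_bpseq_pairs` is a hypothetical helper for illustration, not part of process_data_newdataset.py.

```python
def parse_bpseq_pairs(text):
    """Return zero-based [index, partner] pairs from bpseq-style text.

    Assumed layout (not confirmed by the script): "index base partner" per line.
    """
    pairs = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        # Skip blank lines and comment lines.
        if not fields or line.startswith("#"):
            continue
        if len(fields) != 3:
            raise ValueError(
                f"line {line_no}: expected 3 columns, got {len(fields)}: {line!r}"
            )
        try:
            index, partner = int(fields[0]), int(fields[2])
        except ValueError:
            raise ValueError(
                f"line {line_no}: non-integer index or partner: {line!r}"
            ) from None
        # A partner of 0 marks an unpaired base, as in the original condition.
        if partner != 0:
            pairs.append([index - 1, partner - 1])
    return pairs
```

Called on the contents of a file that is not really a structure file, this raises an error that quotes the offending line, which makes it easier to see what the file actually contains.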

sindax123 commented 1 year ago

When I printed t0, t1, and t2 in the code, some of the files were processed successfully, while for others t0, t1, and t2 were, respectively, (0, 'OS'), (0, '\x00\x05\x16\x07\x00\x02\x00\x00Mac'), and (0, 'X').

sperfu commented 1 year ago

Hi there,

We used this script to process training data in several different formats, so we may have altered parts of process_data_newdataset.py along the way. One solution is to find out what the data consists of by loading those files with pickle (a Python package) and checking their exact contents. I hope that works.

Thanks

sindax123 commented 1 year ago

Thank you for your reply! I checked the contents of the data and found that some of it is invalid: it yields "OS" instead of an RNA sequence, and this affects at least half of the dataset. I wonder whether this is normal or whether something is wrong with my dataset. If something is wrong with my dataset, where else can I get the data?

sperfu commented 1 year ago

I wonder whether this is a format issue related to the operating system (note the "OS" and "Mac" strings); it seems you used macOS to handle those files. We processed the files on Linux (Ubuntu), so that may be worth checking. If that doesn't solve your problem, you can refer to the MXfold2 paper; its authors also provide these datasets.
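The byte string reported earlier in the thread, '\x00\x05\x16\x07\x00\x02\x00\x00Mac', begins with the AppleDouble magic number 0x00051607 followed by version 0x00020000. AppleDouble files are the "._name" companions that macOS writes when it stores file metadata on a volume or in an archive that cannot hold it natively. Below is a minimal sketch of how such a file could be detected by its first bytes. The helper names and the sample header are illustrative assumptions, not part of UFold.

```python
# Magic number that starts an AppleDouble file.
APPLEDOUBLE_MAGIC = b"\x00\x05\x16\x07"

# Illustrative first 24 bytes: magic, version 2, then a 16-byte filler field.
SAMPLE_HEADER = APPLEDOUBLE_MAGIC + b"\x00\x02\x00\x00" + b"Mac OS X        "


def looks_like_appledouble(head: bytes) -> bool:
    """True if the given leading bytes start with the AppleDouble magic."""
    return head.startswith(APPLEDOUBLE_MAGIC)


def is_appledouble_file(path: str) -> bool:
    """Read only the first four bytes of a file and test them."""
    with open(path, "rb") as fh:
        return looks_like_appledouble(fh.read(4))


# Splitting the sample header on whitespace gives the same three tokens
# that were printed for t0, t1 and t2 (in some order):
# ['\x00\x05\x16\x07\x00\x02\x00\x00Mac', 'OS', 'X']
header_fields = SAMPLE_HEADER.decode("latin-1").split()
```

If `is_appledouble_file` returns True for the files that fail, they are metadata companions and not structure data, so they can be skipped safely.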

sindax123 commented 1 year ago

Thank you for your reply! I think I have figured out the problem by double-checking the data. In the TR0 folder I downloaded, each RNA sequence comes with two files, named "._bpRNA_XXXXX" and "bpRNA_XXXX" respectively. I suppose it can be fixed by adding a filter condition when selecting the files.
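One way such a filter condition might look is sketched below. It assumes the script collects file names from a single directory. `is_real_data_file` and `list_data_files` are hypothetical names, and how process_data_newdataset.py actually builds its file list may differ.

```python
from pathlib import Path


def is_real_data_file(name: str) -> bool:
    """Reject macOS metadata companions, whose names start with '._'."""
    return not name.startswith("._")


def list_data_files(data_dir: str) -> list:
    """Return sorted names of regular files in data_dir, skipping '._*' files."""
    return sorted(
        p.name
        for p in Path(data_dir).iterdir()
        if p.is_file() and is_real_data_file(p.name)
    )
```

With this in place, `list_data_files("TR0")` would return only the "bpRNA_..." entries, so the parsing step never sees the "._bpRNA_..." companions.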