Many Errors in Data and Test pipeline

Rungetf commented 3 years ago

Hi,

I wanted to run UFold on my own test data (the original TS1 data accompanying the SPOT-RNA publication). After putting the respective files in .bpseq format into the data/TS1 folder, I tried running the process_data_newdataset.py script to produce the cPickle files needed. I used the following command:

python process_data_newdataset.py data/TS1

which produces a NameError because one_hot_matrix is not defined. This results from an error in the awk call in line 55, which cannot find the files because the / between the directory and the file name is missing. Changing the command to

python process_data_newdataset.py data/TS1/

fixes the NameError but results in a ValueError in the list comprehension in line 69. I finally changed line 55 from

t0 = subprocess.getstatusoutput('awk \'{print $2}\' '+file_dir+item_file)

to

t0 = subprocess.getstatusoutput('awk \'{print $2}\' '+file_dir + '/' +item_file)

and ran the call without the trailing /, which fixes the errors. However, I then got stuck at the pdb.set_trace() call in line 127. Unfortunately, even without this call there is still a FileNotFoundError due to a hard-coded path in the final cPickle dump that needs to be fixed (setting the path to file_dir + '.cPickle' produces the desired output file).
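
The trailing-slash sensitivity described above comes from building paths by string concatenation. Below is a minimal sketch of one way to avoid it, assuming the usual three-column .bpseq layout (index, base, pairing partner). The helpers `read_bpseq_bases` and `pickle_path_for` are hypothetical names used for illustration and are not part of the UFold code. `os.path.join` gives the same result with or without a trailing /, and reading the second column directly in Python removes the dependency on awk.

```python
import os


def read_bpseq_bases(file_dir, item_file):
    """Return the sequence stored in column 2 of a .bpseq file.

    Hypothetical helper for illustration; not part of the UFold code base.
    """
    # os.path.join inserts a separator only when one is missing,
    # so 'data/TS1' and 'data/TS1/' resolve to the same file.
    path = os.path.join(file_dir, item_file)
    bases = []
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            # Skip blank lines and comment/header lines.
            if len(fields) < 3 or fields[0].startswith('#'):
                continue
            bases.append(fields[1])
    return ''.join(bases)


def pickle_path_for(file_dir):
    """Derive the output file name from the input folder (assumed naming scheme)."""
    # normpath drops a trailing separator: 'data/TS1/' -> 'data/TS1'
    return os.path.normpath(file_dir) + '.cPickle'
```

With helpers like these, both `python process_data_newdataset.py data/TS1` and `python process_data_newdataset.py data/TS1/` would read the same files and write to the same output name.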

After that I tried running the ufold_test.py script to evaluate the performance of UFold on the produced data but ran into similar issues:

- The call stops at pdb.set_trace()
- Hard-coded model paths don't match the provided models in the drive
  - The provided models are ufold_train_alldata.pt, ufold_train_pdbfinetune.pt, and ufold_train.pt
  - The code references unet_train_on_merge_alldata_98.pt and ufold_train_on_pdb_contrafold_pdbfinetune_99.pt in lines 229 and 231, respectively
- A ModuleNotFoundError when setting --nc True because e2efold cannot be found (import in line 25)
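
Two of the problems above (a checkpoint name that does not exist, and an import that fails for an optional feature) can be surfaced earlier with clearer messages. The following is only a sketch under assumptions: `resolve_checkpoint` and `optional_import` are hypothetical helpers, not functions from ufold_test.py, and the directory layout is whatever the user passes in.

```python
import importlib
import os


def resolve_checkpoint(model_dir, filename):
    """Return the checkpoint path, or raise listing the files that do exist."""
    path = os.path.join(model_dir, filename)
    if not os.path.isfile(path):
        # Show what is actually in the folder so a name mismatch is obvious.
        available = sorted(os.listdir(model_dir)) if os.path.isdir(model_dir) else []
        raise FileNotFoundError(
            f"Checkpoint {path!r} not found; files in {model_dir!r}: {available}"
        )
    return path


def optional_import(module_name):
    """Import a module only when needed; return None if it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:  # ModuleNotFoundError is a subclass of ImportError
        return None
```

A script could call `optional_import` only when the corresponding option is enabled and print an installation hint if the result is `None`, instead of failing at the top-level import.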

And maybe some more that I currently don't remember.

However, I finally managed to run the script on TS1, but the results were very poor with the provided models (F1 scores in the range of 3e-13). Probably there is more that needs to be fixed that I'm not aware of yet.
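
Scores of that magnitude can be sanity-checked independently of the evaluation script. Here is a minimal sketch, assuming the predicted and reference structures are available as lists of index pairs. The function name `pair_f1` is hypothetical, and this is a plain set-based F1, not necessarily the metric implementation used in UFold.

```python
def pair_f1(predicted_pairs, true_pairs):
    """F1 score over base pairs, treating (i, j) and (j, i) as the same pair."""
    predicted = {tuple(sorted(p)) for p in predicted_pairs}
    truth = {tuple(sorted(p)) for p in true_pairs}
    true_positives = len(predicted & truth)
    if true_positives == 0:
        # No overlap at all (also covers empty inputs): define F1 as 0.
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)


# One of two predicted pairs matches one of two reference pairs.
print(pair_f1([(1, 10), (2, 9)], [(10, 1), (3, 8)]))  # 0.5
```

If a hand check like this also gives values near zero, that would suggest the predictions share almost no pairs with the reference, which points to a data or model-loading problem.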

After that, I switched to the Webserver but got empty files for download with the first two sample sequences I tested (both .ct and dot-bracket files; with and without non-canonical pairs).

From a user perspective this was a very bad experience. I think code accompanying such a recent NAR publication should at least have example scripts that run out of the box.

Having said that, I'm looking forward to running your code once the issues have been fixed.

Best regards

sperfu commented 3 years ago

Hi there,

Thanks for reaching out and pointing out those bugs in our code, and sorry for the inconvenience they caused. We have fixed them as quickly as we could. Details are listed below.

Regarding the process_data_newdataset.py file, we did miss one '/' in code line 43. We have fixed that issue and commented out the pdb.set_trace() call. We have tested it on our server and it should be working fine now.
Regarding the ufold_test.py file, the problem was the pre-trained model names. We have resolved the conflict and renamed the models in the code to match the file names in the drive. We also fixed the --nc parameter fault. We have tested both to make sure they work fine now.
Thanks for pointing out this issue. For the TS1 test set, during our testing we filtered out the structures that contain protein complexes or multiple RNA chains, as our model is not trained on those complex scenarios. This same set of sequences is used for testing all the other methods in our manuscript. We reported the number of test sequences in a supplementary table, and we have uploaded our test file at the provided link.
Last but not least, our webserver is fully functional. The empty downloads you encountered were caused by a minor logic bug in the download panel, which we have now fixed. You will not be able to click the download button until you click the 'Show all data' button, which refreshes the output file with all existing results.

Overall, all the errors have been corrected on our end. We really appreciate your careful work and look forward to you using the tool.

Thanks

Rungetf commented 3 years ago

Hi, thanks for the fast reply and the fixes! I appreciate you linking the data and have just tested with the files you provided. Everything works fine now, thanks.

uci-cbcl / UFold

Many Errors in Data and Test pipeline #3