uci-cbcl / UFold


Issues training custom model #4

Open marcellszi opened 2 years ago

marcellszi commented 2 years ago

I've attempted to reproduce some of the results from the paper:

L. Fu, Y. Cao, J. Wu, Q. Peng, Q. Nie, and X. Xie, "UFold: fast and accurate RNA secondary structure prediction with deep learning", Nucleic Acids Research, p. gkab1074, Nov. 2021, doi: 10.1093/nar/gkab1074.

I attempted to re-train UFold on a custom dataset, but ran into some issues, and have a few questions I'm hoping you can help clear up.

Running testing script

After following the installation instructions, I attempted to check if the installation succeeded (by running python ufold_test.py --test_files TS2), which resulted in the following traceback:

THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1579022034529/work/torch/csrc/cuda/Module.cpp line=59 error=101 : invalid device ordinal
Traceback (most recent call last):
  File "ufold_test.py", line 341, in <module>
    main()
  File "ufold_test.py", line 204, in main
    torch.cuda.set_device(2)
  File "/home/usr/anaconda3/envs/UFold/lib/python3.6/site-packages/torch/cuda/__init__.py", line 292, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (101) : invalid device ordinal at /opt/conda/conda-bld/pytorch_1579022034529/work/torch/csrc/cuda/Module.cpp:59

I suspected that this issue was simply due to hardcoded CUDA device indices that don't exist on my machine, so I made the following changes to ufold_test.py:

72c72
<     device = torch.device("cuda:1" if torch.cuda.is_available() else "cpu")
---
>     device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
204c204
<     torch.cuda.set_device(2)
---
>     torch.cuda.set_device(0)
257c257
<     device = torch.device("cuda:1" if torch.cuda.is_available() else "cpu")
---
>     device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
322c322
<     contact_net.load_state_dict(torch.load(MODEL_SAVED,map_location='cuda:1'))
---
>     contact_net.load_state_dict(torch.load(MODEL_SAVED,map_location='cuda:0'))

After this, the test script runs without issues.
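
For anyone hitting the same problem, a more general fix than hand-editing each device index is to select the device once and reuse it everywhere. The sketch below is only illustrative: the --device flag is not part of UFold's argument parser, and contact_net/MODEL_SAVED are just the names used in ufold_test.py.

import argparse
import torch

# Hypothetical --device flag; UFold's scripts hardcode the device index instead.
parser = argparse.ArgumentParser()
parser.add_argument('--device', default='cuda:0' if torch.cuda.is_available() else 'cpu',
                    help="e.g. 'cuda:0', 'cuda:1', or 'cpu'")
args = parser.parse_args()
device = torch.device(args.device)
if device.type == 'cuda':
    torch.cuda.set_device(device)
# The checkpoint can then be loaded onto whichever device was selected, e.g.
# contact_net.load_state_dict(torch.load(MODEL_SAVED, map_location=device))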

Running training script

I then attempted to train UFold on some other datasets, starting with the provided TS0 as an example. First, I made similar fixes to the hardcoded CUDA devices within ufold_train.py. Then, when trying to train the model via python ufold_train.py --train_files TS0, I ran into more issues.

The code contains several breakpoints (pdb.set_trace()), which I assume were simply left over from debugging. However, continuing through the breakpoints results in the following traceback:

Traceback (most recent call last):
  File "ufold_train.py", line 220, in <module>
    main()
  File "ufold_train.py", line 189, in main
    train_merge = Dataset_FCN_merge(train_data_list)
  File "/home/usr/ufold2/UFold/ufold/data_generator.py", line 609, in __init__
    self.data = self.merge_data(data_list)
  File "/home/usr/ufold2/UFold/ufold/data_generator.py", line 617, in merge_data
    self.data2.data_x = np.concatenate((data_list[0].data_x,data_list[1].data_x),axis=0)
IndexError: list index out of range

I believe this is because you must provide two datasets as arguments, e.g. python ufold_train.py --train_files dataset_A dataset_B. However, it's not clear to me why this is the case. Is one of the datasets used for pre-training?
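
For reference, the failing line concatenates data_list[0].data_x and data_list[1].data_x directly, so a single dataset always raises IndexError. A merge that handles any number of datasets could look roughly like the sketch below (merge_data_x is a made-up name; only data_x and data_list come from the traceback).

import numpy as np

def merge_data_x(data_list):
    # Concatenate data_x from however many datasets were passed in, instead of
    # indexing data_list[0] and data_list[1] (which assumes exactly two).
    return np.concatenate([d.data_x for d in data_list], axis=0)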

Finally, the training runs through an epoch and then fails, due to what I assume is a hardcoded save path, with the following traceback:

Traceback (most recent call last):
  File "ufold_train.py", line 220, in <module>
    main()
  File "ufold_train.py", line 210, in main
    train(contact_net,train_merge_generator,epoches_first)
  File "ufold_train.py", line 43, in train
    steps_done = 0
  File "/home/usr/anaconda3/envs/UFold/lib/python3.6/site-packages/torch/serialization.py", line 327, in save
    with _open_file_like(f, 'wb') as opened_file:
  File "/home/usr/anaconda3/envs/UFold/lib/python3.6/site-packages/torch/serialization.py", line 212, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/home/usr/anaconda3/envs/UFold/lib/python3.6/site-packages/torch/serialization.py", line 193, in __init__
    super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: '../models_ckpt/final_model/for_servermodel/tmp/ufold_train_onalldata_0.pt'
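
(For anyone hitting the same error: a minimal workaround is to create the checkpoint directory before the save, or to point the save path somewhere that already exists. The path below is copied from the traceback; the rest is just a sketch.)

import os
# Ensure the hardcoded checkpoint directory exists so torch.save does not
# fail with FileNotFoundError.
save_path = '../models_ckpt/final_model/for_servermodel/tmp/ufold_train_onalldata_0.pt'
os.makedirs(os.path.dirname(save_path), exist_ok=True)
# ... the existing torch.save(...) call in ufold_train.py can then write to save_path.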

Could you please assist me in re-creating your training methodology for a custom dataset? Additionally, could you detail how I might go about re-training with the synthetic data mentioned in the paper, along with the methodology used to generate it? I have found references to multiple training stages in ufold/config.json (for example, epoches_first, epoches_second, and epoches_third), but these parameters are not referenced anywhere else in the code.

sperfu commented 2 years ago

Hi there,

Thanks for reaching out. The first issue you point out is due to the GPU id: you can switch it to your own CUDA device id, or change it to cpu, since our framework can also run on CPU. The second one you ran into is because our model supports multiple input datasets, so we call a function that merges all the datasets; we have also updated the code to handle a single dataset, as you can see in the data_generator.py file. The third issue is simply due to an error in the pre-trained model saving path, which we have changed to the working path; you may change it to your own path as well. Lastly, some references in the ufold/config.json file do not show up in the code because some parameters were only used for earlier debugging and testing, so we have deleted the unnecessary references.

Last but not least, regarding the synthetic data you mentioned: for each real sequence, we first generate 3 synthetic sequences by randomly mutating some nucleotides of the corresponding real sequence, creating a pool of synthetic sequences. To make these sequences pass the redundancy-removal procedure and keep them clearly separated from the training set, we then use CD-HIT at an 80% threshold to remove any sequence with more than 80% similarity to a real sequence. The synthetic ground-truth labels are generated with Contrafold, and these are then used to train UFold.
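
A rough sketch of the mutation step is shown below; the 10% mutation rate and the helper name mutate_sequence are only illustrative, not the exact values or script from our pipeline.

import random

def mutate_sequence(seq, mutation_rate=0.1):
    # Replace a random subset of nucleotides with a different base.
    bases = 'ACGU'
    seq = list(seq)
    for i, base in enumerate(seq):
        if random.random() < mutation_rate:
            seq[i] = random.choice([b for b in bases if b != base])
    return ''.join(seq)

# Three synthetic candidates per real sequence form the pool, which is then
# filtered with CD-HIT (80% similarity threshold) and labelled with Contrafold.
real_seq = 'GGGAAACUUCCC'
pool = [mutate_sequence(real_seq) for _ in range(3)]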

Thanks

marcellszi commented 2 years ago

Hi,

Thank you very much for your quick response and fixes.

I see your last four commits addressed my issues. I appreciate the help. I was able to start training a model after your changes.

Quick FYI: 8db5a90 breaks things due to 528533143e194854e264fcfd9802252c95f2f6b7/ufold/config.py#L24, but I was able to trivially fix it by reverting the config file.