Open marcellszi opened 2 years ago
Hi there,
Thanks for reaching out. The first issue you point out is due to the GPU id: you can switch to your own CUDA device id or change it to cpu, since our framework can also run on CPU. The second issue arises because our model supports multiple input datasets, so we call a function that merges all of them; we have also changed the code to handle a single dataset, as you can see in the data_generator.py file. The third issue is simply due to an error in the pre-trained model saving path, which we have changed to the working path; you may change it to your own path as well. Lastly, some references in the ufold/config.json file didn't show up in the code because some parameters were used for earlier debugging and testing, so we have deleted some unnecessary references.
As for the synthetic data you mentioned: for each real sequence, we first generate 3 synthetic sequences by randomly mutating some nucleotides of the corresponding real sequence, which creates a pool of synthetic sequences. To make sure the sequences pass the redundancy removal procedure and remain distinct from the training set, we then use CD-HIT 80 to remove any sequences that have over 80% similarity to real sequences. The synthetic ground truth labels are generated with Contrafold and are then used to train UFold.
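For readers who want to try the first step of the synthetic data procedure described above, here is a minimal sketch of building the mutation pool. It is not the UFold authors' code: the function names, the per-nucleotide mutation rate, the fixed seed, and the restriction to substitutions (no insertions or deletions) are all assumptions made for illustration. The CD-HIT filtering and Contrafold labelling steps are not shown.

```python
import random

NUCLEOTIDES = "ACGU"


def mutate(seq, rate, rng):
    """Substitute each nucleotide with a different one with probability `rate`."""
    out = []
    for base in seq:
        if rng.random() < rate:
            # Pick a replacement that differs from the original base.
            out.append(rng.choice([n for n in NUCLEOTIDES if n != base]))
        else:
            out.append(base)
    return "".join(out)


def synthetic_pool(real_seqs, per_seq=3, rate=0.1, seed=0):
    """Return `per_seq` mutated copies of every real sequence (rate is an assumption)."""
    rng = random.Random(seed)
    return [mutate(s, rate, rng) for s in real_seqs for _ in range(per_seq)]


if __name__ == "__main__":
    for s in synthetic_pool(["GGGAAACCC"]):
        print(s)
```

The resulting pool would still need to be filtered against the real sequences and labelled before it could be used for training.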
Thanks
Hi,
Thank you very much for your quick response and fixes.
I see your last four commits addressed my issues. I appreciate the help. I was able to start training a model after your changes.
Quick FYI: 8db5a90 breaks things due to 528533143e194854e264fcfd9802252c95f2f6b7/ufold/config.py#L24, but I was able to trivially fix it by reverting the config file.
I've attempted to reproduce some of the results from the paper:
L. Fu, Y. Cao, J. Wu, Q. Peng, Q. Nie, and X. Xie, "UFold: fast and accurate RNA secondary structure prediction with deep learning", Nucleic Acids Research, p. gkab1074, Nov. 2021, doi: 10.1093/nar/gkab1074.
I attempted to re-train UFold on a custom dataset, but ran into some issues, and have a few questions I'm hoping you can help clear up.
Running testing script
After following the installation instructions, I attempted to check if the installation succeeded (by running python ufold_test.py --test_files TS2), which resulted in the following traceback:
I suspected that this issue was simply due to setting a hardcoded CUDA device that I don't have, so I made the following changes to ufold_test.py:
After this, the test script runs without issues.
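A common way to avoid a hardcoded CUDA device is to fall back to the CPU when the requested GPU is not present. The sketch below illustrates that pattern; it is not the change actually made to ufold_test.py, and the helper name pick_device is hypothetical. It returns a device string that can be passed to torch.device.

```python
def pick_device(preferred_index=0):
    """Return "cuda:<index>" if that GPU is available, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # Without PyTorch installed there is nothing to select.
        return "cpu"
    if torch.cuda.is_available() and preferred_index < torch.cuda.device_count():
        return f"cuda:{preferred_index}"
    return "cpu"


if __name__ == "__main__":
    print(pick_device(0))
```

A script would then use something like model.to(torch.device(pick_device())) instead of a fixed device id.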
Running training script
I then attempted to train UFold on some other datasets, starting with the provided TS0 as an example. First, I made similar fixes to the hardcoded CUDA devices within ufold_train.py. Then, when trying to train the model via python ufold_train.py --train_files TS0, I ran into more issues.
The code contains several breakpoints (pdb.set_trace()), which I assume are simply left over from debugging. However, continuing through the breakpoints results in the following traceback:
I believe this is because you must provide two datasets as arguments, e.g. python ufold_train.py --train_files dataset_A dataset_B. However, it's not clear to me why this is the case. Is one of the datasets used for pre-training?
Finally, the training runs through an epoch, and then fails due to what I assume to be a hardcoded save path with the following traceback:
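Both problems above (needing exactly two datasets, and a fixed save location) can usually be avoided at the argument-parsing level. The following is a sketch under stated assumptions, not the code in ufold_train.py: the --train_files flag comes from the commands above, while the --save_dir flag, its default, and the checkpoint file name are hypothetical.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Training options (sketch)")
    # nargs="+" accepts one or more dataset names instead of exactly two.
    parser.add_argument("--train_files", nargs="+", required=True)
    # Hypothetical flag: lets the user choose where checkpoints are written.
    parser.add_argument("--save_dir", type=Path, default=Path("models"))
    return parser.parse_args(argv)


def checkpoint_path(save_dir, epoch):
    """Create the save directory if needed and return a per-epoch file path."""
    save_dir = Path(save_dir)
    save_dir.mkdir(parents=True, exist_ok=True)
    return save_dir / f"ufold_train_{epoch}.pt"  # file name is an assumption


if __name__ == "__main__":
    args = parse_args()
    print(args.train_files, checkpoint_path(args.save_dir, 0))
```

Creating the directory before saving means the first checkpoint write does not fail on a path that only exists on the original author's machine.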
Could you please assist me in re-creating your training methodology for a custom dataset? Additionally, could you detail how I might go about re-training with synthetic data as mentioned in the paper, along with the methodology to generate it? I have found references to multiple training steps in ufold/config.json (for example, epoches_first, epoches_second, and epoches_third), but no other references anywhere else in the code.