kad-ecoli / rna3db

maintain local copy of RNA structure database

Marc Harary meeting 2020-11-05 minute #11

kad-ecoli opened 3 years ago

kad-ecoli commented 3 years ago
  1. As a next step for this project, Marc should try to retrain the MXfold2 PyTorch models to see whether the retrained model can reproduce the reported performance using the same neural network architecture and parameters. The MXfold2 training script is available in its GitHub repository. You should be able to draw a plot of training/validation loss (y-axis) versus training epoch (x-axis), as in the example at https://machinelearningmastery.com/how-to-stop-training-deep-neural-networks-at-the-right-time-using-early-stopping/ (the real plot is usually not that smooth); a plotting sketch follows this list.

  2. Additionally, Marc should read the RNAcontact paper https://doi.org/10.1093/bioinformatics/btaa932

  3. There will be a co-mentor to help Marc with Farnam usage until Chengxin officially joins Yale next February. Most likely the co-mentor will be Rafael Tavares. The co-mentor's responsibility is mainly Farnam usage, while Chengxin remains responsible for guiding the scientific development of this project.
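
For item 1, here is a minimal plotting sketch. It assumes the per-epoch losses have already been collected into two lists (e.g. parsed from the training output); that input format is an assumption for illustration, not MXfold2's actual log format.

```python
# Minimal sketch: plot training/validation loss versus epoch with matplotlib.
# Assumes losses were already collected into two lists; this input format
# is an assumption, not the actual MXfold2 log format.
import matplotlib.pyplot as plt

def plot_losses(train_losses, val_losses, out_png="loss_curve.png"):
    epochs = range(1, len(train_losses) + 1)
    plt.plot(epochs, train_losses, label="training loss")
    plt.plot(epochs, val_losses, label="validation loss")
    plt.xlabel("training epoch")
    plt.ylabel("loss")
    plt.legend()
    plt.savefig(out_png, dpi=150)

# Dummy numbers just to show usage (real curves are usually not this smooth):
plot_losses([2.0, 1.5, 1.2, 1.0], [2.1, 1.7, 1.5, 1.4])
```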

kad-ecoli commented 3 years ago

I may have misled you earlier. MXfold2 actually has several datasets, including archiveII, bpRNA, bpRNAnew, Rivas, and RNAStrAlign. The default dataset used by the MXfold2 training script is Rivas (aka TrainSetA). The archiveII and RNAStrAlign datasets are from E2Efold and should not be used because they are ridiculously redundant. The bpRNA (aka TR0) and bpRNAnew datasets are the original and modified SPOT-RNA pre-training datasets, respectively. It is surprising that MXfold2 did not include the SPOT-RNA PDB dataset.

As the initial training step, you can run two separate training jobs in parallel, one on Rivas and one on bpRNA. We shall see which model gives better performance.
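
If it helps, here is a rough launch sketch for the two runs. The .lst file paths are placeholders, and apart from --log-dir (see the next comment) the exact command-line options should be verified against `mxfold2 train --help`.

```python
# Rough sketch: launch two independent mxfold2 training runs in parallel,
# one on Rivas (TrainSetA) and one on bpRNA (TR0).
# The .lst paths are placeholders; verify flags with `mxfold2 train --help`.
import subprocess

runs = {
    "rivas": "data/TrainSetA.lst",  # hypothetical path to the Rivas list file
    "bprna": "data/TR0.lst",        # hypothetical path to the bpRNA list file
}

procs = [
    subprocess.Popen(["mxfold2", "train", "--log-dir", f"checkpoints_{name}", lst])
    for name, lst in runs.items()
]
for p in procs:
    p.wait()  # block until both runs finish
```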

kad-ecoli commented 3 years ago

In case you have not yet figured out how to save checkpoint files for each epoch, they can be saved using the --log-dir flag. You can resume from the checkpoint file of a specific epoch using --resume. This could be useful when you perform transfer learning in the future.
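
For reference, the mechanism behind those flags is ordinary PyTorch checkpointing; a generic sketch (not MXfold2's actual code) looks like this:

```python
# Generic PyTorch per-epoch checkpointing, illustrating conceptually what
# --log-dir / --resume do. This is not MXfold2's actual implementation.
import torch

def save_checkpoint(path, epoch, model, optimizer):
    torch.save({
        "epoch": epoch,
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
    }, path)

def resume_from_checkpoint(path, model, optimizer):
    ckpt = torch.load(path)
    model.load_state_dict(ckpt["model_state_dict"])
    optimizer.load_state_dict(ckpt["optimizer_state_dict"])
    return ckpt["epoch"] + 1  # epoch to resume training from
```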

Moreover, if I understand loss.py correctly, L1 regularization is calculated but L2 regularization is not (it is commented out). This may not be evident if you use the default training parameters, where both --l1-weight and --l2-weight are 0.
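
To illustrate, a weighted L1 penalty with the L2 analog commented out would look roughly like this (a sketch of the pattern, not the actual loss.py code):

```python
# Sketch: weighted L1 regularization over model parameters, with the L2
# term commented out, mirroring the pattern described above. This is an
# illustration, not MXfold2's actual loss.py.
import torch

def regularization(model, l1_weight=0.0, l2_weight=0.0):
    reg = 0.0
    for param in model.parameters():
        reg = reg + l1_weight * torch.sum(torch.abs(param))
        # reg = reg + l2_weight * torch.sum(param ** 2)  # L2: commented out
    return reg
```

Note that with both weights at their default of 0, the penalty vanishes either way, which is why the difference is easy to miss.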

marc-harary commented 3 years ago

Training mxfold2 on the PDB dataset has been going well. By the way, I don't know if you saw, but a new database of 23 million molecules was released two weeks ago; it includes secondary structures for 14 million of them.

https://rnacentral.org/search?q=has_secondary_structure:%22True%22

kad-ecoli commented 3 years ago

I am aware of RNAcentral and know that it provides secondary structures. Most of the secondary structures are template-based predictions (https://rnacentral.org/help/secondary-structure). I am not sure whether including template-based secondary structure predictions is helpful for pre-training, but you can certainly try. Most RNAs in the RNAcentral database are rRNAs and tRNAs, so you will need to remove redundant sequences, e.g. with cd-hit-est; a sketch is below.
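
A typical invocation, wrapped in Python here for consistency; the file names are placeholders, and the 80% identity cutoff is an example choice, not a recommendation from this thread:

```python
# Sketch: cluster away redundant RNA sequences with cd-hit-est.
# File names are placeholders; -c 0.8 (80% identity) is an example cutoff,
# and -n 5 is the matching word size for cutoffs in the 0.80-0.85 range.
import subprocess

subprocess.run([
    "cd-hit-est",
    "-i", "rnacentral_ss.fasta",  # hypothetical input FASTA
    "-o", "rnacentral_nr.fasta",  # non-redundant output FASTA
    "-c", "0.8",                  # sequence identity cutoff
    "-n", "5",                    # word size for this cutoff range
], check=True)
```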