Open marc-harary opened 4 years ago
What does the last row (val2_seq) mean?
score.zip
If I ignore the one target that contains only non-standard nucleotide types (4r8iB) and the unknown last row (val2_seq), the performance of the three predictors is summarized in the following table; the full table is attached.
Method | BPpred | F1 | MCC | BPnat | label |
---|---|---|---|---|---|
SPOT-RNA | 23.52 | 0.79 | 0.80 | 32.47 | original label provided by SPOT-RNA package for all base pairs |
MXfold2 | 20.97 | 0.67 | 0.68 | 32.47 | original label provided by SPOT-RNA package for all base pairs |
E2Efold | 12.53 | 0.20 | 0.21 | 32.47 | original label provided by SPOT-RNA package for all base pairs |
SPOT-RNA | 23.29 | 0.83 | 0.84 | 21.63 | reconstructed label from dssr for canonical base pairs |
MXfold2 | 20.89 | 0.81 | 0.81 | 21.63 | reconstructed label from dssr for canonical base pairs |
E2Efold | 12.60 | 0.25 | 0.25 | 21.63 | reconstructed label from dssr for canonical base pairs |
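For reference, here is a minimal sketch of how F1 and MCC can be computed for one target from its predicted and native base pairs. It assumes the usual convention that every unordered pair of positions (i, j) counts as one binary decision; the scoring script behind the table may define true negatives differently, so treat `pair_metrics` and its arguments as hypothetical names rather than the code that produced these numbers.

```python
import math

def pair_metrics(predicted, native, seq_len):
    """Return (precision, recall, F1, MCC) for two collections of base pairs.

    Assumption: each base pair is an (i, j) tuple of positions, and every
    unordered pair of positions in the sequence is one classification case.
    """
    # Normalise so that (i, j) and (j, i) are treated as the same pair.
    pred = {tuple(sorted(p)) for p in predicted}
    nat = {tuple(sorted(p)) for p in native}

    tp = len(pred & nat)          # predicted and native
    fp = len(pred - nat)          # predicted but not native
    fn = len(nat - pred)          # native but missed
    total = seq_len * (seq_len - 1) // 2
    tn = total - tp - fp - fn     # everything else

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)

    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return precision, recall, f1, mcc


if __name__ == "__main__":
    native = [(1, 10), (2, 9), (3, 8)]
    predicted = [(1, 10), (2, 9), (4, 7)]
    print(pair_metrics(predicted, native, seq_len=10))
```

Because true negatives vastly outnumber the other three counts for any realistic RNA length, MCC computed this way stays close to the geometric mean of precision and recall, which would be consistent with the MCC and F1 columns above tracking each other so closely.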
We can draw the following conclusions.
You can present these data to Dr Pyle next week. Please remember to explain the difference between the original label and the reconstructed label. You can also explain how SPOT-RNA or MXfold2 works.
val2_seq is the name of one of the files in the PDB dataset. On further inspection of the file, it appears to contain the sequences of multiple molecules that weren't separated by my original bash script. I'm working right now on adding them to the dataset using the scripts you wrote.
I checked all 60 sequences in VL1_sequences/val2_seq. They are all included in your score.csv. You should ignore this file to avoid double-counting.
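A redundancy check like this one can be scripted roughly as follows. This is only a sketch: it assumes val2_seq is FASTA-like (a `>` header line followed by sequence lines), which the thread does not confirm, and `read_fasta` and `find_new_sequences` are hypothetical helper names.

```python
def read_fasta(text):
    """Parse FASTA-like text into a {record_name: sequence} dict.

    Assumption: records start with a '>' header line; the real layout of
    val2_seq may differ.
    """
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            # Fall back to a generated name if the header is empty.
            name = parts[0] if parts else f"record{len(records) + 1}"
            records[name] = ""
        elif name is not None:
            records[name] += line.upper()
    return records


def find_new_sequences(candidate_text, scored_sequences):
    """Return the records in candidate_text whose sequence is not yet scored."""
    known = {s.upper() for s in scored_sequences}
    return {n: s for n, s in read_fasta(candidate_text).items()
            if s not in known}


if __name__ == "__main__":
    candidate = ">a\nGGCC\n>b\nAUAU\nGC\n"
    print(find_new_sequences(candidate, ["ggcc"]))
```

An empty result means every sequence in the file is already covered, so the file can be skipped without losing any target.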
As a next step for your project, you should try to retrain the SPOT-RNA TensorFlow models (or the MXfold2 PyTorch models, if you prefer PyTorch to TensorFlow) to see whether you can reproduce the performance using the same neural network architecture and parameters. The MXfold2 training script is available on its GitHub page, whereas you will have to write your own SPOT-RNA training script.
According to the SPOT-RNA paper, "RNA secondary structure prediction using an ensemble of two-dimensional deep neural networks and transfer learning", its TensorFlow models are first trained on the bpRNA dataset (TR0 for training, VL0 for validation, TS0 for testing). The bpRNA-pretrained models are then further trained on the PDB dataset (TR1 for training, VL1 for validation, TS1 for testing). There are five models, each with a different set of model parameters (number of layers, depth of layers, kernel size, dilation factor, and learning rate), all trained on the same dataset. Apart from the dropout rate (25%), not much is known about the training hyperparameters, such as the learning rate(s) used in the Adam optimization. You will need to find these by trial and error. Luckily, we have most of the model parameters at SPOT-RNA-models/model*.meta for each of the 5 models from the SPOT-RNA package.
Okay, great. It looks like every sequence contained in the val2_seq file is redundant. I'm not sure why a sequence ended up in an output file with the same name, but I think the row in the output .csv file just needs to be deleted.
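Deleting that row could look something like the sketch below. The column name `target` and the helper `drop_rows` are assumptions made for illustration; the actual header of the output .csv is not shown in this thread.

```python
import csv
import io

def drop_rows(csv_text, target_column, unwanted):
    """Return csv_text without the rows whose target_column equals unwanted.

    Assumption: the CSV has a header row and target_column is one of its
    fields; the real column name in the score file may be different.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames,
                            lineterminator="\n")
    writer.writeheader()
    for row in reader:
        if row[target_column] != unwanted:
            writer.writerow(row)
    return out.getvalue()


if __name__ == "__main__":
    scores = "target,F1\n1abcA,0.8\nval2_seq,0.1\n"
    print(drop_rows(scores, "target", "val2_seq"))
```

Filtering on the row label rather than on its position keeps the script correct even if the extra row does not end up last in a future run.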
Sorry for the delays! scores.zip