Final scores for cleaned data

marc-harary commented 4 years ago

Sorry for the delays! scores.zip

kad-ecoli commented 4 years ago

What does the last row (val2_seq) mean?

kad-ecoli commented 4 years ago

score.zip If I ignore the one target that contains only non-standard nucleotide types (4r8iB) and the unknown last row (val2_seq), the performance of the three predictors is summarized in the following table; the full table is attached.

Method	BPpred	F1	MCC	BPnat	label
SPOT-RNA	23.52	0.79	0.80	32.47	original label provided by SPOT-RNA package for all base pairs
MXfold2	20.97	0.67	0.68	32.47	original label provided by SPOT-RNA package for all base pairs
E2Efold	12.53	0.20	0.21	32.47	original label provided by SPOT-RNA package for all base pairs
SPOT-RNA	23.29	0.83	0.84	21.63	reconstructed label from dssr for canonical base pairs
MXfold2	20.89	0.81	0.81	21.63	reconstructed label from dssr for canonical base pairs
E2Efold	12.60	0.25	0.25	21.63	reconstructed label from dssr for canonical base pairs

We can draw the following conclusions.

For all three programs, the F1 and MCC assessed against the reconstructed labels for canonical base pairs are consistently higher than the F1 and MCC assessed against the original labels for all base pairs. These data suggest that all three programs are actually designed to predict canonical base pairs only. Therefore, in subsequent training and evaluation, we should use our reconstructed labels for canonical base pairs.
On the reconstructed labels, SPOT-RNA slightly outperforms MXfold2 (F1 0.83 vs. 0.81), which in turn significantly outperforms E2Efold (F1 0.25). Therefore, we should use SPOT-RNA (preferred) or MXfold2 as the basis for developing our secondary structure (SS) and distance predictors.
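
For reference when explaining the table, here is a minimal sketch of how F1 and MCC can be computed for one target from its predicted and native base pairs. It assumes one common convention (every unordered pair of positions counts as a candidate, so true negatives are all remaining position pairs); the scripts behind score.zip may differ in detail. The function name `pair_scores` and the toy pairs are illustrative only.

```python
from math import sqrt

def pair_scores(pred_pairs, native_pairs, seq_len):
    """Return (F1, MCC) for one target.

    pred_pairs / native_pairs: iterables of (i, j) position tuples.
    seq_len: sequence length, used to count true negatives
             (assumption: all L*(L-1)/2 position pairs are candidates).
    """
    # Normalise so that (i, j) and (j, i) are the same pair.
    pred = {tuple(sorted(p)) for p in pred_pairs}
    nat = {tuple(sorted(p)) for p in native_pairs}

    tp = len(pred & nat)   # predicted and native
    fp = len(pred - nat)   # predicted but not native
    fn = len(nat - pred)   # native but missed
    tn = seq_len * (seq_len - 1) // 2 - tp - fp - fn

    f1_den = 2 * tp + fp + fn
    f1 = 2 * tp / f1_den if f1_den else 0.0

    mcc_den = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0
    return f1, mcc

# Toy example: 10-nt hairpin, two of three predicted pairs are correct.
f1, mcc = pair_scores([(1, 10), (2, 9), (3, 8)],
                      [(1, 10), (2, 9), (4, 7)], seq_len=10)
print(f"F1={f1:.3f} MCC={mcc:.3f}")
```

Because true negatives vastly outnumber the other three counts for any realistic RNA length, MCC and F1 end up very close to each other, which matches the near-identical F1 and MCC columns in the table.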

You can present these data to Dr Pyle next week. Please remember to explain the difference between the original labels and the reconstructed labels. You can also explain how SPOT-RNA and MXfold2 work.

marc-harary commented 4 years ago

val2_seq is the name of one of the files in the PDB dataset. On further inspection of the file, it appears to contain the sequences of multiple molecules that weren't separated by my original bash script. I'm working right now on adding them to the dataset using the scripts you wrote.

kad-ecoli commented 4 years ago

I checked all 60 sequences in VL1_sequences/val2_seq. They are all included in your score.csv. You should ignore this file to avoid double-counting.
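
A check like this can be scripted. The sketch below assumes that val2_seq is FASTA-like (one `>name` header per sequence) and that score.csv has a header row with a column holding the target name; the column name `target` and both helper names are hypothetical, so adjust them to the real files.

```python
import csv
import io

def fasta_names(fasta_text):
    """Return the sequence names from '>' header lines (assumed FASTA-like)."""
    return [line[1:].split()[0]
            for line in fasta_text.splitlines()
            if line.startswith(">") and line[1:].strip()]

def missing_from_scores(fasta_text, csv_text, column="target"):
    """List names present in the FASTA text but absent from the score table.

    `column` is a hypothetical header name for the target column.
    """
    scored = {row[column] for row in csv.DictReader(io.StringIO(csv_text))}
    return [name for name in fasta_names(fasta_text) if name not in scored]

# Toy data standing in for VL1_sequences/val2_seq and score.csv.
fasta_text = ">1abcA\nGGCC\n>2xyzB\nAUAU\n"
csv_text = "target,F1,MCC\n1abcA,0.8,0.8\n2xyzB,0.5,0.5\nval2_seq,0.1,0.1\n"
print(missing_from_scores(fasta_text, csv_text))  # empty list: all already scored
```

An empty result means every sequence in the file is already scored elsewhere, so the file itself adds nothing.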

kad-ecoli commented 4 years ago

As a next step for your project, you should try to retrain the SPOT-RNA TensorFlow models (or the MXfold2 PyTorch models, if you prefer PyTorch to TensorFlow) to see whether you can reproduce the reported performance using the same neural network architecture and parameters. The MXfold2 training script is available in its GitHub repository, whereas you will have to write your own SPOT-RNA training script.

According to the SPOT-RNA paper, "RNA secondary structure prediction using an ensemble of two-dimensional deep neural networks and transfer learning", its TensorFlow models are first trained on the bpRNA dataset (TR0 for training, VL0 for validation, TS0 for testing). The bpRNA-pretrained models are then further trained on the PDB dataset (TR1 for training, VL1 for validation, TS1 for testing). There are five models, each with a different set of model parameters (number of layers, depth of layers, kernel size, dilation factor, and learning rate), all trained on the same datasets. Apart from the dropout rate (25%), not much is known about the training hyperparameters, such as the learning rate(s) used in the Adam optimizer. You will need to find these by trial and error. Luckily, most of the model parameters are available in SPOT-RNA-models/model*.meta for each of the five models in the SPOT-RNA package.
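
One simple way to organise the trial and error is a log-spaced sweep over learning rates, keeping whichever gives the best validation score (on VL0 during pretraining, VL1 during fine-tuning). The sketch below is framework-agnostic: `train_and_validate` is a hypothetical callable that you would write yourself to train one model at a given learning rate and return its validation F1 or MCC. The range 1e-5 to 1e-1 is only a starting guess, not a value from the paper.

```python
from math import log10

def log_spaced(low, high, n):
    """Return n values spaced evenly on a log scale from low to high."""
    if n == 1:
        return [low]
    ratio = high / low
    return [low * ratio ** (i / (n - 1)) for i in range(n)]

def select_learning_rate(candidates, train_and_validate):
    """Train once per candidate and return (best_lr, {lr: validation_score}).

    train_and_validate: hypothetical user-supplied function lr -> score,
    where a higher score is better (e.g. validation F1).
    """
    results = {lr: train_and_validate(lr) for lr in candidates}
    best_lr = max(results, key=results.get)
    return best_lr, results

# Stand-in for a real training run: pretend validation peaks near 1e-3.
def fake_train_and_validate(lr):
    return -abs(log10(lr) + 3)

best_lr, results = select_learning_rate(log_spaced(1e-5, 1e-1, 5),
                                        fake_train_and_validate)
print(f"best learning rate: {best_lr:.0e}")
```

Once a coarse sweep finds the right order of magnitude, a second, narrower sweep around the winner is usually enough.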

marc-harary commented 4 years ago

Okay, great. It looks like every sequence contained in the val2_seq file is redundant. I'm not sure why a sequence ended up in an output file with the same name, but I think the row in the output .csv file just needs to be deleted.
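
Deleting that row could look something like the sketch below. It assumes the .csv has a header and that the target name sits in a column called `target`; that column name and the `drop_rows` helper are hypothetical, so adjust them to match the real file.

```python
import csv
import io

def drop_rows(csv_text, unwanted, column="target"):
    """Return csv_text without rows whose `column` value is in `unwanted`.

    `column` is a hypothetical header name; the header row is preserved.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames,
                            lineterminator="\n")
    writer.writeheader()
    for row in reader:
        if row[column] not in unwanted:
            writer.writerow(row)
    return out.getvalue()

# Toy table standing in for the real output file.
table = "target,F1,MCC\n1abcA,0.8,0.8\nval2_seq,0.1,0.1\n"
print(drop_rows(table, {"val2_seq"}))
```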

marc-harary commented 4 years ago

mccs
