As shown in Marc's benchmark of SPOT-RNA on the PDB dataset, the label files contain systematically more base pairs than the SPOT-RNA predictions:
| BPpred | F1 | MCC | BPnat |
|--------|------|------|-------|
| 23.46 | 0.79 | 0.80 | 32.40 |
It is suspected that this is because SPOT-RNA (just like any other RNA SS predictor) is only able to predict canonical base pairs (Watson-Crick, or WC, and Wobble base pairs), but the label files provided by SPOT-RNA include all base pairs (both canonical and non-canonical) assigned by DSSR. Therefore, it may be necessary to re-benchmark SPOT-RNA on a new label file that only includes WC and Wobble base pairs.
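To make the canonical vs. non-canonical distinction concrete, here is a minimal Python sketch that keeps only pairs whose nucleotides are A-U, G-C or G-U. This is only a proxy based on nucleotide identity and is an assumption for illustration: DSSR classifies pairs by geometry, so its WC/Wobble assignment may differ from this filter. The function name `keep_canonical` and the 1-based `(i, j)` pair format are hypothetical.

```python
# Nucleotide combinations treated as canonical here (illustrative assumption):
# Watson-Crick A-U and G-C, plus the G-U wobble pair.
CANONICAL = {frozenset("AU"), frozenset("GC"), frozenset("GU")}


def keep_canonical(sequence, pairs):
    """Return only the 1-based pairs (i, j) whose nucleotides form AU, GC or GU."""
    seq = sequence.upper()  # accept lowercase input such as clean.fasta
    return [
        (i, j)
        for i, j in pairs
        if frozenset((seq[i - 1], seq[j - 1])) in CANONICAL
    ]
```

For example, in the sequence GAAAUC the pairs (1, 6) and (2, 5) would be kept, while the A-A pair (3, 4) would be dropped.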
This is an update on issue https://github.com/kad-ecoli/rna3db/issues/2
Here is a code example to generate base pairs for TS1_labels/1f7u-1-B from the SPOT-RNA dataset.
Download PDB 1f7u chain B using https://github.com/kad-ecoli/rna3db/blob/master/script/fetch.py
This will download 1f7uB.pdb, which should be cleaned by https://github.com/kad-ecoli/rna3db/blob/master/script/clean_pdb.py
Get the sequence from the resulting clean.pdb using https://github.com/kad-ecoli/rna3db/blob/master/script/pdb2fasta.py
Note that the sequence in clean.fasta is in lowercase, whereas SPOT-RNA usually takes uppercase sequences. You may want to modify the pdb2fasta.py program so that it outputs an uppercase sequence.
Run the dssr program to perform the SS assignment, and grep only the canonical base pairs into 1f7u-1-B.label
The dssr program at https://github.com/kad-ecoli/rna3db/blob/master/script/x3dna-dssr is for Linux. For macOS, you need to register and download the program from the official dssr website at http://forum.x3dna.org/site-announcements/download-instructions/
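As an alternative to modifying pdb2fasta.py, the lowercase clean.fasta could be post-processed before it is passed to SPOT-RNA. The following is a minimal Python sketch, not part of the rna3db scripts; the function name `uppercase_fasta` is hypothetical.

```python
def uppercase_fasta(text):
    """Uppercase the sequence lines of FASTA text.

    Header lines (those starting with '>') are left unchanged so that
    identifiers keep their original case.
    """
    return "".join(
        line if line.startswith(">") else line.upper()
        for line in text.splitlines(keepends=True)
    )
```

One way to use it would be to read clean.fasta into a string, pass it through `uppercase_fasta`, and write the result to a new file.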
Plan for Marc before next week (2020-10-22): use the above scripts to reconstruct the sequence and label files for all RNAs from the SPOT-RNA PDB dataset. Run SPOT-RNA, MXfold2 and e2efold on both the original SPOT-RNA PDB dataset and the new reconstructed dataset. Report the accuracy (MCC and F1) for both the original and reconstructed datasets at the weekly meeting.
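For the accuracy report, here is a minimal Python sketch of how F1 and MCC could be computed for one RNA from its predicted and native base pairs. It assumes the common convention that every one of the L(L-1)/2 possible position pairs counts as one binary classification; the convention actually used in the earlier benchmark is not stated here, so treat this as an assumption. The function name `pair_metrics` is hypothetical.

```python
import math


def pair_metrics(pred, native, length):
    """Return (F1, MCC) for predicted vs. native base pairs of one RNA.

    pred, native: iterables of (i, j) position pairs; order within a pair is ignored.
    length: sequence length L, used to count true negatives among L*(L-1)/2 pairs.
    """
    pred = {tuple(sorted(p)) for p in pred}
    native = {tuple(sorted(p)) for p in native}
    tp = len(pred & native)   # predicted and native
    fp = len(pred - native)   # predicted but not native
    fn = len(native - pred)   # native but not predicted
    tn = length * (length - 1) // 2 - tp - fp - fn
    f1 = 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return f1, mcc
```

Under this definition, extra non-canonical pairs in the label file count as false negatives for a predictor that only outputs canonical pairs, which lowers both F1 and MCC.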