kad-ecoli / rna3db

maintain local copy of RNA structure database

marc harary meeting 2020-10-16 minute #3

Closed kad-ecoli closed 3 years ago

kad-ecoli commented 3 years ago

This is an update on issue https://github.com/kad-ecoli/rna3db/issues/2

As shown in Marc's benchmark of SPOT-RNA on the PDB dataset, there are systematically more base pairs in the labels (BPnat) than in the SPOT-RNA predictions (BPpred):

BPpred   F1     MCC    BPnat
23.46    0.79   0.80   32.40

It is suspected that this is because SPOT-RNA (just like any other RNA SS predictor) is only able to predict canonical base pairs (Watson-Crick, or WC, and Wobble base pairs), whereas the label files provided by SPOT-RNA include all base pairs (both canonical and non-canonical) assigned by DSSR. Therefore, it may be necessary to re-benchmark SPOT-RNA on a new label file that includes only WC and Wobble base pairs.
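To make the canonical/non-canonical distinction concrete, here is a minimal Python sketch that keeps only pairs whose two bases could form a WC or Wobble pair. It checks base identity only (A-U, G-C, G-U), whereas DSSR also uses geometry when it labels a pair as WC or Wobble, so this is a rough sanity check and not a substitute for the DSSR-based filtering used in this issue. The function name, the example sequence and the example pairs are all hypothetical.

```python
# Base combinations that can form Watson-Crick (A-U, G-C) or Wobble (G-U) pairs.
# Assumption: identity-based check only; DSSR additionally considers geometry.
CANONICAL = {
    ("A", "U"), ("U", "A"),
    ("G", "C"), ("C", "G"),
    ("G", "U"), ("U", "G"),
}


def filter_canonical(sequence, pairs):
    """Keep 1-based (i, j) pairs whose bases are a canonical combination."""
    seq = sequence.upper()
    return [(i, j) for i, j in pairs if (seq[i - 1], seq[j - 1]) in CANONICAL]


if __name__ == "__main__":
    seq = "GGGAAACCU"                          # hypothetical sequence
    pairs = [(1, 9), (2, 8), (3, 7), (4, 6)]   # hypothetical base pairs
    print(filter_canonical(seq, pairs))
```

In this toy input the A-A pair (4, 6) is dropped and the other three pairs are kept.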

Here is a code example to generate the base pairs for TS1_labels/1f7u-1-B from the SPOT-RNA dataset.

Download PDB 1f7u chain B using https://github.com/kad-ecoli/rna3db/blob/master/script/fetch.py

fetch.py 1f7uB

This will download 1f7uB.pdb, which should be cleaned by https://github.com/kad-ecoli/rna3db/blob/master/script/clean_pdb.py

clean_pdb.py 1f7uB.pdb clean.pdb -StartIndex=1 -NewChainID=_

Get the sequence from the resulting clean.pdb using https://github.com/kad-ecoli/rna3db/blob/master/script/pdb2fasta.py

pdb2fasta.py clean.pdb > 1f7uB.fasta

Notice that the sequence in 1f7uB.fasta is in lowercase, but SPOT-RNA usually takes uppercase sequences. You may want to modify the pdb2fasta.py program to make it output uppercase sequences.

Run the dssr program to perform the SS assignment, and grep only the canonical base pairs to put them into 1f7u-1-B.label

echo "# 1f7u-1-B" > 1f7u-1-B.label
echo "i             j" >> 1f7u-1-B.label
x3dna-dssr  -i=clean.pdb --pair-only|grep -P "( WC )|( Wobble )"|cut -c6-35|sed 's/[A-Z]//g' >> 1f7u-1-B.label

The dssr program at https://github.com/kad-ecoli/rna3db/blob/master/script/x3dna-dssr is for Linux. For macOS, you need to register and download the program from the official dssr website at http://forum.x3dna.org/site-announcements/download-instructions/
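For the lowercase sequence issue mentioned in the pdb2fasta.py step, one alternative to modifying pdb2fasta.py itself is to post-process its output. The sketch below is a hypothetical helper, not part of the rna3db scripts: it uppercases the sequence lines of a FASTA file and leaves header lines (those starting with ">") unchanged.

```python
import sys


def uppercase_fasta(text):
    """Uppercase sequence lines; keep header lines (starting with '>') as-is."""
    out = []
    for line in text.splitlines():
        out.append(line if line.startswith(">") else line.upper())
    return "\n".join(out) + "\n"


if __name__ == "__main__" and len(sys.argv) == 2:
    # Hypothetical usage: python uppercase_fasta.py 1f7uB.fasta > upper.fasta
    with open(sys.argv[1]) as handle:
        sys.stdout.write(uppercase_fasta(handle.read()))
```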

Plan for Marc before next week (2020-10-22): use the above scripts to reconstruct the sequence and label files for all RNAs from the SPOT-RNA PDB dataset. Run SPOT-RNA, MXfold2 and e2efold on both the original SPOT-RNA PDB dataset and the new reconstructed dataset. Report the accuracy (MCC and F1) on both the original and the reconstructed datasets during the weekly meeting.
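For reference, here is a minimal Python sketch of how F1 and MCC can be computed from a predicted and a native set of base pairs. The exact definitions used in Marc's benchmark are not stated in this issue, so the sketch assumes one common convention: every unordered pair of positions (i, j) is a candidate, so the number of true negatives is L(L-1)/2 minus TP, FP and FN. The function name and the example pairs are hypothetical.

```python
import math


def bp_metrics(pred, nat, length):
    """Return (F1, MCC) for predicted vs. native base pairs of an RNA of given length."""
    pred = {tuple(sorted(p)) for p in pred}   # treat (i, j) and (j, i) as the same pair
    nat = {tuple(sorted(p)) for p in nat}
    tp = len(pred & nat)                      # pairs in both prediction and label
    fp = len(pred - nat)                      # predicted but not in label
    fn = len(nat - pred)                      # in label but not predicted
    # Assumption: all L*(L-1)/2 position pairs are candidates.
    tn = length * (length - 1) // 2 - tp - fp - fn
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return f1, mcc


if __name__ == "__main__":
    native = [(1, 10), (2, 9)]      # hypothetical label pairs
    predicted = [(1, 10)]           # hypothetical prediction
    f1, mcc = bp_metrics(predicted, native, length=10)
    print(f"F1={f1:.2f} MCC={mcc:.2f}")
```

Averaging the per-RNA values over the dataset would then give numbers comparable in form to the table reported in this issue, provided the benchmark uses the same definitions.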

kad-ecoli commented 3 years ago
Here is the average result on the SPOT-RNA PDB dataset for all three methods:

Method     BPpred   F1     MCC    BPnat
SPOT-RNA   23.46    0.79   0.80   32.40
MXfold2    20.97    0.67   0.68   32.40
E2Efold    12.53    0.29   0.21   32.40