marc harary meeting 2020-10-16 minute

This is an update on issue https://github.com/kad-ecoli/rna3db/issues/2

As shown in Marc's benchmark of SPOT-RNA on the PDB dataset, there are systematically more base pairs in the labels (BPnat) than in the SPOT-RNA predictions (BPpred):

BPpred	F1	MCC	BPnat
23.46	0.79	0.80	32.40

It is suspected that this is because SPOT-RNA (just like any other RNA SS predictor) is only able to predict canonical base pairs (Watson-Crick, or WC, and Wobble base pairs), whereas the label files provided by SPOT-RNA include all base pairs (both canonical and non-canonical) assigned by DSSR. Therefore, it may be necessary to re-benchmark SPOT-RNA on a new label file that includes only WC and Wobble base pairs.
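As a quick check of this suspicion before re-running DSSR, the pairs in an existing label file can be split by nucleotide identity. The Python sketch below is only a rough proxy: it treats A-U, G-C and G-U as canonical from the letters alone, whereas DSSR assigns WC and Wobble from the 3D geometry. The function name split_pairs and the toy hairpin are made up for illustration and are not part of rna3db or SPOT-RNA.

```python
# Nucleotide combinations that can form canonical pairs.
# Assumption: identity only; DSSR additionally checks the pairing geometry.
CANONICAL = {"AU", "UA", "GC", "CG", "GU", "UG"}


def split_pairs(seq, pairs):
    """Split 1-based (i, j) pairs into canonical-by-identity and the rest."""
    seq = seq.upper().replace("T", "U")
    canonical, other = [], []
    for i, j in pairs:
        if seq[i - 1] + seq[j - 1] in CANONICAL:
            canonical.append((i, j))
        else:
            other.append((i, j))
    return canonical, other


if __name__ == "__main__":
    # Toy hairpin, not taken from the SPOT-RNA dataset.
    toy_seq = "GGGAAAUCC"
    toy_pairs = [(1, 9), (2, 8), (3, 7), (4, 6)]
    canonical, other = split_pairs(toy_seq, toy_pairs)
    print(len(canonical), "canonical by identity,", len(other), "other")
```

If a large share of the label pairs falls into the second list, that would be consistent with the BPnat excess coming from non-canonical pairs.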

Here is a code example that generates the base pairs for TS1_labels/1f7u-1-B from the SPOT-RNA dataset.

Download PDB 1f7u chain B using https://github.com/kad-ecoli/rna3db/blob/master/script/fetch.py

fetch.py 1f7uB

This will download 1f7uB.pdb, which should be cleaned by https://github.com/kad-ecoli/rna3db/blob/master/script/clean_pdb.py

clean_pdb.py 1f7uB.pdb clean.pdb -StartIndex=1 -NewChainID=_

Get the sequence from the resulting clean.pdb using https://github.com/kad-ecoli/rna3db/blob/master/script/pdb2fasta.py

pdb2fasta.py clean.pdb > 1f7uB.fasta
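Because SPOT-RNA usually expects uppercase input, the FASTA can also be post-processed instead of changing pdb2fasta.py itself. A minimal Python sketch, assuming a plain FASTA layout with ">" header lines; the function name uppercase_fasta and the toy record are hypothetical:

```python
def uppercase_fasta(text):
    """Uppercase sequence lines; leave '>' header lines unchanged."""
    out = []
    for line in text.splitlines():
        out.append(line if line.startswith(">") else line.upper())
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Toy record for illustration, not the real 1f7uB sequence.
    print(uppercase_fasta(">1f7uB\ngcgcaauu\n"), end="")
```

Header lines are kept as they are so that the chain name stays identical to the one used for the label file.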

Notice that the sequence in 1f7uB.fasta is in lowercase, but SPOT-RNA usually takes uppercase sequences. You may want to modify the pdb2fasta.py program to make it output uppercase sequences.

Run the dssr program to perform the SS assignment, and grep only the canonical base pairs to put them into 1f7u-1-B.label

echo "# 1f7u-1-B" > 1f7u-1-B.label
echo "i             j" >> 1f7u-1-B.label
x3dna-dssr -i=clean.pdb --pair-only | grep -P "( WC )|( Wobble )" | cut -c6-35 | sed 's/[A-Z]//g' >> 1f7u-1-B.label
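To compare such a label file with a prediction, the pairs have to be read back in. The Python sketch below assumes the layout written by the commands above: two header lines, then one pair per line with the two residue indices as the first two integers. The exact columns left over after cut and sed are not verified here, so the parser simply takes the first two integers it finds on each line. The function name read_label and the toy content are hypothetical.

```python
import re


def read_label(text):
    """Return the set of (i, j) pairs, with i < j, from label-file text."""
    pairs = set()
    for line in text.splitlines()[2:]:  # skip the two header lines
        numbers = re.findall(r"\d+", line)
        if len(numbers) >= 2:
            i, j = int(numbers[0]), int(numbers[1])
            pairs.add((min(i, j), max(i, j)))
    return pairs


if __name__ == "__main__":
    # Toy content for illustration, not the real 1f7u-1-B pairs.
    toy = "# 1f7u-1-B\ni             j\n  1    72\n  2    71\n"
    print(sorted(read_label(toy)))
```

Storing each pair as (smaller index, larger index) keeps the set independent of the order in which DSSR lists the two partners.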

The dssr program at https://github.com/kad-ecoli/rna3db/blob/master/script/x3dna-dssr is for Linux. For macOS, you need to register and download the program from the official dssr website at http://forum.x3dna.org/site-announcements/download-instructions/

Plan for Marc before next week (2020-10-22): use the above scripts to reconstruct the sequence and label files for all RNAs in the SPOT-RNA PDB dataset. Run SPOT-RNA, MXfold2 and e2efold on both the original SPOT-RNA PDB dataset and the new reconstructed dataset. Report the accuracy (MCC and F1) on both the original and the reconstructed datasets during the weekly meeting.
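For the accuracy report, F1 and MCC for one RNA can be computed from the predicted and native pair sets. The Python sketch below makes two assumptions that should be checked against Marc's benchmark script: each pair is stored as (i, j) with i < j, and true negatives are all other possible i < j position pairs in a sequence of the given length. The function name pair_metrics and the toy pairs are made up for illustration.

```python
import math


def pair_metrics(pred, native, length):
    """Return (F1, MCC) for predicted vs. native base pairs of one RNA."""
    pred, native = set(pred), set(native)
    tp = len(pred & native)   # pairs in both prediction and label
    fp = len(pred - native)   # predicted but not in the label
    fn = len(native - pred)   # in the label but not predicted
    # Assumption: every other i < j position pair counts as a true negative.
    tn = length * (length - 1) // 2 - tp - fp - fn
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return f1, mcc


if __name__ == "__main__":
    # Toy example: 3 native pairs, 2 of them predicted, sequence length 10.
    native = {(1, 10), (2, 9), (3, 8)}
    pred = {(1, 10), (2, 9)}
    f1, mcc = pair_metrics(pred, native, 10)
    print(f"F1={f1:.2f} MCC={mcc:.2f}")
```

Averaging these per-RNA values over the dataset is one way to obtain a single F1 and MCC per method. Whether the existing numbers were averaged per RNA or pooled over all pairs is not stated here.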

Here is the average result on the SPOT-RNA PDB dataset for all three methods:

Method	BPpred	F1	MCC	BPnat
SPOT-RNA	23.46	0.79	0.80	32.40
MXfold2	20.97	0.67	0.68	32.40
E2Efold	12.53	0.29	0.21	32.40