ml4bio / e2efold

PyTorch implementation of "RNA Secondary Structure Prediction By Learning Unrolled Algorithms"
MIT License

unexpected poor performance of e2efold #5

Open kad-ecoli opened 4 years ago

kad-ecoli commented 4 years ago

I have tested e2efold on a set of 361 PDB chains, where secondary structures for RNAs shorter than 600 nucleotides are predicted by e2efold_productive/e2efold_productive_short.py, while those longer than 600 nucleotides are predicted by e2efold_productive/e2efold_productive_long.py.
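
For reference, this is roughly how the chains were partitioned before running the two scripts. It is a minimal sketch only: the input file name and the `read_fasta` helper are placeholders, not part of the e2efold repo.

```python
def read_fasta(path):
    """Parse a FASTA file into a {name: sequence} dict."""
    seqs, name = {}, None
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line.startswith(">"):
                name = line[1:].split()[0]
                seqs[name] = ""
            elif name is not None:
                seqs[name] += line
    return seqs

seqs = read_fasta("pdb_chains.fasta")  # placeholder input file
# Dispatch at the 600-nt threshold described above.
short = {n: s for n, s in seqs.items() if len(s) < 600}
long_ = {n: s for n, s in seqs.items() if len(s) >= 600}
print(f"{len(short)} chains for the short model, {len(long_)} for the long model")
```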

To my great surprise, when evaluated against the DSSR-assigned canonical base pairs of this dataset, the *.ct files predicted by e2efold have very low average F1 and MCC of 0.2400 and 0.2401, respectively, significantly worse than the SOTA methods mentioned in Table 2 of the e2efold paper (https://openreview.net/pdf?id=S1eALyrYDH). The following is my benchmark result, ranked in ascending order of F1 score.

| Method | F1 | MCC | Predicted base pairs per RNA |
|---|---|---|---|
| e2efold | 0.2400 | 0.2401 | 18.2133 |
| mfold | 0.6275 | 0.6285 | 32.4903 |
| RNAstructure (ProbablePair) | 0.6443 | 0.6475 | 29.4238 |
| CONTRAfold | 0.6617 | 0.6642 | 32.5845 |

I have attached the predicted ct files below. Additionally, I included the 4 sequences listed under e2efold_productive/_seqs/seq and made sure that my run generates ct files identical to the ones shown in the GitHub repository. e2e.zip

Could you check whether I ran the e2efold program incorrectly, resulting in such low performance? In particular, could you check why e2efold predicts on average only 18.2133 base pairs per RNA chain, while the actual average number of canonical base pairs in the native structures is as high as 28.6648? Thank you.
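
For reference, here is a minimal sketch of how the per-chain F1/MCC above can be computed from base pairs parsed out of .ct files. It is illustrative only, not the exact benchmark script used; the reference pairs are assumed to be the DSSR-assigned canonical pairs.

```python
import math

def read_ct_pairs(path):
    """Return the set of base pairs (i, j) with i < j from a .ct file.
    CT columns: index, base, prev, next, pairing partner (0 = unpaired),
    natural index."""
    pairs = set()
    with open(path) as f:
        next(f)  # skip the header line (sequence length and name)
        for line in f:
            fields = line.split()
            if len(fields) < 6:
                continue
            i, j = int(fields[0]), int(fields[4])
            if j > i:  # count each pair once; j == 0 means unpaired
                pairs.add((i, j))
    return pairs

def f1_mcc(pred, true, length):
    """F1 and MCC over all length*(length-1)/2 candidate pairs."""
    tp = len(pred & true)
    fp = len(pred - true)
    fn = len(true - pred)
    tn = length * (length - 1) // 2 - tp - fp - fn
    f1 = 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 1.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return f1, mcc
```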

liyu95 commented 4 years ago

Hi, thank you very much for your interest!

For your information, unlike the previous methods, ours is a learning-based method: we assume that the distribution of the test dataset is similar to that of the training dataset. We have also noticed that the performance of our method is not very good when the trained model is applied to a new RNA type that does not appear in the training dataset, because the distribution and threshold learned from the training sequences differ from those of the test sequences. We found that the distribution of the data you provided is different from our training data in both sequence length and RNA type. If you want to benchmark our method or apply it to a new RNA type, we suggest you retrain the model.

The out-of-distribution learning problem is a fascinating, open research direction that we are currently exploring.

kad-ecoli commented 4 years ago

I am afraid this may not be the only reason. For example, SPOT-RNA is a deep learning algorithm similar to e2efold, except that it does not use unrolled algorithms and has a much simpler loss function. Yet its performance on this dataset is actually very good.

Also, could you explain what you mean by "a new RNA type"?

| Method | F1 | MCC | Predicted base pairs per RNA |
|---|---|---|---|
| e2efold | 0.2400 | 0.2401 | 18.2133 |
| SPOT-RNA | 0.7261 | 0.7307 | 29.0554 |
| mfold | 0.6275 | 0.6285 | 32.4903 |
| RNAstructure (ProbablePair) | 0.6443 | 0.6475 | 29.4238 |
| CONTRAfold | 0.6617 | 0.6642 | 32.5845 |

liyu95 commented 4 years ago

Thank you very much for raising that. For your information, SPOT-RNA was trained on bpRNA; our training datasets are different. If you want to compare the two fairly, I would suggest you retrain SPOT-RNA on our dataset. Also, because of the design of their deep learning model, you would run into memory and running-time issues if you used SPOT-RNA to predict long sequences. Please have a try before you draw a conclusion. By "an RNA type", I mean families such as 5S rRNA and tRNA; RNA has many different families, as shown in Section D.5 of our paper.

xinshi-chen commented 4 years ago

Thank you, Chengxin, for testing our trained model on your dataset. It is neither unexpected nor surprising to me that the performance is not so good there: it is very likely that the distribution of your test dataset is very different from that of our training dataset.

The performance of a deep learning model is mainly determined by the following 3 factors: (1) model design (i.e., the deep architecture); (2) training algorithm (e.g., SGD, robust training, loss function design, and other tricks); (3) training data.

E2Efold focuses on improving (1) and also has contributions in (2). To fairly compare E2Efold with SPOT-RNA, we suggest you use the SAME training data for both models and then compare their performance on your test dataset.

As Yu mentioned earlier, SPOT-RNA is trained on bpRNA. If their training dataset (after any preprocessing or filtering) is available, you can try to retrain our model on it. Our code for training the model is available in this repo. If you encounter any difficulty, we are very happy to help you out.

kad-ecoli commented 4 years ago

This brings me to my second question. When e2efold generated the training/validation/test sets, did you check for sequence redundancy? For example, cd-hit-est clustering at an 80% sequence identity cutoff (already a very permissive cutoff) shows that only 1231 out of all 3966 archiveII RNA sequences can be considered unique. The large redundancy is partly due to the archiveII dataset including both full-length and individual-domain sequences. For example,

16s_A.pyrophilus_domain3.ct
16s_A.pyrophilus_domain4.ct
16s_A.pyrophilus_domain1.ct
16s_A.pyrophilus_domain2.ct

are just sub-sequences of 16s_A.pyrophilus.ct. If trained on the full-length sequence and tested on the individual domain sequences (or vice versa), a machine learning algorithm can easily show superficially high accuracy. I wonder whether and where this sequence redundancy factor is considered in the archiveII/rnastralign dataset generation.
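
For reference, the redundancy check above can be reproduced with something like the following sketch, assuming cd-hit-est is on PATH; the file names are placeholders.

```python
import subprocess

# -c 0.8 is the 80% identity cutoff; -n 5 is the word size recommended
# for that identity range.
subprocess.run(
    ["cd-hit-est", "-i", "archiveII.fasta", "-o", "archiveII_nr80.fasta",
     "-c", "0.8", "-n", "5"],
    check=True,
)

# cd-hit-est writes a <output>.clstr report in which every cluster starts
# with a ">Cluster" line, so counting those lines counts unique sequences.
with open("archiveII_nr80.fasta.clstr") as f:
    n_clusters = sum(1 for line in f if line.startswith(">Cluster"))
print(f"{n_clusters} clusters at 80% identity")
```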

xinshi-chen commented 4 years ago

There is a preprocessing step where we remove redundant structures, and there won't be cases like the one you mention, where the same sequence appears in different forms in both the train and test datasets. After removing the redundancy, our experiment uses the following data split: [RNAStralign train, RNAStralign vali, RNAStralign test]. The model is trained on [RNAStralign train], validated on [RNAStralign vali], and tested on both [RNAStralign test] and [ArchiveII(filtered)], where [ArchiveII(filtered)] is also filtered so that it does not overlap with RNAStralign.
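
As an illustration only (this is not our actual preprocessing code, and the split fractions here are placeholders), a deduplicate-then-split step looks like:

```python
import random

def dedup_and_split(records, seed=0, frac=(0.8, 0.1, 0.1)):
    """records: list of (sequence, structure) tuples.
    Drops exact duplicates, then splits into train/vali/test."""
    unique = list(dict.fromkeys(records))  # keeps first occurrence of each pair
    random.Random(seed).shuffle(unique)
    n_train = int(frac[0] * len(unique))
    n_vali = int(frac[1] * len(unique))
    return (unique[:n_train],                  # RNAStralign train
            unique[n_train:n_train + n_vali],  # RNAStralign vali
            unique[n_train + n_vali:])         # RNAStralign test
```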

Still, I guess the BEST way to resolve your concerns about both our methodology and the dataset is to train E2Efold on the training set of SPOT-RNA. Give it a try?

kad-ecoli commented 4 years ago

For the RNAStralign dataset, the paper mentions that "After removing redundant sequences and structures, 30451 structures remain." Could you explain in a little more detail what the original text means by "removing redundant sequences and structures"?

HarveyYan commented 3 years ago

@kad-ecoli Running cd-hit-est with -c 0.8 showed that only around 3000 sequences in RNAStrAlign can be considered unique.