ml4bio / e2efold

PyTorch implementation of "RNA Secondary Structure Prediction By Learning Unrolled Algorithms"
MIT License

Failure to reproduce Table 3 of the e2efold paper #6

Closed: kad-ecoli closed this issue 4 years ago

kad-ecoli commented 4 years ago

I tried to run https://github.com/ml4bio/e2efold/blob/master/e2efold_productive/e2efold_productive_short.py on the ArchiveII dataset used to benchmark e2efold in Table 3 of the e2efold paper. For time's sake, I only ran it on the subset of 3911 target RNAs with up to 600 nucleotides, rather than the full set of 3975 RNAs. Nonetheless, I do not think leaving out the remaining 64 RNAs (1.6% of 3975) would alter the conclusion of the benchmark. Even though the result of the e2efold pretrained model on this dataset is much better than what was shown in https://github.com/ml4bio/e2efold/issues/5, it is still worse than what was reported in the paper:

| Method | F1 | MCC | Predicted base pairs per RNA |
| --- | --- | --- | --- |
| e2efold | 0.5540 | 0.5595 | 42.2986 |
| LinearFold | 0.6060 | 0.6085 | 56.8372 |

In particular, e2efold turns out to be even worse than LinearFold, a thermodynamics-based RNA folding program, although according to the original paper e2efold is supposed to outperform all state-of-the-art algorithms on this dataset. Note that on average, each target among these 3911 RNAs has 59.1386 base pairs; e2efold is therefore clearly under-predicting base pairs.
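
(For clarity, the F1 values above are computed per target over base pairs and then averaged. A minimal sketch of the per-target computation, using my own helper and set representation rather than the e2efold code, would be:)

    def pair_f1(pred_pairs, true_pairs):
        """Precision, recall, and F1 over base pairs; each pair is an (i, j)
        tuple with i < j. Returns zeros when a set is empty."""
        tp = len(pred_pairs & true_pairs)                # correctly predicted pairs
        p = tp / len(pred_pairs) if pred_pairs else 0.0  # precision
        r = tp / len(true_pairs) if true_pairs else 0.0  # recall
        f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
        return p, r, f1

    # Toy example: reference structure with 3 pairs, prediction recovers 2 of them
    print(pair_f1({(0, 20), (1, 19), (5, 15)}, {(0, 20), (1, 19), (2, 18)}))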

I wondered whether such poor performance is caused by inconsistencies in the packaging details of e2efold_productive. To verify this, could the e2efold team kindly provide the detailed table of per-target F1 so that I can check the worst offenders? Thank you.

liyu95 commented 4 years ago

Thank you for reproducing the Table 3 performance! Yes, e2efold_productive and experiment_archiveii are slightly different.

For your information, the detailed reproduction code for ArchiveII and Table 3 is in the folder experiment_archiveii. As we discussed in the paper, we reported the performance on the RNA types in ArchiveII that also have enough training samples in the RNAStralign training dataset. More specifically, the performance of the following RNA types is reported: ['RNaseP', '5s', 'tmRNA', 'tRNA', 'telomerase', '16s']. We did not include grp1 because its sequence lengths in ArchiveII and RNAStralign are completely different. For SRP, we do not have enough training samples in the RNAStralign dataset, which only contains 468 of them, while ArchiveII has many more SRP sequences (928).

You can run main.sh in the folder experiment_archiveii to reproduce the performance in Table 3.

If you want to go into details, you can check e2e_learning_stage3.py and e2e_learning_stage3_rnastralign_all_long.py in the same folder.

kad-ecoli commented 4 years ago

Can you give the list of ArchiveII target (i.e. RNA names, not just the types) on which Table 3 is calculated?

liyu95 commented 4 years ago

Thank you very much for your interest!

Everything (name, sequence length, performance) is stored with the following code (lines 228-244 in e2e_learning_stage3.py):

    # Build one row per test RNA: name, RNA type (prefix of the name), and sequence length
    e2e_result_df = pd.DataFrame()
    e2e_result_df['name'] = [a.name for a in test_data.data]
    e2e_result_df['type'] = list(map(lambda x: x.split('_')[0], [a.name for a in test_data.data]))
    e2e_result_df['seq_lens'] = list(map(lambda x: x.numpy(), seq_lens_list))
    # Per-target precision/recall/F1, exact and the relaxed "shift" variant
    e2e_result_df['exact_p'] = pp_exact_p
    e2e_result_df['exact_r'] = pp_exact_r
    e2e_result_df['exact_f1'] = pp_exact_f1
    e2e_result_df['shift_p'] = pp_shift_p
    e2e_result_df['shift_r'] = pp_shift_r
    e2e_result_df['shift_f1'] = pp_shift_f1
    # pdb.set_trace()
    # Keep only the RNA types reported in Table 3, then average the per-target metrics
    final_result = e2e_result_df[e2e_result_df['type'].isin(
        ['RNaseP', '5s', 'tmRNA', 'tRNA', 'telomerase', '16s'])]
    to_output = list(map(str,
        list(final_result[['exact_p', 'exact_r', 'exact_f1', 'shift_p', 'shift_r', 'shift_f1']].mean().values.round(3))))
    print('Number of sequences: ', len(final_result))
    print(to_output)
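
If it helps, the per-target table you asked for could be dumped to a file with something along these lines (this line is not in the script; the filename is just a placeholder):

    # Hypothetical addition after the snippet above: write the per-target results
    # (name, type, length, exact/shift precision, recall, F1), sorted so the
    # lowest-F1 targets come first; the output filename is a placeholder.
    final_result.sort_values('exact_f1').to_csv('archiveii_per_target_f1.csv', index=False)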

Could you please take a look?

kad-ecoli commented 4 years ago

I see. So the paper actually only tests on a subset of 2877 (72%) RNAs from the full set of 3975 ArchiveII RNAs. This was not made clear in the paper. On this subset, e2efold_productive_short.py (or e2efold_productive_long.py for RNAs longer than 600 nucleotides) indeed outperforms LinearFold. However, the performance is still not as impressive as what is reported in the paper: F1 is <0.7 for e2efold_productive, even though the paper reports F1=0.821.

| Method | F1 | MCC | Predicted base pairs per RNA |
| --- | --- | --- | --- |
| e2efold | 0.6894 | 0.6943 | 36.5262 |
| LinearFold | 0.6186 | 0.6216 | 50.6879 |

liyu95 commented 4 years ago

Thank you for running the code and reproducing the results! In the section "Test On ArchiveII Without Re-training" in our paper, we made it clear that we tested on the overlapping RNA families in ArchiveII, instead of the entire set. In Table 3, we also reported the performance as F1=0.686, which you have reproduced. We have never claimed that we can reach F1>0.8 on the ArchiveII dataset. Could you please clarify what your concern is?

kad-ecoli commented 4 years ago

Sorry, I mixed up Table 2 and Table 3. You are correct that e2efold should have F1=0.69 on this dataset.

The main reason for my misunderstanding was that the original text says "We then test the model on sequences in ArchiveII that have overlapping RNA types (5SrRNA, 16SrRNA, etc) with the RNAStralign dataset", which apparently should have included the "SRP" RNA type, since it is shared by both datasets with hundreds of RNAs.

In fact, "SRP" was excluded from Table 3. This exclusion made e2efold appeared better than SOTA.