Reproduce performance problems #19

DaHaiHuha commented 4 years ago

Great work, and thanks for releasing the code, dataset, and pre-trained model.

But I still have some questions about the training procedure. Could you kindly spare some time to review my process? I've used the code in the GitHub repo to train the model several times but failed to reproduce the reported results (the gap is about 5%), so I suspect there are some differences between how your released model was trained and how I trained mine.

The training details are as follows:

python train_similarity_and_contact.py \
    --rnn-type lstm \
    --embedding-dim 100 \
    --input-dim 512 \
    --rnn-dim 512 \
    --hidden-dim 50 \
    --width 7 \
    --num-layers 3 \
    --dropout 0 \
    --epoch-scale 5 \
    --epoch-size 100000 \
    --num-epochs 100 \
    --similarity-batch-size 64 \
    --contact-batch-size 10 \
    --weight-decay 0 \
    --lr 0.001 \
    --tau 0.5 \
    --lambda 0.1 \
    --augment 0.05 \
    --lm /embedding/data/raw/bepler/pretrained_models/pfam_lm_lstm2x1024_tied_mb64.sav \
    --output /embedding/v-dache/save_logs/train_lambda0.1_augment0.05.txt \
    --save-prefix /embedding/v-dache/save_logs/train_lambda0.1_augment0.05 \
    --device -2
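As a quick sanity check on the command above, the released model filenames appear to encode several of these hyperparameters (e.g. ssa_L1_100d_lstm3x512_lm_i512_mb64_tau0.5_lambda0.1_p0.05_epoch100.sav). The sketch below parses such a filename and compares it against the flags. The token-to-flag mapping (for instance `p0.05` → `--augment`, `mb64` → `--similarity-batch-size`) is an assumption inferred from matching values, not documented behavior, and flags such as `--lr`, `--dropout`, `--width` or `--epoch-scale` are not encoded in the name at all.

```python
import re

# ASSUMPTION: token patterns -> train_similarity_and_contact.py flags,
# inferred from matching values; not documented by the repository.
TOKEN_PATTERNS = {
    r"^(\d+)d$": "embedding-dim",
    r"^lstm(\d+)x(\d+)$": ("num-layers", "rnn-dim"),
    r"^i(\d+)$": "input-dim",
    r"^mb(\d+)$": "similarity-batch-size",
    r"^tau([\d.]+)$": "tau",
    r"^lambda([\d.]+)$": "lambda",
    r"^p([\d.]+)$": "augment",
    r"^epoch(\d+)$": "num-epochs",
}

# Values copied from the training command above (strings, as on argv).
COMMAND_FLAGS = {
    "embedding-dim": "100",
    "input-dim": "512",
    "rnn-dim": "512",
    "num-layers": "3",
    "num-epochs": "100",
    "similarity-batch-size": "64",
    "tau": "0.5",
    "lambda": "0.1",
    "augment": "0.05",
}


def parse_model_name(filename):
    """Extract hyperparameters from a model filename using TOKEN_PATTERNS."""
    stem = filename.rsplit("/", 1)[-1]
    if stem.endswith(".sav"):
        stem = stem[: -len(".sav")]
    params = {}
    for token in stem.split("_"):
        for pattern, flags in TOKEN_PATTERNS.items():
            match = re.match(pattern, token)
            if match:
                names = flags if isinstance(flags, tuple) else (flags,)
                for name, value in zip(names, match.groups()):
                    params[name] = value
                break  # tokens such as "ssa", "L1" or "lm" match nothing
    return params


def mismatches(model_params, command_flags):
    """Return {flag: (model value, command value)} where they disagree."""
    diffs = {}
    for key, value in model_params.items():
        expected = command_flags.get(key)
        if expected is None or float(expected) != float(value):
            diffs[key] = (value, expected)
    return diffs


if __name__ == "__main__":
    name = "ssa_L1_100d_lstm3x512_lm_i512_mb64_tau0.5_lambda0.1_p0.05_epoch100.sav"
    print(mismatches(parse_model_name(name), COMMAND_FLAGS))  # → {}
```

An empty result only means the filename-encoded values agree with the command; it says nothing about the unencoded flags or the code version.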

Here are my questions; could you kindly answer them?

1. Is the released LM the same as the one you used for training?
2. Do I need to modify the code to reproduce the result?
3. When loading the samples for the SCOP task, I get 22408 pairs, but after resampling to match the CMAP task only about 10% (2241) are left. Does the resampled dataset matter?
4. Was the released model obtained by searching over hyperparameters? If so, how was the search done?
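For reference, 2241 / 22408 ≈ 0.100, so each resampled SCOP set is about one tenth of the loaded pairs. How the training script actually draws this subset is not stated in this thread. The hypothetical helper below (`subsample`, not a function from the repository) only illustrates a seeded uniform subsample without replacement, which makes the drawn subset identical across runs and so removes one source of run-to-run variance when comparing against the released model.

```python
import random


def subsample(pairs, fraction, seed):
    """Hypothetical helper: draw a reproducible uniform subsample of `pairs`.

    Not the repository's resampling code. Sampling is without replacement,
    and the same seed always yields the same subset.
    """
    rng = random.Random(seed)  # private RNG so global random state is untouched
    k = round(len(pairs) * fraction)
    return rng.sample(pairs, k)


if __name__ == "__main__":
    scop_pairs = list(range(22408))  # stand-in for the 22408 loaded SCOP pairs
    subset = subsample(scop_pairs, 0.1, seed=0)
    print(len(subset))  # → 2241
```

Whether a subset like this is fixed once or redrawn every epoch is a design choice: redrawing exposes the model to more of the pairs over training, while fixing it makes separate runs easier to compare.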

Besides, I revised the source code slightly and submitted a PR on GitHub: https://github.com/tbepler/protein-sequence-embedding-iclr2019/pull/18/commits/dc75f65c1734e7b696825fdefbb4bdc64385d6ae Could this lead to a performance drop?

The evaluation results of the models are as follows:

[Image: results from eval_similarity.py]

[Image: results from eval_similarity.py & eval_secstr.py]

[Image: results from eval_contact_scop.py]

[Image: results from eval_transmembrane.py]

Any suggestions will be appreciated!

amelvim commented 3 years ago

Hi! Thanks for sharing the code, datasets and models!

Any idea what might be happening here? I am facing the same problem when trying to reproduce the similarity results reported in the paper. I ran "eval_similarity.py" with the provided SSA models (full and without contact prediction) and the test datasets (2.06-test and 2.07-new). The results show exactly the same performance drop as the ones reported by @DaHaiHuha (using PyTorch version 1.2.0).

I would appreciate any kind of help! Thank you so much!

tbepler commented 3 years ago

Strange. Is this still the case if you use pytorch 0.4.0?

amelvim commented 3 years ago

Thanks for your response! Yes, I get the same results (performance drop) when running in a conda environment with PyTorch 0.4.0. These are the rest of the libraries:

tbepler commented 3 years ago

I get the expected performance metrics with the following conda environment:

and command:

python eval_similarity.py pretrained_models/ssa_L1_100d_lstm3x512_lm_i512_mb64_tau0.5_lambda0.1_p0.05_epoch100.sav

If you roll back all of your packages to the above versions, do you get the expected results?
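To compare two environments package by package, here is a minimal sketch that reads installed versions with Python's standard `importlib.metadata` (Python 3.8+). The package list is an assumption, not the repository's stated requirements; for a conda environment, `conda list --export` gives the complete picture.

```python
from importlib.metadata import PackageNotFoundError, version

# ASSUMPTION: packages worth comparing; adjust to the environments in question.
PACKAGES = ["torch", "numpy", "scipy", "pandas", "scikit-learn"]


def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = version(name)
        except PackageNotFoundError:
            versions[name] = None
    return versions


def diff_versions(local, reference):
    """Return {package: (local, reference)} for every version that differs."""
    return {
        name: (local.get(name), ref)
        for name, ref in reference.items()
        if local.get(name) != ref
    }


if __name__ == "__main__":
    for name, ver in installed_versions(PACKAGES).items():
        print(f"{name}=={ver}" if ver else f"{name}: not installed")
```

The reference dict passed to `diff_versions` would be filled in by hand from the other environment's listing.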

tbepler commented 3 years ago

I can't reproduce this error with newer packages either. The following conda environment gives the expected output for me.

amelvim commented 3 years ago

Sorry, I still cannot reproduce the results. Is there any chance we are using different training or testing pairs, i.e., the ones downloaded from this GitHub repository vs. the ones you have locally?

These are my train/test results using eval_similarity.py:

Model           Dataset     Acc      Pearson  Spearman  Class    Fold     Superfamily  Family
SSA-similarity  2.06-train  0.97809  0.94249  0.70616   0.97972  0.88670  0.91647      0.72263
SSA-similarity  2.06-test   0.91549  0.82025  0.65051   0.83429  0.77623  0.84859      0.52733
SSA-lambda0.1   2.06-train  0.99816  0.99562  0.71213   1.00000  0.99296  0.99575      0.92497
SSA-lambda0.1   2.06-test   0.94926  0.89735  0.68512   0.89839  0.88330  0.94330      0.65151

Pre-trained models:
SSA-similarity: ssa_L1_100d_lstm3x512_lm_i512_mb64_tau0.5_p0.05_epoch100.sav
SSA-lambda0.1: ssa_L1_100d_lstm3x512_lm_i512_mb64_tau0.5_lambda0.1_p0.05_epoch100.sav

Datasets:
2.06-train: astral-scopedom-seqres-gd-sel-gs-bib-95-2.06.train.sampledpairs.txt
2.06-test: astral-scopedom-seqres-gd-sel-gs-bib-95-2.06.test.sampledpairs.txt
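To quantify gaps like the ~5% mentioned in the first post, here is a small sketch that subtracts two rows of metrics column by column. It is applied to the two 2.06-test rows from the table above; the paper's values (not reproduced in this thread) could be entered as another row in the same way. The helper name `metric_gaps` is illustrative only.

```python
METRICS = ["Acc", "Pearson", "Spearman", "Class", "Fold", "Superfamily", "Family"]

# 2.06-test rows copied from the eval_similarity.py results above.
TEST_RESULTS = {
    "SSA-similarity": [0.91549, 0.82025, 0.65051, 0.83429, 0.77623, 0.84859, 0.52733],
    "SSA-lambda0.1": [0.94926, 0.89735, 0.68512, 0.89839, 0.88330, 0.94330, 0.65151],
}


def metric_gaps(results, model_a, model_b):
    """Per-metric difference model_b - model_a, rounded to 5 decimals."""
    return {
        metric: round(b - a, 5)
        for metric, a, b in zip(METRICS, results[model_a], results[model_b])
    }


if __name__ == "__main__":
    gaps = metric_gaps(TEST_RESULTS, "SSA-similarity", "SSA-lambda0.1")
    for metric, gap in gaps.items():
        print(f"{metric:12s} {gap:+.5f}")
```

A positive value means the second model scores higher on that metric.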

Thanks for your kind help!

tbepler commented 3 years ago

Ok, I managed to reproduce this error and figured out the issue. I had tweaked the BiLM model code when releasing it, and that change is causing the performance drop. I'll revert the code, which should solve the problem.

tbepler commented 3 years ago

This should now be fixed with commit 89a0ac2f92fea164c9d39eed167348639f3c82a7.