tencent-ailab / grover

This is a PyTorch implementation of the paper: Self-Supervised Graph Transformer on Large-Scale Molecular Data

Reproduce BBBP result #2

Open yuhui-zh15 opened 3 years ago

yuhui-zh15 commented 3 years ago

Hi, thanks for your great work and clear documentation! I'm trying to reproduce your result on BBBP. However, although I followed the exact settings in the README, there is a large gap between my result (89.4) and the reported number (93.6). I have listed all the steps below so they are fully reproducible. Could you check whether anything is wrong on my side? Thanks a lot for your help in advance!

  1. Create the Conda environment:
git clone git@github.com:tencent-ailab/grover.git
cd grover
conda create --name chem --file requirements.txt
conda activate chem
  2. Download the pretrained model:
wget https://ai.tencent.com/ailab/ml/ml-data/grover-models/pretrain/grover_base.tar.gz
tar -xvf grover_base.tar.gz
  3. Feature extraction and fine-tuning:
python scripts/save_features.py --data_path exampledata/finetune/bbbp.csv \
                                --save_path exampledata/finetune/bbbp.npz \
                                --features_generator rdkit_2d_normalized \
                                --restart 

python main.py finetune --data_path exampledata/finetune/bbbp.csv \
                        --features_path exampledata/finetune/bbbp.npz \
                        --save_dir model/finetune/bbbp/ \
                        --checkpoint_path grover_base.pt \
                        --dataset_type classification \
                        --split_type scaffold_balanced \
                        --ensemble_size 1 \
                        --num_folds 3 \
                        --no_features_scaling \
                        --ffn_hidden_size 200 \
                        --batch_size 32 \
                        --epochs 10 \
                        --init_lr 0.00015

The training log (quiet.log) is:

Fold 0
Model 0 best val loss = 0.470996 on epoch 9
Model 0 test auc = 0.887339
Ensemble test auc = 0.887339
Fold 1
Model 0 best val loss = 0.476553 on epoch 7
Model 0 test auc = 0.891758
Ensemble test auc = 0.891758
Fold 2
Model 0 best val loss = 0.488360 on epoch 9
Model 0 test auc = 0.904175
Ensemble test auc = 0.904175
3-fold cross validation
Seed 0 ==> test auc = 0.887339
Seed 1 ==> test auc = 0.891758
Seed 2 ==> test auc = 0.904175
overall_scaffold_balanced_test_auc=0.894424
std=0.007127
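For reference, the overall score and std at the end of the log are simply the mean and population standard deviation of the per-seed test AUCs, which can be checked in a few lines of Python (numbers taken from the log above):

```python
import statistics

# Per-seed test AUCs from the log above
aucs = [0.887339, 0.891758, 0.904175]

mean_auc = sum(aucs) / len(aucs)
std_auc = statistics.pstdev(aucs)  # population std (ddof=0), which matches the logged value

print(f"overall_test_auc={mean_auc:.6f}")  # 0.894424
print(f"std={std_auc:.6f}")               # 0.007127
```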
TWRogers commented 3 years ago

Firstly, thanks to the authors for the easy-to-use codebase, it's deeply appreciated!

I can confirm that I have the same issue as @yuhui-zh15 for BBBP, in my case I get

3-fold cross validation
Seed 0 ==> test auc = 0.901969
Seed 1 ==> test auc = 0.903515
Seed 2 ==> test auc = 0.876906
overall_scaffold_balanced_test_auc=0.894130
std=0.012196

I have done multiple runs and experimented with a few settings, including toggling args.dense and, just in case, changing the split type to random, but I can't get an AUC close to the one stated in the paper. The occasional lucky random fold reaches 0.95 AUC, but this disappears in the averaging.
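The split type matters here: scaffold splitting assigns whole Bemis-Murcko scaffold groups to one split, so the test set contains scaffolds never seen in training, which typically makes it both harder and noisier across seeds than a random split. A simplified pure-Python sketch of the grouping idea (placeholder scaffold keys stand in for RDKit's Murcko scaffolds, and the real "balanced" variant also orders groups by size):

```python
import random
from collections import defaultdict

def scaffold_split(smiles_to_scaffold, frac_train=0.8, frac_val=0.1, seed=0):
    """Assign whole scaffold groups to train/val/test (a simplified sketch).

    smiles_to_scaffold: dict mapping each SMILES string to its scaffold key
    (in the real pipeline this key would come from RDKit's Bemis-Murcko scaffold).
    """
    # Group molecules by scaffold so each group stays in a single split.
    groups = defaultdict(list)
    for smi, scaf in smiles_to_scaffold.items():
        groups[scaf].append(smi)

    # Shuffle the groups; the balanced variant additionally sorts by group size.
    group_list = list(groups.values())
    random.Random(seed).shuffle(group_list)

    n = len(smiles_to_scaffold)
    train, val, test = [], [], []
    for group in group_list:
        if len(train) + len(group) <= frac_train * n:
            train.extend(group)
        elif len(val) + len(group) <= frac_val * n:
            val.extend(group)
        else:
            test.extend(group)
    return train, val, test
```

Because whole scaffold groups move between splits when the seed changes, per-fold test AUC can swing noticeably, which is consistent with the spread across seeds in the logs above.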

I will experiment with the large model and some of the other endpoints to see if I have any luck reproducing any of the results.

TWRogers commented 3 years ago

P.S. The downloadable fine-tuned models seem to be much larger than the base model and vary in size, so perhaps different hyperparameters were used for each endpoint, and even ensembles in some cases? Unfortunately, I am having difficulty downloading them to verify.

yuhui-zh15 commented 3 years ago

I tried fine-tuning the large model, but it seems to perform even worse than the base model.

Fold 0
Model 0 best val loss = 0.486441 on epoch 7
Model 0 test auc = 0.893492
Ensemble test auc = 0.893492
Fold 1
Model 0 best val loss = 0.479239 on epoch 8
Model 0 test auc = 0.888364
Ensemble test auc = 0.888364
Fold 2
Model 0 best val loss = 0.490516 on epoch 0
Model 0 test auc = 0.892271
Ensemble test auc = 0.892271
3-fold cross validation
Seed 0 ==> test auc = 0.893492
Seed 1 ==> test auc = 0.888364
Seed 2 ==> test auc = 0.892271
overall_scaffold_balanced_test_auc=0.891375
std=0.002187
WenjinW commented 3 years ago

Thanks to the authors for providing the source code. Unfortunately, I get the same results as @yuhui-zh15 and @TWRogers on BBBP; my results are as follows:

Model 0 test auc = 0.895133
Ensemble test auc = 0.895133
1-fold cross validation
Seed 0 ==> test auc = 0.895133
overall_scaffold_balanced_test_auc=0.895133
std=0.000000

The test AUC (0.895133) is lower than the value reported in the paper (0.936). Are there any special tricks that need to be considered?
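One possibility raised earlier in this thread is that the reported numbers come from per-endpoint hyperparameter tuning rather than the single README configuration. A sketch of generating a small random search over the fine-tuning knobs from the README command (the candidate value ranges are guesses for illustration, not the authors' actual search space):

```python
import random

# Candidate values are illustrative guesses, not the authors' search space.
# All flags below appear in the README fine-tuning command.
SEARCH_SPACE = {
    "--init_lr": [0.0001, 0.00015, 0.0002],
    "--ffn_hidden_size": [100, 200, 300],
    "--batch_size": [16, 32, 64],
    "--epochs": [10, 20, 30],
}

def sample_commands(n_trials, seed=0):
    """Generate n_trials random fine-tuning command lines for main.py."""
    rng = random.Random(seed)
    base = ("python main.py finetune --data_path exampledata/finetune/bbbp.csv "
            "--features_path exampledata/finetune/bbbp.npz "
            "--checkpoint_path grover_base.pt --dataset_type classification "
            "--split_type scaffold_balanced --num_folds 3 --no_features_scaling")
    commands = []
    for _ in range(n_trials):
        extras = " ".join(f"{flag} {rng.choice(values)}"
                          for flag, values in SEARCH_SPACE.items())
        commands.append(f"{base} {extras}")
    return commands

for cmd in sample_commands(3):
    print(cmd)
```

Each generated command would then be run and the configuration with the best validation loss kept, which could plausibly close part of the gap if the paper's numbers were tuned per task.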

wuhaoxz commented 9 months ago

Hello, I also encountered the same problem as you. Have you solved it? @yuhui-zh15 @TWRogers @WenjinW