dongZheX opened this issue 2 years ago
@dongZheX Thanks for using Graphormer. Could you please try to set CUDA_VISIBLE_DEVICES=0,1,2,3 in your script, without changing anything else?
I found a mistake in the hiv_pre.sh script in how the visible GPUs are set, so 4 GPUs are actually used.
I've tried this and can reproduce the result.
With 2 GPUs, we may need to adjust the warmup steps and power of polynomial learning rate scheduler. I'll come back when I find the correct setting for 2 GPUs.
@dongZheX Could you please try this setting for 2 GPUs (doubling the per-GPU batch size to 128 and the number of epochs to 16, without changing anything else)?
#!/usr/bin/env bash
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
n_gpu=2
epoch=16
max_epoch=$((epoch + 1))
batch_size=128
tot_updates=$((33000*epoch/batch_size/n_gpu))
warmup_updates=$((tot_updates/10))
CUDA_VISIBLE_DEVICES=0,1 fairseq-train \
--user-dir ../../graphormer \
--num-workers 16 \
--ddp-backend=legacy_ddp \
--dataset-name ogbg-molhiv \
--dataset-source ogb \
--task graph_prediction_with_flag \
--criterion binary_logloss_with_flag \
--arch graphormer_base \
--num-classes 1 \
--attention-dropout 0.1 --act-dropout 0.1 --dropout 0.0 \
--optimizer adam --adam-betas '(0.9, 0.999)' --adam-eps 1e-8 --clip-norm 5.0 --weight-decay 0.0 \
--lr-scheduler polynomial_decay --power 1 --warmup-updates $warmup_updates --total-num-update $tot_updates \
--lr 2e-4 --end-learning-rate 1e-9 \
--batch-size $batch_size \
--fp16 \
--data-buffer-size 20 \
--encoder-layers 12 \
--encoder-embed-dim 768 \
--encoder-ffn-embed-dim 768 \
--encoder-attention-heads 32 \
--max-epoch $max_epoch \
--save-dir ./ckpts \
--pretrained-model-name pcqm4mv1_graphormer_base \
--seed 1 \
--flag-m 3 \
--flag-step-size 0.001 \
--flag-mag 0.001
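For reference, plugging these values into the formulas above gives roughly 2062 total updates and 206 warmup updates. This is only a quick sanity check, assuming the ~33,000 training graphs implied by the script's constant:
n_gpu=2
epoch=16
batch_size=128
tot_updates=$((33000*epoch/batch_size/n_gpu))   # 33000*16/128/2 = 2062 with bash integer arithmetic
warmup_updates=$((tot_updates/10))              # 206 warmup updates (10% of the total)
echo "tot_updates=$tot_updates warmup_updates=$warmup_updates"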
I got an AUC of 0.818716 on the test set, taking the epoch that is optimal on the valid set (valid AUC 0.824427).
Yet another setting, which follows the warmup ratio (0.06) from the original Graphormer paper without changing anything else, gets an AUC of 0.816581 on the test set, taking the epoch that is optimal on the valid set (valid AUC 0.805187).
#!/usr/bin/env bash
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
n_gpu=2
epoch=8
max_epoch=$((epoch + 1))
batch_size=64
tot_updates=$((33000*epoch/batch_size/n_gpu))
warmup_updates=$((tot_updates * 6 / 100))
CUDA_VISIBLE_DEVICES=0,1 fairseq-train \
--user-dir ../../graphormer \
--num-workers 16 \
--ddp-backend=legacy_ddp \
--dataset-name ogbg-molhiv \
--dataset-source ogb \
--task graph_prediction_with_flag \
--criterion binary_logloss_with_flag \
--arch graphormer_base \
--num-classes 1 \
--attention-dropout 0.1 --act-dropout 0.1 --dropout 0.0 \
--optimizer adam --adam-betas '(0.9, 0.999)' --adam-eps 1e-8 --clip-norm 5.0 --weight-decay 0.0 \
--lr-scheduler polynomial_decay --power 1 --warmup-updates $warmup_updates --total-num-update $tot_updates \
--lr 2e-4 --end-learning-rate 1e-9 \
--batch-size $batch_size \
--fp16 \
--data-buffer-size 20 \
--encoder-layers 12 \
--encoder-embed-dim 768 \
--encoder-ffn-embed-dim 768 \
--encoder-attention-heads 32 \
--max-epoch $max_epoch \
--save-dir ./ckpts_$1_$2_$3_$4_$5_2gpu_2 \
--pretrained-model-name pcqm4mv1_graphormer_base \
--seed 1 \
--flag-m 3 \
--flag-step-size 0.001 \
--flag-mag 0.001
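For this 0.06-warmup variant, the same arithmetic works out to about 2062 total updates and 123 warmup updates (again only a sanity check, under the same ~33,000-graph assumption):
tot_updates=$((33000*8/64/2))               # = 2062 with bash integer arithmetic
warmup_updates=$((tot_updates * 6 / 100))   # = 123 (6% warmup ratio)
echo "tot_updates=$tot_updates warmup_updates=$warmup_updates"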
Hi, without changing anything else, I set CUDA_VISIBLE_DEVICES to 0,1,2,3.
With seed 1 (using pcqm4mv1_graphormer_base), the result is:
{'epoch-best': {'val': {'auc': 0.8268111745401395}, 'test': {'auc': 0.8071068343864125}}}
With seed 2, the result is: {'epoch-best': {'val': {'auc': 0.8010994364928343}, 'test': {'auc': 0.7638590999555581}}}
This result is also wrong; the full log is:
{'epoch-1': {'val': {'auc': 0.6367309230302529}, 'test': {'auc': 0.7048402991130949}}}
{'epoch-2': {'val': {'auc': 0.6810684267456005}, 'test': {'auc': 0.724410179120051}}}
{'epoch-3': {'val': {'auc': 0.7854260316409731}, 'test': {'auc': 0.7728344250574847}}}
{'epoch-4': {'val': {'auc': 0.7775173204146482}, 'test': {'auc': 0.789851796031148}}}
{'epoch-5': {'val': {'auc': 0.7927617366684133}, 'test': {'auc': 0.7691225629432109}}}
{'epoch-6': {'val': {'auc': 0.7645725894671042}, 'test': {'auc': 0.7639537804571715}}}
{'epoch-7': {'val': {'auc': 0.8010994364928343}, 'test': {'auc': 0.7638590999555581}}}
{'epoch-8': {'val': {'auc': 0.7837805539468484}, 'test': {'auc': 0.7722566807721292}}}
{'epoch-9': {'val': {'auc': 0.7525578445161468}, 'test': {'auc': 0.7359650648271598}}}
{'epoch-best': {'val': {'auc': 0.8010994364928343}, 'test': {'auc': 0.7638590999555581}}}
{'epoch-last': {'val': {'auc': 0.7525578445161468}, 'test': {'auc': 0.7359650648271598}}}
I'll try to change the epoch and batch_size.
With this script:
n_gpu=2
epoch=8
max_epoch=$((epoch + 1))
batch_size=64
tot_updates=$((33000*epoch/batch_size/n_gpu))
warmup_updates=$((tot_updates * 6 / 100))
CUDA_VISIBLE_DEVICES=6,7 fairseq-train \
--user-dir graphormer \
--num-workers 16 \
--ddp-backend=legacy_ddp \
--dataset-name ogbg-molhiv \
--dataset-source ogb \
--task graph_prediction_with_flag \
--criterion binary_logloss_with_flag \
--arch graphormer_base \
--num-classes 1 \
--attention-dropout 0.1 --act-dropout 0.1 --dropout 0.0 \
--optimizer adam --adam-betas '(0.9, 0.999)' --adam-eps 1e-8 --clip-norm 5.0 --weight-decay 0.0 \
--lr-scheduler polynomial_decay --power 1 --warmup-updates $warmup_updates --total-num-update $tot_updates \
--lr 2e-4 --end-learning-rate 1e-9 \
--batch-size $batch_size \
--fp16 \
--data-buffer-size 20 \
--encoder-layers 12 \
--encoder-embed-dim 768 \
--encoder-ffn-embed-dim 768 \
--encoder-attention-heads 32 \
--max-epoch $max_epoch \
--save-dir $save_dir_root \
--pretrained-model-name pcqm4mv1_graphormer_base \
--seed 1 \
--flag-m 3 \
--flag-step-size 0.001 \
--flag-mag 0.001 \
--tensorboard-logdir $tensorboard_dir_root \
--log-format simple --log-interval 100 \
--log-file $log_dir
I got the result:
{'epoch-1': {'val': {'auc': 0.6695133124354602}, 'test': {'auc': 0.7005593878615732}}}
{'epoch-2': {'val': {'auc': 0.6913963272447594}, 'test': {'auc': 0.7289896237899252}}}
{'epoch-3': {'val': {'auc': 0.7223340656781544}, 'test': {'auc': 0.7466407744478581}}}
{'epoch-4': {'val': {'auc': 0.49609926796159937}, 'test': {'auc': 0.49780109365640635}}}
{'epoch-5': {'val': {'auc': 0.535256734354939}, 'test': {'auc': 0.4508347342183062}}}
{'epoch-6': {'val': {'auc': 0.4679928542756374}, 'test': {'auc': 0.4747376963654281}}}
{'epoch-7': {'val': {'auc': 0.5192539275438257}, 'test': {'auc': 0.49205360075744403}}}
{'epoch-8': {'val': {'auc': 0.4971748036611112}, 'test': {'auc': 0.5593279616640581}}}
{'epoch-9': {'val': {'auc': 0.5475518539967947}, 'test': {'auc': 0.5567580623345506}}}
{'epoch-best': {'val': {'auc': 0.535256734354939}, 'test': {'auc': 0.4508347342183062}}}
{'epoch-last': {'val': {'auc': 0.5475518539967947}, 'test': {'auc': 0.5567580623345506}}}
It is strange.
@dongZheX Could you please provide your log from fine-tuning the model? We can check the training and validation losses during fine-tuning. Thanks.
My logs are in the attached file. "hiv_base_warmup006" means we set warmup_updates=$((tot_updates * 6 / 100)). "hiv_base_v4" means we set n_gpu=4 and trained the model on 4 x 3090. The log files are in the "logs" directory and the results are in the "result" directory. exp.zip
@dongZheX Thanks. It seems that the attached files cannot be downloaded. Could you please double check?
I put the files in my repository: https://github.com/dongZheX/myexp
@dongZheX Thanks for reporting the issue. After checking, we found that we used the wrong checkpoint for fine-tuning on MolHIV. #96 was opened to fix this. The following lists the valid and test AUCs on MolHIV for 10 seeds, with an average test AUC of 0.805.
seed 0, valid best 0.82628413, test 0.81082449
seed 1, valid best 0.81708539, test 0.80874539
seed 2, valid best 0.81538782, test 0.81652658
seed 3, valid best 0.8399321, test 0.80314764
seed 4, valid best 0.79637443, test 0.79757309
seed 5, valid best 0.8203549, test 0.82088575
seed 6, valid best 0.79118367, test 0.798203
seed 7, valid best 0.82911852, test 0.80339304
seed 8, valid best 0.81123889, test 0.80003865
seed 9, valid best 0.79634073, test 0.7965915
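As a quick check of the 0.805 figure, averaging the ten test AUCs above with a small awk one-liner (nothing Graphormer-specific):
printf '%s\n' 0.81082449 0.80874539 0.81652658 0.80314764 0.79757309 0.82088575 0.798203 0.80339304 0.80003865 0.7965915 | awk '{ s += $1 } END { printf "mean test AUC = %.4f\n", s / NR }'
which prints mean test AUC = 0.8056.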
Thanks, let me try again.
Thanks. It works now. By the way, if I want to train HIV with more GPUs, how should I change the settings of tot_updates, warmup_updates, batch_size, and epoch? If I want to reproduce the result on pcba, can I use the pretrained model pcqm4mv1_graphormer_base directly (would it be possible to provide the training script for pcba)? And if I want to re-pretrain pcqm4mv1 for HIV, is it OK to just add --pre-layernorm to pcqv1.sh? Thanks again.
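For what it's worth, the scripts above derive the schedule purely from n_gpu, batch_size, and epoch, so one possible way to move to more GPUs is to keep the effective batch size (n_gpu * batch_size) constant, which also keeps the total number of updates unchanged. This is only a sketch based on the formulas used earlier in this thread (assuming the ~33,000-graph constant and an effective batch size of 256, as in the settings above), not an official recommendation:
n_gpu=8                                         # hypothetical: scale from 4 to 8 GPUs
epoch=8
batch_size=$((256 / n_gpu))                     # keep effective batch size n_gpu*batch_size at 256
tot_updates=$((33000*epoch/batch_size/n_gpu))   # = 1031, same as the 4-GPU setting, since the effective batch size is unchanged
warmup_updates=$((tot_updates / 10))            # 10% warmup; use tot_updates*6/100 for the 0.06 ratio instead
echo "batch_size=$batch_size tot_updates=$tot_updates warmup_updates=$warmup_updates"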
Thanks for the code. Good job!
At present, I am trying to do some work based on Graphormer, and I tried to reproduce the result on ogbg-molhiv but met some problems. I train the model on 2 x RTX 3090 (24G), CUDA_VERSION: 11.1, and the PyTorch version is the same as in the GitHub project.
I train the model using this script,
and evaluate the model using:
I have used seeds 1-5 so far. The results are:
{'epoch-best': {'val': {'auc': 0.7915973390450101}, 'test': {'auc': 0.7689351341951192}}} (seed 1)
{'epoch-best': {'val': {'auc': 0.7967697158563377}, 'test': {'auc': 0.7800533302417252}}} (seed 2)
{'epoch-best': {'val': {'auc': 0.7556909933843831}, 'test': {'auc': 0.7775153131219446}}} (seed 3)
{'epoch-best': {'val': {'auc': 0.7953004299078593}, 'test': {'auc': 0.799790350317856}}} (seed 4)
{'epoch-best': {'val': {'auc': 0.7998829473968052}, 'test': {'auc': 0.7942418796977954}}} (seed 5)
And the results with the pretrained model pcqm4mv2_graphormer_base are also not promising. Hmm, I don't know what is happening. Looking forward to your reply.