microsoft / Graphormer

Graphormer is a general-purpose deep learning backbone for molecular modeling.
MIT License

Errors while running Graphormer V2 on MolPCBA #109

Closed ZhuYun97 closed 2 years ago

ZhuYun97 commented 2 years ago

Thanks for your fabulous code. When I use the command below to run Graphormer V2 on the MolPCBA dataset, some errors occur.

n_gpu=1
epoch=5
max_epoch=$((epoch + 1))
batch_size=128
# total updates ≈ (training-set size * epochs) / (batch size * number of GPUs)
tot_updates=$((33000*epoch/batch_size/n_gpu))
# linear warmup over the first 16% of updates
warmup_updates=$((tot_updates*16/100))

CUDA_VISIBLE_DEVICES=2 fairseq-train \
--user-dir ./graphormer \
--num-workers 16 \
--ddp-backend=legacy_ddp \
--dataset-name ogbg-molpcba \
--dataset-source ogb \
--task graph_prediction_with_flag \
--criterion binary_logloss_with_flag \
--arch graphormer_base \
--num-classes 1 \
--attention-dropout 0.1 --act-dropout 0.1 --dropout 0.0 \
--optimizer adam --adam-betas '(0.9, 0.999)' --adam-eps 1e-8 --clip-norm 5.0 --weight-decay 0.0 \
--lr-scheduler polynomial_decay --power 1 --warmup-updates $warmup_updates --total-num-update $tot_updates \
--lr 2e-4 --end-learning-rate 1e-5 \
--batch-size $batch_size \
--fp16 \
--data-buffer-size 20 \
--encoder-layers 12 \
--encoder-embed-dim 768 \
--encoder-ffn-embed-dim 768 \
--encoder-attention-heads 32 \
--max-epoch $max_epoch \
--save-dir ./ckpts_pcba \
--pretrained-model-name pcqm4mv2_graphormer_base \
--seed 1 \
--flag-m 3 \
--flag-step-size 0.01 \
--flag-mag 0 

The error shows that the mask shape is incorrect. I checked the shape of targets, which is [128, 128] (I expected [128, 1]), and the content of targets is also strange:

tensor([[nan, 0., 0.,  ..., nan, nan, nan],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [nan, nan, nan,  ..., nan, nan, nan],
        ...,
        [nan, 0., 0.,  ..., nan, nan, nan],
        [0., 0., 0.,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan]], device='cuda:0',
       dtype=torch.float16)
Traceback (most recent call last):                                                                                                          
  File "/home/XX/anaconda3/envs/graphormer/bin/fairseq-train", line 8, in <module>
    sys.exit(cli_main())
  File "/home/XX/anaconda3/envs/graphormer/lib/python3.9/site-packages/fairseq_cli/train.py", line 529, in cli_main
    distributed_utils.call_main(cfg, main)
  File "/home/XX/anaconda3/envs/graphormer/lib/python3.9/site-packages/fairseq/distributed/utils.py", line 369, in call_main
    main(cfg, **kwargs)
  File "/home/XX/anaconda3/envs/graphormer/lib/python3.9/site-packages/fairseq_cli/train.py", line 189, in main
    valid_losses, should_stop = train(cfg, trainer, task, epoch_itr)
  File "/home/XX/anaconda3/envs/graphormer/lib/python3.9/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/home/XX/anaconda3/envs/graphormer/lib/python3.9/site-packages/fairseq_cli/train.py", line 304, in train
    log_output = trainer.train_step(samples)
  File "/home/XX/anaconda3/envs/graphormer/lib/python3.9/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/home/XX/anaconda3/envs/graphormer/lib/python3.9/site-packages/fairseq/trainer.py", line 770, in train_step
    loss, sample_size_i, logging_output = self.task.train_step(
  File "/home/XX/Graphormer/graphormer/tasks/graph_prediction.py", line 328, in train_step
    loss, sample_size, logging_output = criterion(
  File "/home/XX/anaconda3/envs/graphormer/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/XX/Graphormer/graphormer/criterions/binary_logloss.py", line 103, in forward
    logits_flatten[mask].float(), targets_flatten[mask].float(), reduction="sum"
IndexError: The shape of the mask [16384] at index 0 does not match the shape of the indexed tensor [128] at index 0

Are there any mistakes in how I am using Graphormer V2? Could you help me figure them out? Thanks a lot.

ZhuYun97 commented 2 years ago

There are two main things to note. (1) --num-classes should be set to 128 for the ogbg-molpcba dataset, because this dataset has 128 binary prediction tasks and its labels can contain NaN, which indicates that the corresponding task is not annotated for that molecule. (2) If you use a model pre-trained on another dataset (e.g. PCQM4Mv1), the shape of the encoder.embed_out layer will not match while loading the pre-trained weights. In that situation, some extra steps are needed. Illustrative sketches of both points follow.
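
For point (1), a minimal self-contained sketch of how such a masked multi-task loss works (my own illustration, not the exact binary_logloss.py code): the labels have shape [batch, 128], NaN marks unannotated tasks, and both logits and targets are flattened before masking. With --num-classes 1 the logits flatten to only 128 elements while the mask built from the targets has 128 * 128 = 16384 entries, which is exactly the IndexError in the traceback above.

import torch
import torch.nn.functional as F

batch_size, num_tasks = 128, 128                 # ogbg-molpcba has 128 binary tasks

logits = torch.randn(batch_size, num_tasks)      # requires --num-classes 128
targets = torch.randint(0, 2, (batch_size, num_tasks)).float()
targets[torch.rand(batch_size, num_tasks) < 0.3] = float("nan")  # NaN = not annotated

logits_flatten = logits.reshape(-1)              # 16384 elements
targets_flatten = targets.reshape(-1)            # 16384 elements
mask = ~torch.isnan(targets_flatten)             # keep annotated entries only

# with --num-classes 1, logits_flatten would have only 128 elements and
# indexing it with this 16384-element mask raises the IndexError above
loss = F.binary_cross_entropy_with_logits(
    logits_flatten[mask], targets_flatten[mask], reduction="sum"
)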
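
For point (2), one common workaround (a sketch under my own assumptions, not Graphormer's actual checkpoint-loading code; TinyEncoder is a hypothetical stand-in for the real model) is to drop the shape-mismatched output head from the pre-trained state dict and load the remaining weights non-strictly, so encoder.embed_out is freshly initialized for the new number of classes:

import torch
import torch.nn as nn

# stand-in for the real Graphormer; only the output head differs in shape
class TinyEncoder(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.backbone = nn.Linear(768, 768)
        self.embed_out = nn.Linear(768, num_classes)

pretrained_state = TinyEncoder(num_classes=1).state_dict()   # e.g. a PCQM4Mv1 head
model = TinyEncoder(num_classes=128)                         # ogbg-molpcba head
model_state = model.state_dict()

# keep only tensors whose shapes still match; the mismatched head is dropped
filtered = {k: v for k, v in pretrained_state.items()
            if k in model_state and v.shape == model_state[k].shape}
missing, unexpected = model.load_state_dict(filtered, strict=False)
print("re-initialized:", missing)                            # the embed_out parameters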