can not reproduce PCQM4M-LSC results

chaoyan1037 commented 2 years ago

Hi there,

I tried to reproduce the results of Graphormer_SMALL on the PCQM4M-LSC dataset using the v1.0 branch.

[ -z "${exp_name}" ] && exp_name="pcq"
[ -z "${seed}" ] && seed="1"
[ -z "${arch}" ] && arch="--ffn_dim 512 --hidden_dim 512 --weight_decay 0.0 --intput_dropout_rate 0.0 --dropout_rate 0.1 --n_layers 6 --peak_lr 3e-4 --edge_type multi_hop --multi_hop_max_dist 5"
[ -z "${batch_size}" ] && batch_size="256"

echo -e "\n\n"
echo "=====================================ARGS======================================"
echo "arg0: $0"
echo "exp_name: ${exp_name}"
echo "arch: ${arch}"
echo "seed: ${seed}"
echo "batch_size: ${batch_size}"
echo "==============================================================================="

default_root_dir="../../exps/pcq/$exp_name/$seed"
mkdir -p $default_root_dir
n_gpu=$(nvidia-smi -L | wc -l)

python ../../graphormer/entry.py --num_workers 8 --seed $seed --batch_size $batch_size \
      --dataset_name PCQM4M-LSC \
      --gpus $n_gpu --accelerator ddp --precision 16 --gradient_clip_val 5.0 \
      $arch \
      --default_root_dir $default_root_dir

My machine also has 8 GPUs. In this case, is the effective batch size 256 * 8 = 2048? After training, the TensorBoard curves show a valid_mae of about 0.135, whereas the paper reports 0.1264 for Graphormer_SMALL. Do you have any idea what causes this performance gap? The suggested batch size is 1024, so should I change the per-GPU batch size to 128 (1024 / 8) on an 8-GPU machine?

Did you run the baseline methods in Table 1 yourselves to report their performance?

I would greatly appreciate it if you could answer these questions!
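As a rough check of the batch-size arithmetic in the question above: assuming `--batch_size` is applied per process (the usual behaviour under PyTorch Lightning's `ddp` accelerator, not verified here against the v1.0 code), the global batch is the per-GPU value multiplied by the number of GPUs. The helper names below are hypothetical and only illustrate the calculation.

```python
def effective_batch_size(per_gpu_batch: int, n_gpu: int, n_nodes: int = 1) -> int:
    """Samples per optimizer step when every DDP process loads its own batch.

    Assumption: the batch size flag is per process, not global.
    """
    return per_gpu_batch * n_gpu * n_nodes


def per_gpu_batch_size(target_global_batch: int, n_gpu: int, n_nodes: int = 1) -> int:
    """Per-process batch size needed to reach a target global batch size."""
    world_size = n_gpu * n_nodes
    if target_global_batch % world_size != 0:
        raise ValueError(
            f"global batch {target_global_batch} is not divisible by "
            f"world size {world_size}"
        )
    return target_global_batch // world_size


if __name__ == "__main__":
    # The script above: batch_size=256 on an 8-GPU machine.
    print(effective_batch_size(256, 8))
    # Per-GPU value needed for a global batch of 1024 on 8 GPUs.
    print(per_gpu_batch_size(1024, 8))
```

Under that assumption, the posted script trains with a global batch of 2048, and `batch_size=128` would give 1024 on 8 GPUs.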

skye95git commented 2 years ago

@chaoyan1037 Hi, how did you reproduce the results of Graphormer_SMALL on the PCQM4M-LSC dataset using the v1.0 branch? I created the environment following https://github.com/microsoft/Graphormer/blob/v1.0/README.md:

# create a new environment
conda create --name graphormer python=3.7
conda activate graphormer
# install requirements
pip install rdkit-pypi cython
pip install ogb==1.3.1 pytorch-lightning==1.3.0
pip install torch==1.7.1+cu110 torchvision==0.8.2+cu110 -f https://download.pytorch.org/whl/torch_stable.html
pip install torch-geometric==1.6.3 ogb==1.3.1 pytorch-lightning==1.3.1 tqdm torch-sparse==0.6.9 torch-scatter==2.0.6 -f https://pytorch-geometric.com/whl/torch-1.7.0+cu110.html

Then I run the command:

[ -z "${exp_name}" ] && exp_name="pcq"
[ -z "${seed}" ] && seed="1"
[ -z "${arch}" ] && arch="--ffn_dim 768 --hidden_dim 768 --dropout_rate 0.1 --n_layers 12 --peak_lr 2e-4 --edge_type multi_hop --multi_hop_max_dist 5"
[ -z "${batch_size}" ] && batch_size="256"

echo -e "\n\n"
echo "=====================================ARGS======================================"
echo "arg0: $0"
echo "exp_name: ${exp_name}"
echo "arch: ${arch}"
echo "seed: ${seed}"
echo "batch_size: ${batch_size}"
echo "==============================================================================="

default_root_dir="../../exps/pcq/$exp_name/$seed"
mkdir -p $default_root_dir
n_gpu=$(nvidia-smi -L | wc -l)

python ../../graphormer/entry.py --num_workers 8 --seed $seed --batch_size $batch_size \
      --dataset_name PCQM4M-LSC \
      --gpus $n_gpu --accelerator ddp --precision 16 --gradient_clip_val 5.0 \
      $arch \
      --default_root_dir $default_root_dir

There is an error:

Traceback (most recent call last):
  File "../../graphormer/entry.py", line 4, in <module>
    from model import Graphormer
  File "/home/linjiayi/Graphormer/graphormer/model.py", line 4, in <module>
    from data import get_dataset
  File "/home/linjiayi/Graphormer/graphormer/data.py", line 5, in <module>
    from wrapper import MyGraphPropPredDataset, MyPygPCQM4MDataset, MyZINCDataset
  File "/home/linjiayi/Graphormer/graphormer/wrapper.py", line 6, in <module>
    import torch_geometric.datasets
  File "/home/linjiayi/anaconda3/envs/graphormer_v1/lib/python3.7/site-packages/torch_geometric/__init__.py", line 2, in <module>
    import torch_geometric.nn
  File "/home/linjiayi/anaconda3/envs/graphormer_v1/lib/python3.7/site-packages/torch_geometric/nn/__init__.py", line 2, in <module>
    from .data_parallel import DataParallel
  File "/home/linjiayi/anaconda3/envs/graphormer_v1/lib/python3.7/site-packages/torch_geometric/nn/data_parallel.py", line 5, in <module>
    from torch_geometric.data import Batch
  File "/home/linjiayi/anaconda3/envs/graphormer_v1/lib/python3.7/site-packages/torch_geometric/data/__init__.py", line 1, in <module>
    from .data import Data
  File "/home/linjiayi/anaconda3/envs/graphormer_v1/lib/python3.7/site-packages/torch_geometric/data/data.py", line 8, in <module>
    from torch_sparse import coalesce, SparseTensor
  File "/home/linjiayi/anaconda3/envs/graphormer_v1/lib/python3.7/site-packages/torch_sparse/__init__.py", line 15, in <module>
    f'{library}_{suffix}', [osp.dirname(__file__)]).origin)
  File "/home/linjiayi/anaconda3/envs/graphormer_v1/lib/python3.7/site-packages/torch/_ops.py", line 105, in load_library
    ctypes.CDLL(path)
  File "/home/linjiayi/anaconda3/envs/graphormer_v1/lib/python3.7/ctypes/__init__.py", line 364, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libcusparse.so.11: cannot open shared object file: No such file or directory

How can I solve it?
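One way to narrow down an error like the one above is to check which shared library the loader failed to open and whether it is visible at all. The sketch below uses only the standard library; the function names are hypothetical, and the reading that a `.so.11` suffix points at a CUDA 11 runtime library is an assumption, not something the traceback states.

```python
import ctypes
import re
from typing import Optional, Tuple


def missing_library(error_message: str) -> Optional[Tuple[str, int]]:
    """Extract (soname, major version) from a 'cannot open shared object' message."""
    match = re.search(r"(lib[\w\-+]+\.so)\.(\d+)", error_message)
    if match is None:
        return None
    return f"{match.group(1)}.{match.group(2)}", int(match.group(2))


def can_load(soname: str) -> bool:
    """Return True if the dynamic loader can find and open the shared library."""
    try:
        ctypes.CDLL(soname)
    except OSError:
        return False
    return True


if __name__ == "__main__":
    msg = (
        "OSError: libcusparse.so.11: cannot open shared object file: "
        "No such file or directory"
    )
    found = missing_library(msg)
    print(found)
    if found is not None:
        # False here means the loader cannot see the library on this machine.
        print(can_load(found[0]))
```

If `can_load` returns `False`, it may help to compare the CUDA version the installed PyTorch build targets with the one the `torch-sparse` wheel was built for.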

chaoyan1037 commented 2 years ago

Hi @skye95git,

I do not have this problem. I guess it is related to your CUDA installation. You may find a similar issue here.

skye95git commented 2 years ago

Thanks for your reply! I have solved it. The cause was a torch-geometric installation error.

Vermouth1103 commented 2 years ago

I have run into the same problem. Can you tell me how you solved it?

skye95git commented 2 years ago

I reinstalled torch-geometric.

conda install pytorch==1.9.0 torchvision==0.10.0 torchaudio==0.9.0 cudatoolkit=10.2 -c pytorch

wget https://data.pyg.org/whl/torch-1.9.0%2Bcu102/torch_cluster-1.5.9-cp39-cp39-linux_x86_64.whl
wget https://data.pyg.org/whl/torch-1.9.0%2Bcu102/torch_scatter-2.0.7-cp39-cp39-linux_x86_64.whl
wget https://data.pyg.org/whl/torch-1.9.0%2Bcu102/torch_sparse-0.6.11-cp39-cp39-linux_x86_64.whl
wget https://data.pyg.org/whl/torch-1.9.0%2Bcu102/torch_spline_conv-1.2.1-cp39-cp39-linux_x86_64.whl

pip install torch_cluster-1.5.9-cp39-cp39-linux_x86_64.whl
pip install torch_scatter-2.0.7-cp39-cp39-linux_x86_64.whl
pip install torch_sparse-0.6.11-cp39-cp39-linux_x86_64.whl
pip install torch_spline_conv-1.2.1-cp39-cp39-linux_x86_64.whl
pip install torch-geometric

Update: follow the official guide to install torch-geometric: https://pytorch-geometric.readthedocs.io/en/latest/notes/installation.html
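The wheel URLs in the commands above encode the PyTorch version, the CUDA tag, and the Python tag (`cp39`), and all three need to match the active environment. The sketch below rebuilds such a URL from its parts and checks the Python tag against an interpreter version. The function names are hypothetical, and the filename pattern is inferred only from the URLs in this comment, so it may not hold for other releases or platforms.

```python
import sys
from urllib.parse import quote


def pyg_wheel_url(package: str, version: str, torch_version: str,
                  cuda_tag: str, py_tag: str) -> str:
    """Build a wheel URL in the layout used by the commands above.

    Assumption: the '<py_tag>-<py_tag>-linux_x86_64' filename pattern,
    copied from the URLs in this thread.
    """
    # '+' must be percent-encoded as %2B in the directory name.
    torch_dir = quote(f"torch-{torch_version}+{cuda_tag}", safe="")
    filename = f"{package}-{version}-{py_tag}-{py_tag}-linux_x86_64.whl"
    return f"https://data.pyg.org/whl/{torch_dir}/{filename}"


def interpreter_matches(py_tag: str, version_info=sys.version_info) -> bool:
    """Check that a wheel's Python tag (e.g. 'cp39') fits the given interpreter."""
    return py_tag == f"cp{version_info[0]}{version_info[1]}"


if __name__ == "__main__":
    print(pyg_wheel_url("torch_sparse", "0.6.11", "1.9.0", "cu102", "cp39"))
    print(interpreter_matches("cp39"))
```

Note that the environment created earlier in the thread used `python=3.7`, while these wheels carry the `cp39` tag, so the interpreter check is worth running before downloading.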

Vermouth1103 commented 2 years ago

Thanks a lot! I've run the example.

zhengsx commented 2 years ago

Thanks for using Graphormer, and sorry for the late response. Is the issue still there? If so, please try the prepared scripts for training on PCQ with V2, which should make the expected results easy to reproduce. link

chaoyan1037 commented 2 years ago

Thanks for your response! Do you mean the V2 dataset or the V2 code?

I still cannot reproduce the results with the V1 code. I find the V1 code easier to understand. Is there a script for the V1 code?

zhengsx commented 2 years ago

Yes, the script for the V1 code can be found here: https://github.com/microsoft/Graphormer/tree/v1.0/examples/ogb-lsc#example-usage, but the V2 code is easier to use. Please feel free to give it a try.

chaoyan1037 commented 2 years ago

Thank you very much for your quick response!

I used this script with the V1 code but cannot reproduce the results. That is exactly my original question.

zhengsx commented 2 years ago

That is the script for Graphormer_Base. Could you try running the Base model? If the problem still exists, we will look into it. In the meantime, we highly recommend the V2 code and its scripts, which offer higher performance, faster training, and a more robust training process.
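To see how the two configurations differ, the two `arch` strings posted earlier in this thread can be compared flag by flag. This is only a sketch over the strings as they appear in the comments (including the `--intput_dropout_rate` spelling as posted); the function names are hypothetical, and the parser assumes every flag is followed by exactly one value.

```python
import shlex

# The arch string from the opening comment (6 layers, hidden size 512).
SMALL_ARCH = (
    "--ffn_dim 512 --hidden_dim 512 --weight_decay 0.0 "
    "--intput_dropout_rate 0.0 --dropout_rate 0.1 --n_layers 6 "
    "--peak_lr 3e-4 --edge_type multi_hop --multi_hop_max_dist 5"
)
# The arch string from the second script (12 layers, hidden size 768).
BASE_ARCH = (
    "--ffn_dim 768 --hidden_dim 768 --dropout_rate 0.1 --n_layers 12 "
    "--peak_lr 2e-4 --edge_type multi_hop --multi_hop_max_dist 5"
)


def parse_arch(arch: str) -> dict:
    """Turn '--flag value --flag value ...' into {'flag': 'value'}."""
    tokens = shlex.split(arch)
    if len(tokens) % 2 != 0:
        raise ValueError("expected every flag to be followed by one value")
    return {tokens[i].lstrip("-"): tokens[i + 1] for i in range(0, len(tokens), 2)}


def diff_arch(first: dict, second: dict) -> dict:
    """Map each differing flag to (value in first, value in second)."""
    keys = sorted(set(first) | set(second))
    return {k: (first.get(k), second.get(k)) for k in keys
            if first.get(k) != second.get(k)}


if __name__ == "__main__":
    for flag, values in diff_arch(parse_arch(SMALL_ARCH), parse_arch(BASE_ARCH)).items():
        print(flag, values)
```

A flag that appears in only one string shows `None` on the other side, meaning that run falls back to whatever default the training script defines.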

chaoyan1037 commented 2 years ago

Yes, I am trying to train the Graphormer on PCQM4M-LSC from scratch. Thanks for the reminder! Since I have adapted the V1 for my own research, I prefer to use V1. Later may switch to V2 for these benefits.

microsoft / Graphormer

can not reproduce PCQM4M-LSC results #103