microsoft / Graphormer

Graphormer is a general-purpose deep learning backbone for molecular modeling.
MIT License
No response after running the example scripts #100

Closed skye95git closed 2 years ago

skye95git commented 2 years ago

I installed Graphormer following the guide: 1) To create and activate a conda environment with Python 3.9

conda create -n graphormer python=3.9
conda activate graphormer

2) Run the following commands

git clone --recursive https://github.com/microsoft/Graphormer.git
cd Graphormer
bash install.sh

3) To train a Graphormer-slim on ZINC-500K on a single GPU card

cd examples/property_prediction/
bash zinc.sh

I ran this command for half an hour and got no response. Occasionally it does respond, but then the following error is reported:

2022-03-17 17:00:38 | WARNING | root | The OGB package is out of date. Your version is 1.3.2, while the latest version is 1.3.3.
Traceback (most recent call last):
  File "/home/linjiayi/anaconda3/envs/graphormer/bin/fairseq-train", line 8, in <module>
    sys.exit(cli_main())
  File "/home/linjiayi/anaconda3/envs/graphormer/lib/python3.9/site-packages/fairseq_cli/train.py", line 512, in cli_main
    parser = options.get_training_parser()
  File "/home/linjiayi/anaconda3/envs/graphormer/lib/python3.9/site-packages/fairseq/options.py", line 38, in get_training_parser
    parser = get_parser("Trainer", default_task)
  File "/home/linjiayi/anaconda3/envs/graphormer/lib/python3.9/site-packages/fairseq/options.py", line 234, in get_parser
    utils.import_user_module(usr_args)
  File "/home/linjiayi/anaconda3/envs/graphormer/lib/python3.9/site-packages/fairseq/utils.py", line 497, in import_user_module
    import_tasks(tasks_path, f"{module_name}.tasks")
  File "/home/linjiayi/anaconda3/envs/graphormer/lib/python3.9/site-packages/fairseq/tasks/__init__.py", line 117, in import_tasks
    importlib.import_module(namespace + "." + task_name)
  File "/home/linjiayi/anaconda3/envs/graphormer/lib/python3.9/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 850, in exec_module
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "/home/linjiayi/Graphormer/graphormer/tasks/graph_prediction.py", line 25, in <module>
    from ..data.dataset import (
  File "/home/linjiayi/Graphormer/graphormer/data/dataset.py", line 9, in <module>
    from .wrapper import MyPygGraphPropPredDataset
  File "/home/linjiayi/Graphormer/graphormer/data/wrapper.py", line 6, in <module>
    from ogb.graphproppred import PygGraphPropPredDataset
  File "/home/linjiayi/anaconda3/envs/graphormer/lib/python3.9/site-packages/ogb/graphproppred/__init__.py", line 5, in <module>
    from .dataset_pyg import PygGraphPropPredDataset
  File "/home/linjiayi/anaconda3/envs/graphormer/lib/python3.9/site-packages/ogb/graphproppred/dataset_pyg.py", line 1, in <module>
    from torch_geometric.data import InMemoryDataset
  File "/home/linjiayi/anaconda3/envs/graphormer/lib/python3.9/site-packages/torch_geometric/__init__.py", line 5, in <module>
    import torch_geometric.data
  File "/home/linjiayi/anaconda3/envs/graphormer/lib/python3.9/site-packages/torch_geometric/data/__init__.py", line 1, in <module>
    from .data import Data
  File "/home/linjiayi/anaconda3/envs/graphormer/lib/python3.9/site-packages/torch_geometric/data/data.py", line 8, in <module>
    from torch_sparse import coalesce, SparseTensor
  File "/home/linjiayi/anaconda3/envs/graphormer/lib/python3.9/site-packages/torch_sparse/__init__.py", line 41, in <module>
    from .tensor import SparseTensor  # noqa
  File "/home/linjiayi/anaconda3/envs/graphormer/lib/python3.9/site-packages/torch_sparse/tensor.py", line 13, in <module>
    class SparseTensor(object):
  File "/home/linjiayi/anaconda3/envs/graphormer/lib/python3.9/site-packages/torch/jit/_script.py", line 1128, in script
    _compile_and_register_class(obj, _rcb, qualified_name)
  File "/home/linjiayi/anaconda3/envs/graphormer/lib/python3.9/site-packages/torch/jit/_script.py", line 138, in _compile_and_register_class
    script_class = torch._C._jit_script_class_compile(qualified_name, ast, defaults, rcb)
RuntimeError:
object has no attribute sparse_csr_tensor:
  File "/home/linjiayi/anaconda3/envs/graphormer/lib/python3.9/site-packages/torch_sparse/tensor.py", line 511
            value = torch.ones(self.nnz(), dtype=dtype, device=self.device())

        return torch.sparse_csr_tensor(rowptr, col, value, self.sizes())
               ~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE

I also tried to train Graphormer-base on the PCQM4M dataset on multiple GPU cards using bash pcqv1.sh, and got no response either. Is there a problem with the dataset download? How can I solve this?

skye95git commented 2 years ago

When I evaluate a pre-trained model using the script graphormer/evaluate/evaluate.py:

python evaluate.py \
    --user-dir ../../graphormer \
    --num-workers 16 \
    --ddp-backend=legacy_ddp \
    --dataset-name pcqm4m \
    --dataset-source ogb \
    --task graph_prediction \
    --criterion l1_loss \
    --arch graphormer_base \
    --num-classes 1 \
    --batch-size 64 \
    --pretrained-model-name pcqm4mv1_graphormer_base \
    --load-pretrained-model-output-layer \
    --split valid \
    --seed 1

There is no response. Is this a problem with downloading the pre-trained model? If so, why is no error reported?

skye95git commented 2 years ago

torch.sparse_csr_tensor seems to be available in PyTorch 1.11.0, but install.sh installs PyTorch 1.9.1. Should I update PyTorch to 1.11.0?
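
For anyone who wants to check this on their own machine, here is a minimal sketch that asks whether the installed torch exposes sparse_csr_tensor at the Python level. The helper name module_has_attribute is made up for this example and is not part of Graphormer or PyTorch. It is also an assumption that the Python-level attribute matches what the TorchScript compiler sees, so treat the result as a hint rather than proof.

```python
import importlib
import importlib.util


def module_has_attribute(module_name, attribute):
    """Return True/False if the module can be imported, or None if it is not installed.

    Hypothetical helper for illustration only.
    """
    if importlib.util.find_spec(module_name) is None:
        return None  # the module is not installed in this environment
    module = importlib.import_module(module_name)
    return hasattr(module, attribute)


if __name__ == "__main__":
    # torch_sparse/tensor.py calls torch.sparse_csr_tensor (see the traceback above),
    # so the installed torch needs to provide it.
    print("torch.sparse_csr_tensor available:",
          module_has_attribute("torch", "sparse_csr_tensor"))
```

If this prints False, the installed torch_sparse build was probably made for a newer PyTorch than the one in the environment.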

zhengsx commented 2 years ago

Hi @skye95git, thanks very much for using Graphormer. We have noticed your issue and plan to look into it. However, it takes time to reproduce every corner case reported across all issues, and we are not full-time maintainers, so please stay tuned if you cannot solve the problem yourself, and kindly stop raising your question in other issue threads. Thanks very much for your understanding.

skye95git commented 2 years ago

Sorry, I just wanted to ask someone with hands-on experience for a solution, because I have tried a lot of things and none of them worked.
Today I tried to run examples/benchmarking-gnns on the v1.0 branch:

[ -z "${exp_name}" ] && exp_name="zinc"
[ -z "${seed}" ] && seed="1"
[ -z "${arch}" ] && arch="--ffn_dim 80 --hidden_dim 80 --num_heads 8 --dropout_rate 0.1 --n_layers 12 --peak_lr 2e-4 --edge_type multi_hop --multi_hop_max_dist 20"
[ -z "${warmup_updates}" ] && warmup_updates="40000"
[ -z "${tot_updates}" ] && tot_updates="400000"

echo -e "\n\n"
echo "=====================================ARGS======================================"
echo "arg0: $0"
echo "arch: ${arch}"
echo "seed: ${seed}"
echo "exp_name: ${exp_name}"
echo "warmup_updates: ${warmup_updates}"
echo "tot_updates: ${tot_updates}"
echo "==============================================================================="

save_path="../../exps/zinc/$exp_name-$warmup_updates-$tot_updates/$seed"
mkdir -p $save_path

CUDA_VISIBLE_DEVICES=0 \
      python ../../graphormer/entry.py --num_workers 8 --seed $seed --batch_size 256 \
      --dataset_name ZINC \
      --gpus 1 --accelerator ddp --precision 16 \
      $arch \
      --check_val_every_n_epoch 10 --warmup_updates $warmup_updates --tot_updates $tot_updates \
      --default_root_dir $save_path

There is an error:

Downloading https://www.dropbox.com/s/feo9qle74kg48gy/molecules.zip?dl=1
Traceback (most recent call last):
  File "/home/linjiayi/anaconda3/envs/graphormer_v1/lib/python3.7/urllib/request.py", line 1350, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/home/linjiayi/anaconda3/envs/graphormer_v1/lib/python3.7/http/client.py", line 1281, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/linjiayi/anaconda3/envs/graphormer_v1/lib/python3.7/http/client.py", line 1327, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/linjiayi/anaconda3/envs/graphormer_v1/lib/python3.7/http/client.py", line 1276, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/home/linjiayi/anaconda3/envs/graphormer_v1/lib/python3.7/http/client.py", line 1036, in _send_output
    self.send(msg)
  File "/home/linjiayi/anaconda3/envs/graphormer_v1/lib/python3.7/http/client.py", line 976, in send
    self.connect()
  File "/home/linjiayi/anaconda3/envs/graphormer_v1/lib/python3.7/http/client.py", line 1443, in connect
    super().connect()
  File "/home/linjiayi/anaconda3/envs/graphormer_v1/lib/python3.7/http/client.py", line 948, in connect
    (self.host,self.port), self.timeout, self.source_address)
  File "/home/linjiayi/anaconda3/envs/graphormer_v1/lib/python3.7/socket.py", line 728, in create_connection
    raise err
  File "/home/linjiayi/anaconda3/envs/graphormer_v1/lib/python3.7/socket.py", line 716, in create_connection
    sock.connect(sa)
OSError: [Errno 101] Network is unreachable
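
The OSError above comes from the network layer, before any dataset code runs. A minimal sketch for testing reachability separately from the training script, using only the standard library; can_reach is a hypothetical helper name, and port 443 and the 5-second timeout are assumptions chosen for an HTTPS download.

```python
import socket


def can_reach(host, port=443, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "Network is unreachable", refused connections,
        # DNS lookup failures and timeouts.
        return False


# Example call (host taken from the download URL in the log above):
#     can_reach("www.dropbox.com")
```

If this returns False, the machine likely needs a proxy or a manually downloaded copy of the dataset; that is an environment problem rather than a Graphormer one.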

I want to use this virtual environment (Python 3.7, PyTorch 1.7.0, cudatoolkit 10.2, torch-geometric 2.0.4) to run examples/property_prediction/zinc.sh on the main branch. I ran:

cd fairseq
pip install . --use-feature=in-tree-build
python setup.py build_ext --inplace

Then I run:

bash zinc.sh

There is an error:

terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
zinc.sh: line 27: 3619596 Aborted                 (core dumped) CUDA_VISIBLE_DEVICES=1 fairseq-train --user-dir ../../graphormer --num-workers 16 --ddp-backend=legacy_ddp --dataset-name zinc --dataset-source pyg --task graph_prediction --criterion l1_loss --arch graphormer_slim --num-classes 1 --attention-dropout 0.1 --act-dropout 0.1 --dropout 0.0 --optimizer adam --adam-betas '(0.9, 0.999)' --adam-eps 1e-8 --clip-norm 5.0 --weight-decay 0.01 --lr-scheduler polynomial_decay --power 1 --warmup-updates 60000 --total-num-update 400000 --lr 2e-4 --end-learning-rate 1e-9 --batch-size 64 --fp16 --data-buffer-size 20 --encoder-layers 12 --encoder-embed-dim 80 --encoder-ffn-embed-dim 80 --encoder-attention-heads 8 --max-epoch 10000 --save-dir ./ckpts

It seems that running pip install . --use-feature=in-tree-build followed by python setup.py build_ext --inplace causes the Segmentation fault (core dumped) error.

skye95git commented 2 years ago

Sorry. I just want to provide you one more piece of information. Every time I run examples/property_prediction/zinc.sh there is no response. Then when I interrupt the task, the following message is displayed:

Traceback (most recent call last):
  File "../../graphormer/entry.py", line 4, in <module>
    from model import Graphormer
  File "/home/linjiayi/Graphormer/graphormer/model.py", line 4, in <module>
    from data import get_dataset
  File "/home/linjiayi/Graphormer/graphormer/data.py", line 5, in <module>
    from wrapper import MyGraphPropPredDataset, MyPygPCQM4MDataset, MyZINCDataset
  File "/home/linjiayi/Graphormer/graphormer/wrapper.py", line 7, in <module>
    from ogb.graphproppred import PygGraphPropPredDataset
  File "/home/linjiayi/anaconda3/envs/graphormer/lib/python3.7/site-packages/ogb/graphproppred/__init__.py", line 1, in <module>
    from .evaluate import Evaluator
  File "/home/linjiayi/anaconda3/envs/graphormer/lib/python3.7/site-packages/ogb/graphproppred/evaluate.py", line 1, in <module>
    from sklearn.metrics import roc_auc_score, average_precision_score
  File "/home/linjiayi/anaconda3/envs/graphormer/lib/python3.7/site-packages/sklearn/__init__.py", line 82, in <module>
    from .base import clone
  File "/home/linjiayi/anaconda3/envs/graphormer/lib/python3.7/site-packages/sklearn/base.py", line 17, in <module>
    from .utils import _IS_32BIT
  File "/home/linjiayi/anaconda3/envs/graphormer/lib/python3.7/site-packages/sklearn/utils/__init__.py", line 25, in <module>
    from . import _joblib
  File "/home/linjiayi/anaconda3/envs/graphormer/lib/python3.7/site-packages/sklearn/utils/_joblib.py", line 7, in <module>
    import joblib
  File "/home/linjiayi/anaconda3/envs/graphormer/lib/python3.7/site-packages/joblib/__init__.py", line 113, in <module>
    from .memory import Memory, MemorizedResult, register_store_backend
  File "/home/linjiayi/anaconda3/envs/graphormer/lib/python3.7/site-packages/joblib/memory.py", line 32, in <module>
    from ._store_backends import StoreBackendBase, FileSystemStoreBackend
  File "/home/linjiayi/anaconda3/envs/graphormer/lib/python3.7/site-packages/joblib/_store_backends.py", line 15, in <module>
    from .backports import concurrency_safe_rename
  File "/home/linjiayi/anaconda3/envs/graphormer/lib/python3.7/site-packages/joblib/backports.py", line 7, in <module>
    from distutils.version import LooseVersion
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 963, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 906, in _find_spec
  File "/home/linjiayi/anaconda3/envs/graphormer/lib/python3.7/site-packages/_distutils_hack/__init__.py", line 90, in find_spec
    return method()
  File "/home/linjiayi/anaconda3/envs/graphormer/lib/python3.7/site-packages/_distutils_hack/__init__.py", line 101, in spec_for_distutils
    mod = importlib.import_module('setuptools._distutils')
  File "/home/linjiayi/anaconda3/envs/graphormer/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/home/linjiayi/anaconda3/envs/graphormer/lib/python3.7/site-packages/setuptools/__init__.py", line 16, in <module>
    import setuptools.version
  File "/home/linjiayi/anaconda3/envs/graphormer/lib/python3.7/site-packages/setuptools/version.py", line 1, in <module>
    import pkg_resources
  File "<frozen importlib._bootstrap>", line 202, in _lock_unlock_module
  File "<frozen importlib._bootstrap>", line 98, in acquire
KeyboardInterrupt

skye95git commented 2 years ago

I have solved it, thanks. The fix was to run pip uninstall setuptools.
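
For anyone else hitting a silent hang: the traceback above only appeared after interrupting the run by hand. A minimal sketch, using the standard-library faulthandler module, that dumps the stack automatically if an import takes too long. import_with_watchdog is a hypothetical helper name and the 60-second timeout is an arbitrary assumption.

```python
import faulthandler
import importlib
import sys


def import_with_watchdog(module_name, timeout=60.0, file=None):
    """Import module_name; if it is still running after `timeout` seconds,
    write the stack of every thread to `file` (default: sys.stderr).

    Hypothetical helper for illustration only.
    """
    faulthandler.dump_traceback_later(timeout, file=file or sys.stderr)
    try:
        return importlib.import_module(module_name)
    finally:
        # Disarm the watchdog once the import finishes or raises.
        faulthandler.cancel_dump_traceback_later()


# Example call (module name taken from the traceback above):
#     import_with_watchdog("ogb.graphproppred")
```

The dump does not stop the import; it only shows which frame is blocked, for example a lock acquire inside importlib as in the log above.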

ayushnoori commented 2 years ago

Hi @skye95git, I have encountered the same behavior: there is no response at the console after running bash zinc.sh, and interrupting the script produces a similar File "<frozen importlib._bootstrap>", line 107, in acquire error message. However, running pip uninstall setuptools did not fix the issue; instead, bash zinc.sh now throws terminate called after throwing an instance of 'std::bad_alloc' (which is fixed by reinstalling setuptools). Could you please advise how you successfully ran the ZINC training example? Thanks.

ayushnoori commented 2 years ago

In case others have this issue, cc: @mswzeus from https://github.com/microsoft/Graphormer/issues/99. @zhengsx, I'd be happy to open a new issue or use Discussions if you'd prefer.

zhengsx commented 2 years ago

We have noticed that similar problems have occurred recently, and our guess is that some recent library upgrades cause them. We're working on it; in the meantime, you could try changing the CUDA/torch version. cu110 is recommended.

ayushnoori commented 2 years ago

Thanks @zhengsx, will try to downgrade PyTorch and will report back if it solves the problem. Please don't hesitate to let me know if there's anything else I can do to help find a fix!

ayushnoori commented 2 years ago

FYI, tried downgrading to torch.version.cuda==10.2, but this did not fix the issue (i.e., bash zinc.sh still hangs). Will try cu110.

skye95git commented 2 years ago

Hi @ayushnoori, the std::bad_alloc problem occurred when I was running v1.0. If you want to run bash zinc.sh on master, first make sure your environment (Python, CUDA, PyTorch, and PyTorch Geometric) is configured correctly:

python
>>> import torch
>>> torch.__version__
'1.11.0'
>>> torch.version.cuda
'11.3'
>>> torch.cuda.is_available()
True
python
from torch_geometric.data import DataLoader

If the above steps succeed, uninstall setuptools (pip uninstall setuptools) before running bash zinc.sh.

This works for me. Hope it works for you.
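
The manual checks above can also be collected in one place when reporting an environment in an issue. A minimal sketch, assuming Python 3.8 or newer (for importlib.metadata); the package list is a guess at what matters for this thread, and installed_versions is a hypothetical helper name. It reads installed package metadata only and does not import the packages, so it still works when an import hangs or crashes.

```python
import sys
from importlib import metadata

# Distribution names as published on PyPI; adjust if your install differs.
PACKAGES = ["torch", "torch-geometric", "torch-sparse", "torch-scatter", "ogb", "setuptools"]


def installed_versions(packages=PACKAGES):
    """Map each distribution name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    print("python", sys.version.split()[0])
    for name, version in installed_versions().items():
        print(name, version or "not installed")
```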

zhengsx commented 2 years ago

The bug seems to be caused by an ogb version upgrade. It should now be fixed by #114. Please give it a try. @skye95git @ayushnoori

doloresgarcia commented 1 year ago

I am having the same issue: 'Segmentation fault' when running bash zinc.sh. pip uninstall setuptools did not fix it. Any ideas?

nslai00 commented 1 year ago

Hi @doloresgarcia, I have encountered the same problem. Have you figured out how to solve it?