marius-team / marius

Large scale graph learning on a single machine.
https://marius-project.org
Apache License 2.0

Low evaluation accuracy on the ogbn-arxiv example in the doc #153

Open shengzeang opened 8 months ago

shengzeang commented 8 months ago

Describe the bug

After successfully building Marius, I tried to reproduce the [ogbn-arxiv example](https://marius-project.org/marius/examples/config/nc_ogbn_arxiv.html) from the docs, following the instructions. Neither marius_preprocess nor marius_train reported any errors. However, the evaluation accuracy is low compared to the results in the docs, and it barely improves during training: validation accuracy fluctuates around 53% and test accuracy around 50%. The training configuration file is identical to the one given in the docs example.

My training output at epochs 1, 5, and 10:

[01/03/24 07:10:34.239] ################ Starting training epoch 1 ################
[01/03/24 07:10:36.136] Nodes processed: [10000/90941], 11.00%
[01/03/24 07:10:36.683] Nodes processed: [20000/90941], 21.99%
[01/03/24 07:10:38.389] Nodes processed: [30000/90941], 32.99%
[01/03/24 07:10:39.993] Nodes processed: [40000/90941], 43.98%
[01/03/24 07:10:40.599] Nodes processed: [50000/90941], 54.98%
[01/03/24 07:10:41.634] Nodes processed: [60000/90941], 65.98%
[01/03/24 07:10:42.896] Nodes processed: [70000/90941], 76.97%
[01/03/24 07:10:43.368] Nodes processed: [80000/90941], 87.97%
[01/03/24 07:10:44.518] Nodes processed: [90000/90941], 98.97%
[01/03/24 07:10:44.680] Nodes processed: [90941/90941], 100.00%
[01/03/24 07:10:44.680] ################ Finished training epoch 1 ################
[01/03/24 07:10:44.680] Epoch Runtime: 10441ms
[01/03/24 07:10:44.680] Nodes per Second: 8709.989
[01/03/24 07:10:44.680] Evaluating validation set
[01/03/24 07:10:46.114] 
=================================
Node Classification: 29799 nodes evaluated
Accuracy: 49.934562%
=================================
[01/03/24 07:10:46.114] Evaluating test set
[01/03/24 07:10:50.094] 
=================================
Node Classification: 48603 nodes evaluated
Accuracy: 47.850956%
[01/03/24 07:11:33.628] ################ Starting training epoch 5 ################
[01/03/24 07:11:34.587] Nodes processed: [10000/90941], 11.00%
[01/03/24 07:11:36.210] Nodes processed: [20000/90941], 21.99%
[01/03/24 07:11:37.551] Nodes processed: [30000/90941], 32.99%
[01/03/24 07:11:38.041] Nodes processed: [40000/90941], 43.98%
[01/03/24 07:11:38.623] Nodes processed: [50000/90941], 54.98%
[01/03/24 07:11:39.214] Nodes processed: [60000/90941], 65.98%
[01/03/24 07:11:39.721] Nodes processed: [70000/90941], 76.97%
[01/03/24 07:11:40.329] Nodes processed: [80000/90941], 87.97%
[01/03/24 07:11:40.892] Nodes processed: [90000/90941], 98.97%
[01/03/24 07:11:40.986] Nodes processed: [90941/90941], 100.00%
[01/03/24 07:11:40.986] ################ Finished training epoch 5 ################
[01/03/24 07:11:40.986] Epoch Runtime: 7357ms
[01/03/24 07:11:40.986] Nodes per Second: 12361.153
[01/03/24 07:11:40.986] Evaluating validation set
[01/03/24 07:11:42.016] 
=================================
Node Classification: 29799 nodes evaluated
Accuracy: 53.384342%
=================================
[01/03/24 07:11:42.016] Evaluating test set
[01/03/24 07:11:43.721] 
=================================
Node Classification: 48603 nodes evaluated
Accuracy: 50.550378%
[01/03/24 07:12:15.394] ################ Starting training epoch 10 ################
[01/03/24 07:12:15.930] Nodes processed: [10000/90941], 11.00%
[01/03/24 07:12:16.525] Nodes processed: [20000/90941], 21.99%
[01/03/24 07:12:17.155] Nodes processed: [30000/90941], 32.99%
[01/03/24 07:12:17.642] Nodes processed: [40000/90941], 43.98%
[01/03/24 07:12:18.262] Nodes processed: [50000/90941], 54.98%
[01/03/24 07:12:18.858] Nodes processed: [60000/90941], 65.98%
[01/03/24 07:12:19.383] Nodes processed: [70000/90941], 76.97%
[01/03/24 07:12:19.978] Nodes processed: [80000/90941], 87.97%
[01/03/24 07:12:20.599] Nodes processed: [90000/90941], 98.97%
[01/03/24 07:12:20.644] Nodes processed: [90941/90941], 100.00%
[01/03/24 07:12:20.644] ################ Finished training epoch 10 ################
[01/03/24 07:12:20.644] Epoch Runtime: 5249ms
[01/03/24 07:12:20.644] Nodes per Second: 17325.395
[01/03/24 07:12:20.644] Evaluating validation set
[01/03/24 07:12:21.676] 
=================================
Node Classification: 29799 nodes evaluated
Accuracy: 52.917883%
=================================
[01/03/24 07:12:21.676] Evaluating test set
[01/03/24 07:12:23.386] 
=================================
Node Classification: 48603 nodes evaluated
Accuracy: 50.303479%
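
To make the trend in the output above easier to see, a small script along these lines could pull the per-epoch accuracies out of the log. This is only a sketch: it assumes the log format shown above, `summarize_log` is a hypothetical helper rather than anything shipped with Marius, and the sample text is a trimmed copy of the epoch 1 and epoch 10 lines.

```python
import re

# Patterns matching the log lines shown in this report.
EPOCH_RE = re.compile(r"Finished training epoch (\d+)")
SPLIT_RE = re.compile(r"Evaluating (validation|test) set")
ACC_RE = re.compile(r"Accuracy: ([\d.]+)%")


def summarize_log(text):
    """Return {epoch: {"validation": acc, "test": acc}} parsed from log text.

    Hypothetical helper for illustration; not part of Marius.
    """
    results = {}
    epoch = None
    split = None
    for line in text.splitlines():
        m = EPOCH_RE.search(line)
        if m:
            epoch = int(m.group(1))
            results[epoch] = {}
            split = None
            continue
        m = SPLIT_RE.search(line)
        if m:
            split = m.group(1)
            continue
        m = ACC_RE.search(line)
        if m and epoch is not None and split is not None:
            results[epoch][split] = float(m.group(1))
            split = None  # wait for the next "Evaluating ... set" line
    return results


# Trimmed excerpt of the log above (epochs 1 and 10 only).
sample = """\
[01/03/24 07:10:44.680] ################ Finished training epoch 1 ################
[01/03/24 07:10:44.680] Evaluating validation set
Accuracy: 49.934562%
[01/03/24 07:10:46.114] Evaluating test set
Accuracy: 47.850956%
[01/03/24 07:12:20.644] ################ Finished training epoch 10 ################
[01/03/24 07:12:20.644] Evaluating validation set
Accuracy: 52.917883%
[01/03/24 07:12:21.676] Evaluating test set
Accuracy: 50.303479%
"""

summary = summarize_log(sample)
for epoch, accs in sorted(summary.items()):
    print(epoch, accs)
```

On this excerpt, validation accuracy moves from 49.93% at epoch 1 to 52.92% at epoch 10, a gain of roughly 3 points over ten epochs.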

I wonder what could have gone wrong. The build succeeded, and no errors were reported during preprocessing or training. My environment is listed below.

I'd be glad for any help! Thank you!

Environment

Results of running marius_env_info:

cmake:
  version: 3.28.1
cpu_info:
  num_cpus: 96
  total_memory: 375GB
cuda:
  version: '11.7'
gpu_info:
  - memory: 32GB
    name: Tesla V100-SXM2-32GB
  - memory: 32GB
    name: Tesla V100-SXM2-32GB
  - memory: 32GB
    name: Tesla V100-SXM2-32GB
  - memory: 32GB
    name: Tesla V100-SXM2-32GB
  - memory: 32GB
    name: Tesla V100-SXM2-32GB
  - memory: 32GB
    name: Tesla V100-SXM2-32GB
  - memory: 32GB
    name: Tesla V100-SXM2-32GB
  - memory: 32GB
    name: Tesla V100-SXM2-32GB
marius:
  bindings_installed: true
  install_path: /usr/local/lib/python3.8/dist-packages/marius
  version: 0.0.2
openmp:
  version: '201511'
operating_system:
  platform: Linux-3.10.107-1-tlinux2-0054-x86_64-with-glibc2.29
pybind:
  PYBIND11_BUILD_ABI: _cxxabi1011
  PYBIND11_COMPILER_TYPE: _gcc
  PYBIND11_STDLIB: _libstdcpp
python:
  deps:
    numpy_version: 1.24.4
    omegaconf_version: 2.3.0
    pandas_version: 2.0.3
    pip_version: 20.0.2
    pyspark_version: N/A
    pytest_version: N/A
    torch_version: !!python/object/new:torch.torch_version.TorchVersion
      - 2.0.1+cu117
    tox_version: N/A
  version: "3.8.10 (default, Nov 22 2023, 10:22:35) \n[GCC 9.4.0]"
pytorch:
  install_path: /usr/local/lib/python3.8/dist-packages/torch
  version: !!python/object/new:torch.torch_version.TorchVersion
    - 2.0.1+cu117