marius-team / marius

Large scale graph learning on a single machine.
https://marius-project.org
Apache License 2.0

Ondisk training: RuntimeError: start(0) + length (-512) exceeds dimension size (0) #135

Closed · lwwlwwl closed this issue 1 year ago

lwwlwwl commented 1 year ago

Hi, I am trying to reproduce the on-disk training results, but I am hitting RuntimeError: start(0) + length (-512) exceeds dimension size (0) for every dataset I tried (products, papers, mag240). This happens on the main branch (both the latest version and the one before it), the main_improvement branch, and the eurosys_2023_artifact branch.

The machine has an Nvidia V100 GPU, 126 GB of RAM, and 80 vCPUs, and runs Ubuntu 20.04.

Steps to reproduce (condensed into a command sketch after the list):

  1. Build the Docker image from the Dockerfile.
  2. git clone https://github.com/marius-team/marius.git
  3. cd marius
  4. pip3 install . (if using a different branch, run git checkout before pip install)
  5. Preprocess ogbn_products with marius_preprocess, specifying num_partitions.
  6. Define config.yaml (for main / eurosys_2023_artifact, adapted from ogbn_arxiv/marius_gs_disk_gpu.yaml).
  7. marius_train config.yaml
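
The steps above condense to the following commands (the marius_preprocess flags match those discussed later in this thread; the output directory datasets/ogbn_products/ is a placeholder, not taken from the original report):

    git clone https://github.com/marius-team/marius.git
    cd marius
    pip3 install .
    marius_preprocess --dataset ogbn_products --num_partitions 32 datasets/ogbn_products/
    marius_train config.yaml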

Error message for main:

root@023617de2605:~# marius_train products.yaml 
[2023-03-02 07:06:36.990] [info] [marius.cpp:41] Start initialization
[03/02/23 07:06:42.450] Initialization Complete: 5.46s
[03/02/23 07:06:42.452] Generating Sequential Ordering
[03/02/23 07:06:42.452] Num Train Partitions: 32
Traceback (most recent call last):
  File "/usr/local/bin/marius_train", line 11, in <module>
    load_entry_point('marius==0.0.2', 'console_scripts', 'marius_train')()
  File "/usr/local/lib/python3.6/dist-packages/marius/console_scripts/marius_train.py", line 22, in main
    m.manager.marius_train(config)
RuntimeError: start (0) + length (-12) exceeds dimension size (0).

For eurosys_2023_artifact:

[screenshot: the same "start + length exceeds dimension size" RuntimeError on the eurosys_2023_artifact branch]

The value of 'Num Train Partitions' appears to be the num_partitions passed in the configuration file, while by design I believe it should be much smaller.

I tested different combinations of num_partitions and buffer_capacity and found that num_partitions - buffer_capacity always equals the number shown in parentheses after 'length' in the error message.

rogerwaleffe commented 1 year ago

Thanks for the issue and the detailed post!

Generally, "start + length" errors mean that the expected shape of storage objects from the config file does not actually match the shape of the files on disk. This is usually indicative of a preprocessing issue. In this case, I'm guessing the issue is that you are trying to use the sequential node partition ordering but you have not passed --sequential_train_nodes to the preprocessing function.

I tested this hypothesis using your product_eurosys.yaml file and found that if I run marius_preprocess --dataset ogbn_products --num_partitions 32 <dir>, I get the same issue as above. If I instead run marius_preprocess --dataset ogbn_products --num_partitions 32 --sequential_train_nodes <dir>, things work as expected. The Num Train Partitions for products with 32 total partitions should be 3.
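
(As a sanity check, assuming ogbn-products' standard OGB split of 196,615 train nodes out of 2,449,029 total nodes, figures not stated in this thread: with 32 partitions, each partition holds ceil(2,449,029 / 32) = 76,533 nodes, and ceil(196,615 / 76,533) = 3, matching the expected Num Train Partitions above.)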

lwwlwwl commented 1 year ago

Thanks! Adding the additional flag solves the issue. What does this flag mean exactly? Does it indicate that the input nodes are labeled sequentially from 0 to num_nodes - 1?

rogerwaleffe commented 1 year ago

So the way the disk-based training works is that we split the nodes of the graph into partitions and then load subsets of these partitions into CPU memory for training. By default, when doing random partitioning, the labeled training nodes in the graph will be "dispersed" throughout the partitions. In this case, the dispersed disk-based training ensures that all partitions are brought into memory at least once each epoch so that all training nodes can be used for learning.

When the number of training nodes is small, however, it improves performance to simply cache the training nodes in CPU memory. To do this using the same partition abstraction as described above, we partition the graph such that the train nodes are "sequential" in the first X partitions. I.e., we relabel the nodes such that the training nodes have IDs from 0 to num_train_nodes - 1 and all other nodes have IDs greater than or equal to num_train_nodes. This lets us know exactly where the training nodes are in the partitions, so we can simply cache the first X partitions in memory during training.

So to answer your question: that additional flag does the extra relabeling/shuffling of the graph so that the training nodes are labeled from 0 to num_train_nodes - 1 and are therefore contiguous in the first X partitions, where X = ceil(num_train_nodes / num_nodes_per_partition).
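
As a toy illustration of the idea (this is just a sketch, not Marius's actual preprocessing code; the function name and signature are made up):

    import math

    def sequential_relabel(num_nodes, train_nodes, partition_size):
        # Assign IDs 0..num_train-1 to the train nodes, then number everything else.
        train = sorted(train_nodes)
        train_set = set(train)
        others = [n for n in range(num_nodes) if n not in train_set]
        new_id = {old: new for new, old in enumerate(train + others)}
        # Train nodes are now contiguous, so they fill exactly the first X partitions.
        num_train_partitions = math.ceil(len(train) / partition_size)
        return new_id, num_train_partitions

    new_id, x = sequential_relabel(num_nodes=10, train_nodes=[2, 5, 9], partition_size=4)
    print(new_id[2], new_id[5], new_id[9])  # 0 1 2: contiguous at the front
    print(x)                                # 1: all train nodes fit in the first partition

With this layout, caching the first X partitions in CPU memory guarantees the training nodes are always resident.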

I'll admit the "sequential" and "dispersed" language is a bit confusing. It's left over from the development of MariusGNN and could benefit from more descriptive names in the open source release.

Closing this issue with this comment.