marius-team / marius

Large scale graph learning on a single machine.
https://marius-project.org
Apache License 2.0

Inconsistent results with CPU and GPU configs on the dataset ogbl-ppa #82

Open VeritasYin opened 2 years ago

VeritasYin commented 2 years ago

Describe the bug
I got really weird results when evaluating on the dataset ogbl-ppa with the CPU config and with the GPU config, respectively. I had to change the memory setting to HostDevice for the GPU version due to its excessive GPU memory consumption (I thought the run would fit in 16 GB, but it eventually exceeded 24 GB).

To Reproduce
Steps to reproduce the behavior: run marius_train with the configs ogbl_ppa_cpu.ini and ogbl_ppa_gpu.ini. This produces the following results (the first log is the CPU run, the second the GPU run):

[2021-12-12 02:47:01.554] [info] [trainer.cpp:68] ################ Starting training epoch 3 ################
[2021-12-12 02:49:36.904] [info] [trainer.cpp:94] Total Edges Processed: 44586862, Percent Complete: 0.100
[2021-12-12 02:52:19.113] [info] [trainer.cpp:94] Total Edges Processed: 46709862, Percent Complete: 0.200
[2021-12-12 02:55:00.754] [info] [trainer.cpp:94] Total Edges Processed: 48832862, Percent Complete: 0.300
[2021-12-12 02:57:44.074] [info] [trainer.cpp:94] Total Edges Processed: 50955862, Percent Complete: 0.400
[2021-12-12 03:00:25.467] [info] [trainer.cpp:94] Total Edges Processed: 53078862, Percent Complete: 0.500
[2021-12-12 03:03:09.531] [info] [trainer.cpp:94] Total Edges Processed: 55201862, Percent Complete: 0.600
[2021-12-12 03:06:03.269] [info] [trainer.cpp:94] Total Edges Processed: 57324862, Percent Complete: 0.700
[2021-12-12 03:08:51.169] [info] [trainer.cpp:94] Total Edges Processed: 59447862, Percent Complete: 0.800
[2021-12-12 03:11:32.560] [info] [trainer.cpp:94] Total Edges Processed: 61570862, Percent Complete: 0.900
[2021-12-12 03:14:13.438] [info] [trainer.cpp:94] Total Edges Processed: 63693862, Percent Complete: 1.000
[2021-12-12 03:14:13.558] [info] [trainer.cpp:99] ################ Finished training epoch 3 ################
[2021-12-12 03:14:13.558] [info] [trainer.cpp:104] Epoch Runtime (Before shuffle/sync): 1632004ms
[2021-12-12 03:14:13.558] [info] [trainer.cpp:105] Edges per Second (Before shuffle/sync): 13009.73
[2021-12-12 03:14:14.870] [info] [dataset.cpp:761] Edges Shuffled
[2021-12-12 03:14:14.870] [info] [trainer.cpp:113] Epoch Runtime (Including shuffle/sync): 1633315ms
[2021-12-12 03:14:14.870] [info] [trainer.cpp:114] Edges per Second (Including shuffle/sync): 12999.288
[2021-12-12 03:14:37.284] [info] [evaluator.cpp:95] Num Eval Edges: 6062562
[2021-12-12 03:14:37.284] [info] [evaluator.cpp:96] Num Eval Batches: 0
[2021-12-12 03:14:37.284] [info] [evaluator.cpp:97] Auc: 0.508, Avg Ranks: 490.966, MRR: 0.008, Hits@1: 0.006, Hits@5: 0.007, Hits@10: 0.007, Hits@20: 0.008, Hits@50: 0.008, Hits@100: 0.009

[2021-12-13 01:53:58.848] [info] [trainer.cpp:68] ################ Starting training epoch 3 ################
[2021-12-13 01:54:03.413] [info] [trainer.cpp:94] Total Edges Processed: 44583862, Percent Complete: 0.100
[2021-12-13 01:54:07.270] [info] [trainer.cpp:94] Total Edges Processed: 46703862, Percent Complete: 0.200
[2021-12-13 01:54:11.005] [info] [trainer.cpp:94] Total Edges Processed: 48823862, Percent Complete: 0.299
[2021-12-13 01:54:15.259] [info] [trainer.cpp:94] Total Edges Processed: 50943862, Percent Complete: 0.399
[2021-12-13 01:54:19.315] [info] [trainer.cpp:94] Total Edges Processed: 53063862, Percent Complete: 0.499
[2021-12-13 01:54:23.355] [info] [trainer.cpp:94] Total Edges Processed: 55183862, Percent Complete: 0.599
[2021-12-13 01:54:27.633] [info] [trainer.cpp:94] Total Edges Processed: 57303862, Percent Complete: 0.699
[2021-12-13 01:54:31.465] [info] [trainer.cpp:94] Total Edges Processed: 59423862, Percent Complete: 0.798
[2021-12-13 01:54:35.505] [info] [trainer.cpp:94] Total Edges Processed: 61543862, Percent Complete: 0.898
[2021-12-13 01:54:39.482] [info] [trainer.cpp:94] Total Edges Processed: 63663862, Percent Complete: 0.998
[2021-12-13 01:54:39.547] [info] [trainer.cpp:99] ################ Finished training epoch 3 ################
[2021-12-13 01:54:39.547] [info] [trainer.cpp:104] Epoch Runtime (Before shuffle/sync): 40698ms
[2021-12-13 01:54:39.547] [info] [trainer.cpp:105] Edges per Second (Before shuffle/sync): 521694.72
[2021-12-13 01:54:40.847] [info] [dataset.cpp:761] Edges Shuffled
[2021-12-13 01:54:40.847] [info] [trainer.cpp:113] Epoch Runtime (Including shuffle/sync): 41998ms
[2021-12-13 01:54:40.847] [info] [trainer.cpp:114] Edges per Second (Including shuffle/sync): 505546.25
[2021-12-13 01:54:58.952] [info] [evaluator.cpp:95] Num Eval Edges: 6062562
[2021-12-13 01:54:58.952] [info] [evaluator.cpp:96] Num Eval Batches: 0
[2021-12-13 01:54:58.952] [info] [evaluator.cpp:97] Auc: 0.992, Avg Ranks: 2.925, MRR: 0.991, Hits@1: 0.990, Hits@5: 0.991, Hits@10: 0.991, Hits@20: 0.992, Hits@50: 0.993, Hits@100: 0.995

Environment
Python 3.7.10
PyTorch 1.7.1 (py3.7_cuda10.1.243_cudnn7.6.3_0)
gcc 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)
cmake 3.16.3
GNU Make 4.2.1

JasonMoho commented 2 years ago

Interesting observations.

1.) I am surprised to hear that GPU memory was exceeded for this dataset; it should easily fit in GPU memory given that the dataset only has 500,000 or so nodes. I've run datasets an order of magnitude larger on a single GPU, so this may indicate a memory leak somewhere. At what point in training did the system fail, and do you have a stack trace?

2.) The MRR for both configurations looks quite strange, the CPU one being suspiciously low and the GPU one suspiciously high. One difference between the two configurations is that the CPU config uses async training while the GPU config uses sync training, so my guess is that async training is preventing model convergence in the CPU case. You can turn on sync training with training.synchronous=true, as sketched below.
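A minimal sketch of that change in the .ini config, assuming the dotted name training.synchronous maps to a synchronous key under the [training] section:

[training]
synchronous=true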

3.) The GPU MRR is suspiciously high. This may be due to the evaluation configuration, which only samples 1000 negative nodes (500 uniformly and 500 by degree). You can try running filtered MRR (which will use all nodes to produce negatives) by changing the evaluation settings to:

[evaluation]
batch_size=1000
number_of_chunks=1
negative_sampling_access=All
evaluation_method=LinkPrediction
filtered_evaluation=true

If the Hits@100 is inconsistent with the leaderboard results for this dataset, that would indicate a bug somewhere and I can investigate further.

4.) The configuration for this dataset is not tuned for it specifically. These hyperparameters were chosen based on what worked well for the datasets in our paper (fb15k, livejournal, twitter, and freebase86m), so you will probably need to tune hyperparameters to get good model performance.
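As a rough illustration only (the parameter names below are assumptions based on the stock configs and the Marius config docs; check them against your installed version), the knobs usually worth sweeping are the embedding dimension, learning rate, and number of negatives per batch:

[model]
embedding_size=200

[training]
learning_rate=0.1
negatives=512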

VeritasYin commented 2 years ago

The OOM error is triggered during inference; I've attached the trace log below.

[info] [12/13/21 16:51:20.207] ################ Finished training epoch 1 ################
[info] [12/13/21 16:51:20.207] Epoch Runtime (Before shuffle/sync): 11955ms
[info] [12/13/21 16:51:20.207] Edges per Second (Before shuffle/sync): 1775987.6
[info] [12/13/21 16:51:20.209] Edges Shuffled
[info] [12/13/21 16:51:20.209] Epoch Runtime (Including shuffle/sync): 11956ms
[info] [12/13/21 16:51:20.209] Edges per Second (Including shuffle/sync): 1775839
Traceback (most recent call last):
  File "~/scratch/software/anaconda3/bin/marius_train", line 8, in <module>
    sys.exit(main())
  File "~/scratch/software/anaconda3/lib/python3.7/site-packages/marius/console_scripts/marius_train.py", line 8, in main
    m.marius_train(len(sys.argv), sys.argv)
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 23.65 GiB total capacity; 22.35 GiB already allocated; 17.44 MiB free; 22.75 GiB reserved in total by PyTorch)
Exception raised from malloc at /opt/conda/conda-bld/pytorch_1607370141920/work/c10/cuda/CUDACachingAllocator.cpp:272 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f5080f6d8b2 in ~/scratch/software/anaconda3/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: + 0x2024b (0x7f50798ee24b in ~/scratch/software/anaconda3/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: + 0x21064 (0x7f50798ef064 in ~/scratch/software/anaconda3/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #3: + 0x216ad (0x7f50798ef6ad in ~/scratch/software/anaconda3/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #4: THCStorage_resizeBytes(THCState, c10::StorageImpl, long) + 0x84 (0x7f5082153524 in ~/scratch/software/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #5: at::native::empty_strided_cuda(c10::ArrayRef, c10::ArrayRef, c10::TensorOptions const&) + 0x7f3 (0x7f508400dbf3 in ~/scratch/software/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #6: + 0xae35a5 (0x7f50820a95a5 in ~/scratch/software/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xb11a3c (0x7f50820d7a3c in ~/scratch/software/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #8: + 0xb0fb2b (0x7f50820d5b2b in ~/scratch/software/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #9: + 0xf9b985 (0x7f50b7b9e985 in ~/scratch/software/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #10: + 0xf7739b (0x7f50b7b7a39b in ~/scratch/software/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #11: + 0xf7cb6b (0x7f50b7b7fb6b in ~/scratch/software/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #12: + 0xf9b985 (0x7f50b7b9e985 in ~/scratch/software/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #13: at::empty_strided(c10::ArrayRef, c10::ArrayRef, c10::TensorOptions const&) + 0x22e (0x7f50b7c68b0e in ~/scratch/software/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #14: at::native::to(at::Tensor const&, c10::ScalarType, bool, bool, c10::optional) + 0x560 (0x7f50b78a8fb0 in ~/scratch/software/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #15: + 0x11d303a (0x7f50b7dd603a in ~/scratch/software/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #16: + 0x824901 (0x7f50b7427901 in ~/scratch/software/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #17: + 0x12bf686 (0x7f50b7ec2686 in ~/scratch/software/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #18: at::Tensor::to(c10::ScalarType, bool, bool, c10::optional) const + 0x100 (0x7f50b7ea6220 in ~/scratch/software/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #19: + 0xc185fc (0x7f50b781b5fc in ~/scratch/software/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #20: + 0xc18970 (0x7f50b781b970 in ~/scratch/software/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #21: at::native::sum_out(at::Tensor&, at::Tensor const&, c10::ArrayRef, bool, c10::optional) + 0x8f (0x7f50b781ba5f in ~/scratch/software/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #22: at::native::sum(at::Tensor const&, c10::ArrayRef, bool, c10::optional) + 0x4b (0x7f50b781c0bb in ~/scratch/software/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #23: + 0xacfd84 (0x7f5082095d84 in ~/scratch/software/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #24: + 0xb0a42e (0x7f50820d042e in ~/scratch/software/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #25: + 0x1162729 (0x7f50b7d65729 in ~/scratch/software/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #26: at::sum(at::Tensor const&, c10::ArrayRef, bool, c10::optional) + 0xf2 (0x7f50b7c7f8d2 in ~/scratch/software/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #27: + 0x25eb24b (0x7f50b91ee24b in ~/scratch/software/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #28: + 0x8246ce (0x7f50b74276ce in ~/scratch/software/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #29: + 0x1162729 (0x7f50b7d65729 in ~/scratch/software/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #30: at::Tensor::sum(c10::ArrayRef, bool, c10::optional) const + 0xf2 (0x7f50b7e99df2 in ~/scratch/software/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #31: + 0xde4bf (0x7f50bdb5e4bf in ~/scratch/software/anaconda3/lib/python3.7/site-packages/marius/_pymarius.cpython-37m-x86_64-linux-gnu.so)
frame #32: + 0xdb58d (0x7f50bdb5b58d in ~/scratch/software/anaconda3/lib/python3.7/site-packages/marius/_pymarius.cpython-37m-x86_64-linux-gnu.so)
frame #33: + 0xd9037 (0x7f50bdb59037 in ~/scratch/software/anaconda3/lib/python3.7/site-packages/marius/_pymarius.cpython-37m-x86_64-linux-gnu.so)
frame #34: + 0x8d59d (0x7f50bdb0d59d in ~/scratch/software/anaconda3/lib/python3.7/site-packages/marius/_pymarius.cpython-37m-x86_64-linux-gnu.so)
frame #35: + 0xb2673 (0x7f50bdb32673 in ~/scratch/software/anaconda3/lib/python3.7/site-packages/marius/_pymarius.cpython-37m-x86_64-linux-gnu.so)
frame #36: _PyMethodDef_RawFastCallKeywords + 0x264 (0x56024dee68b4 in ~/scratch/software/anaconda3/bin/python3)
frame #37: _PyCFunction_FastCallKeywords + 0x21 (0x56024dee69d1 in ~/scratch/software/anaconda3/bin/python3)
frame #38: _PyEval_EvalFrameDefault + 0x4e0a (0x56024df52e5a in ~/scratch/software/anaconda3/bin/python3)
frame #39: _PyFunction_FastCallKeywords + 0xfb (0x56024dee5e2b in ~/scratch/software/anaconda3/bin/python3)
frame #40: _PyEval_EvalFrameDefault + 0x416 (0x56024df4e466 in ~/scratch/software/anaconda3/bin/python3)
frame #41: _PyEval_EvalCodeWithName + 0x2f9 (0x56024de95d09 in ~/scratch/software/anaconda3/bin/python3)
frame #42: PyEval_EvalCodeEx + 0x44 (0x56024de96be4 in ~/scratch/software/anaconda3/bin/python3)
frame #43: PyEval_EvalCode + 0x1c (0x56024de96c0c in ~/scratch/software/anaconda3/bin/python3)
frame #44: + 0x22ca74 (0x56024dfada74 in ~/scratch/software/anaconda3/bin/python3)
frame #45: PyRun_FileExFlags + 0xa1 (0x56024dfb7de1 in ~/scratch/software/anaconda3/bin/python3)
frame #46: PyRun_SimpleFileExFlags + 0x1c3 (0x56024dfb7fd3 in ~/scratch/software/anaconda3/bin/python3)
frame #47: + 0x238105 (0x56024dfb9105 in ~/scratch/software/anaconda3/bin/python3)
frame #48: _Py_UnixMain + 0x3c (0x56024dfb922c in ~/scratch/software/anaconda3/bin/python3)
frame #49: __libc_start_main + 0xf3 (0x7f50be5db0b3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #50: + 0x1dce90 (0x56024df5de90 in ~/scratch/software/anaconda3/bin/python3)

JasonMoho commented 2 years ago

Ah this is with the filtered evaluation settings I sent above? I was hoping it wouldn't OOM.

That evaluation scenario is pretty memory-intensive, since it uses all 500,000 nodes as negatives to compute the MRR. You can try decreasing the evaluation batch size (see the example below), but that will make the evaluation process quite slow.
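For example, keeping the filtered settings from above but shrinking the batch size (100 is an arbitrary starting point; lower it further if you still run out of memory):

[evaluation]
batch_size=100
number_of_chunks=1
negative_sampling_access=All
evaluation_method=LinkPrediction
filtered_evaluation=true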

If you want to compare against the OGB leaderboards, I think the best approach is to export the trained embeddings from Marius and evaluate them with the OGB evaluators, along the lines of the sketch below.
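A minimal sketch of that evaluation path, assuming the node embeddings have been exported to a flat float32 binary file ordered by the original ogbl-ppa node IDs. The file path, embedding dimension, and dot-product scoring below are placeholders (not Marius's actual export format or decoder); only the OGB dataset/evaluator calls are the real library API:

import numpy as np
import torch
from ogb.linkproppred import LinkPropPredDataset, Evaluator

# Hypothetical export location and embedding dimension -- replace with
# whatever your Marius run actually produced.
EMB_PATH = "training_data/marius/embeddings.bin"
EMB_DIM = 100

# Load node embeddings (assumed float32, one row per ogbl-ppa node ID).
emb = torch.from_numpy(np.fromfile(EMB_PATH, dtype=np.float32).reshape(-1, EMB_DIM))

# Official test split: positive edges and sampled negative edges.
split_edge = LinkPropPredDataset(name="ogbl-ppa").get_edge_split()
pos_edge = torch.from_numpy(split_edge["test"]["edge"])      # [num_pos, 2]
neg_edge = torch.from_numpy(split_edge["test"]["edge_neg"])  # [num_neg, 2]

def score(edges):
    # Dot-product scoring as a stand-in; swap in your decoder's scoring
    # function (e.g. DistMult/ComplEx with relation embeddings) if needed.
    return (emb[edges[:, 0]] * emb[edges[:, 1]]).sum(dim=-1)

evaluator = Evaluator(name="ogbl-ppa")  # reports Hits@100 for this dataset
print(evaluator.eval({"y_pred_pos": score(pos_edge), "y_pred_neg": score(neg_edge)}))

The printed hits@100 value is directly comparable to the ogbl-ppa leaderboard numbers.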

May I ask what your intent is with training on this dataset? I can provide better recommendations and system configuration if I know what your end goal is.