Open VeritasYin opened 2 years ago
Interesting observations.
1.) I am surprised to hear the GPU memory was exceeded for this dataset; it should easily fit inside GPU memory given that the dataset only has 500,000 or so nodes. I've run datasets that are an order of magnitude larger on a single GPU. This may indicate a memory leak somewhere. At what point in training did the system fail, and do you have a stack trace?
2.) The MRR for both configurations looks quite weird: the CPU one is obviously low and the GPU one is quite high. One difference between the two configurations is that the CPU config uses async training and the GPU config uses sync training, so my guess is that async training is preventing model convergence in the CPU case. You can turn on sync training with training.synchronous=true
3.) The GPU MRR is suspiciously high. This may be due to the evaluation configuration, which only samples 1000 nodes (500 uniformly and 500 by degree). You can try running filtered MRR (which will use all nodes to produce negatives) by changing the evaluation settings to:
[evaluation]
batch_size=1000
number_of_chunks=1
negative_sampling_access=All
evaluation_method=LinkPrediction
filtered_evaluation=true
If the hits@100 is inconsistent with leaderboard results for this dataset then that would indicate a bug somewhere and I can investigate further.
4.) The configuration for this dataset is not tuned for this specific dataset. The hyperparameters were chosen based on what worked well for the datasets in our paper (fb15k, livejournal, twitter and freebase86m). You will probably need to tune hyperparameters to get good model performance.
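To illustrate point 3, here is a minimal sketch of why ranking against a small sample of negatives tends to give a higher MRR than ranking against all nodes. It is not Marius code: the scores are synthetic, the function names and sizes are made up for illustration, and it does not model the filtering of true edges. The only thing it shows is that a positive's rank among a subset of negatives can never be worse than its rank among the full set.

```python
import random


def reciprocal_rank(pos_score, neg_scores):
    # Rank is 1 plus the number of negatives that score strictly higher.
    rank = 1 + sum(1 for s in neg_scores if s > pos_score)
    return 1.0 / rank


def compare_mrr(num_nodes=5000, num_edges=200, num_sampled=100, seed=0):
    """Return (MRR with sampled negatives, MRR with all negatives).

    All scores are synthetic (hypothetical), drawn from normal distributions.
    """
    rng = random.Random(seed)
    mrr_sampled = 0.0
    mrr_all = 0.0
    for _ in range(num_edges):
        pos = rng.gauss(2.0, 1.0)  # a good but imperfect positive score
        all_negs = [rng.gauss(0.0, 1.0) for _ in range(num_nodes)]
        sampled = rng.sample(all_negs, num_sampled)  # subset of all_negs
        mrr_sampled += reciprocal_rank(pos, sampled)
        mrr_all += reciprocal_rank(pos, all_negs)
    return mrr_sampled / num_edges, mrr_all / num_edges


if __name__ == "__main__":
    sampled, full = compare_mrr()
    print(f"MRR with sampled negatives: {sampled:.3f}")
    print(f"MRR with all negatives:     {full:.3f}")
```

The gap between the two numbers grows as the sample gets smaller relative to the node count, which is why a near-perfect MRR against 1000 sampled nodes says little about leaderboard-style metrics.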
The OOM error is triggered during inference; I have attached the trace log below.
[info] [12/13/21 16:51:20.207] ################ Finished training epoch 1 ################
[info] [12/13/21 16:51:20.207] Epoch Runtime (Before shuffle/sync): 11955ms
[info] [12/13/21 16:51:20.207] Edges per Second (Before shuffle/sync): 1775987.6
[info] [12/13/21 16:51:20.209] Edges Shuffled
[info] [12/13/21 16:51:20.209] Epoch Runtime (Including shuffle/sync): 11956ms
[info] [12/13/21 16:51:20.209] Edges per Second (Including shuffle/sync): 1775839
Traceback (most recent call last):
  File "~/scratch/software/anaconda3/bin/marius_train", line 8, in <module>
    sys.exit(main())
  File "~/scratch/software/anaconda3/lib/python3.7/site-packages/marius/console_scripts/marius_train.py", line 8, in main
    m.marius_train(len(sys.argv), sys.argv)
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 23.65 GiB total capacity; 22.35 GiB already allocated; 17.44 MiB free; 22.75 GiB reserved in total by PyTorch)
Exception raised from malloc at /opt/conda/conda-bld/pytorch_1607370141920/work/c10/cuda/CUDACachingAllocator.cpp:272 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f5080f6d8b2 in ~/scratch/software/anaconda3/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1:
Ah, this is with the filtered evaluation settings I sent above? I was hoping it wouldn't OOM.
That evaluation scenario is pretty memory intensive since it uses all 500,000 nodes as negatives to compute the MRR. You can try decreasing the evaluation batch size, but that will make the evaluation process quite slow.
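As a rough back-of-the-envelope sketch of why the batch size matters here: if each evaluation batch produces a dense matrix with one score per (edge, negative) pair, its size scales with batch size times the number of negatives. The function below is hypothetical and assumes 4-byte float32 scores and a single such matrix; the actual evaluator may hold several intermediates of this shape (for example, one per corrupted side plus a filter mask), so treat the result as a lower bound rather than a measurement.

```python
def score_matrix_bytes(batch_size, num_negatives, bytes_per_value=4):
    # One score per (edge in batch, negative node) pair; float32 assumed.
    return batch_size * num_negatives * bytes_per_value


def to_gib(num_bytes):
    return num_bytes / 2**30


if __name__ == "__main__":
    num_nodes = 500_000  # approximate node count mentioned in this thread
    for batch_size in (1000, 500, 100):
        size = score_matrix_bytes(batch_size, num_nodes)
        print(f"batch_size={batch_size}: about {to_gib(size):.2f} GiB per score matrix")
```

With batch_size=1000 and 500,000 negatives this is 2,000,000,000 bytes (about 1.86 GiB) for one matrix, and the cost drops linearly as the batch size is reduced.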
If you want to compare to the OGB leaderboards I think what might be best is to export the trained embeddings from Marius and evaluate them using OGB evaluators.
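For reference, here is a small sketch of the Hits@K computation as I understand the OGB link prediction evaluator to define it: a positive edge counts as a hit if its score is strictly higher than the K-th highest negative score. Everything in it is illustrative: the embeddings are a hand-written dict, dot-product scoring is an assumption (use whatever scoring function the model was trained with), and how the embeddings are exported from Marius is not shown. For numbers that are comparable to the leaderboard, the Evaluator class in the ogb package should be used with the dataset's own negative edges.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def score_edges(embeddings, edges):
    # Dot-product scoring is an assumption made for this sketch.
    return [dot(embeddings[src], embeddings[dst]) for src, dst in edges]


def hits_at_k(pos_scores, neg_scores, k):
    # With fewer than k negatives, every positive trivially counts as a hit.
    if len(neg_scores) < k:
        return 1.0
    kth_best_negative = sorted(neg_scores, reverse=True)[k - 1]
    hits = sum(1 for s in pos_scores if s > kth_best_negative)
    return hits / len(pos_scores)


if __name__ == "__main__":
    # Hypothetical 2-dimensional embeddings keyed by node id.
    embeddings = {
        0: [1.0, 0.0],
        1: [0.9, 0.1],
        2: [0.0, 1.0],
        3: [-1.0, 0.0],
    }
    pos = score_edges(embeddings, [(0, 1), (2, 1)])
    neg = score_edges(embeddings, [(0, 2), (0, 3), (1, 3)])
    print("Hits@2:", hits_at_k(pos, neg, k=2))
```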
May I ask what your intent is with training on this dataset? I can provide better recommendations and system configuration if I know what your end goal is.
Describe the bug
I got really weird results when evaluating on the dataset ogbl-ppa with CPU and with GPU, respectively. I had to change the memory to HostDevice for the GPU version because of its overwhelming GPU memory consumption (I thought the code could run within 16 GB, but it eventually exceeded 24 GB).
To Reproduce
Steps to reproduce the behavior: run the marius script with the configs ogbl_ppa_cpu.ini and ogbl_ppa_gpu.ini, which gives the following results:
[2021-12-12 02:47:01.554] [info] [trainer.cpp:68] ################ Starting training epoch 3 ################
[2021-12-12 02:49:36.904] [info] [trainer.cpp:94] Total Edges Processed: 44586862, Percent Complete: 0.100
[2021-12-12 02:52:19.113] [info] [trainer.cpp:94] Total Edges Processed: 46709862, Percent Complete: 0.200
[2021-12-12 02:55:00.754] [info] [trainer.cpp:94] Total Edges Processed: 48832862, Percent Complete: 0.300
[2021-12-12 02:57:44.074] [info] [trainer.cpp:94] Total Edges Processed: 50955862, Percent Complete: 0.400
[2021-12-12 03:00:25.467] [info] [trainer.cpp:94] Total Edges Processed: 53078862, Percent Complete: 0.500
[2021-12-12 03:03:09.531] [info] [trainer.cpp:94] Total Edges Processed: 55201862, Percent Complete: 0.600
[2021-12-12 03:06:03.269] [info] [trainer.cpp:94] Total Edges Processed: 57324862, Percent Complete: 0.700
[2021-12-12 03:08:51.169] [info] [trainer.cpp:94] Total Edges Processed: 59447862, Percent Complete: 0.800
[2021-12-12 03:11:32.560] [info] [trainer.cpp:94] Total Edges Processed: 61570862, Percent Complete: 0.900
[2021-12-12 03:14:13.438] [info] [trainer.cpp:94] Total Edges Processed: 63693862, Percent Complete: 1.000
[2021-12-12 03:14:13.558] [info] [trainer.cpp:99] ################ Finished training epoch 3 ################
[2021-12-12 03:14:13.558] [info] [trainer.cpp:104] Epoch Runtime (Before shuffle/sync): 1632004ms
[2021-12-12 03:14:13.558] [info] [trainer.cpp:105] Edges per Second (Before shuffle/sync): 13009.73
[2021-12-12 03:14:14.870] [info] [dataset.cpp:761] Edges Shuffled
[2021-12-12 03:14:14.870] [info] [trainer.cpp:113] Epoch Runtime (Including shuffle/sync): 1633315ms
[2021-12-12 03:14:14.870] [info] [trainer.cpp:114] Edges per Second (Including shuffle/sync): 12999.288
[2021-12-12 03:14:37.284] [info] [evaluator.cpp:95] Num Eval Edges: 6062562
[2021-12-12 03:14:37.284] [info] [evaluator.cpp:96] Num Eval Batches: 0
[2021-12-12 03:14:37.284] [info] [evaluator.cpp:97] Auc: 0.508, Avg Ranks: 490.966, MRR: 0.008, Hits@1: 0.006, Hits@5: 0.007, Hits@10: 0.007, Hits@20: 0.008, Hits@50: 0.008, Hits@100: 0.009
[2021-12-13 01:53:58.848] [info] [trainer.cpp:68] ################ Starting training epoch 3 ################
[2021-12-13 01:54:03.413] [info] [trainer.cpp:94] Total Edges Processed: 44583862, Percent Complete: 0.100
[2021-12-13 01:54:07.270] [info] [trainer.cpp:94] Total Edges Processed: 46703862, Percent Complete: 0.200
[2021-12-13 01:54:11.005] [info] [trainer.cpp:94] Total Edges Processed: 48823862, Percent Complete: 0.299
[2021-12-13 01:54:15.259] [info] [trainer.cpp:94] Total Edges Processed: 50943862, Percent Complete: 0.399
[2021-12-13 01:54:19.315] [info] [trainer.cpp:94] Total Edges Processed: 53063862, Percent Complete: 0.499
[2021-12-13 01:54:23.355] [info] [trainer.cpp:94] Total Edges Processed: 55183862, Percent Complete: 0.599
[2021-12-13 01:54:27.633] [info] [trainer.cpp:94] Total Edges Processed: 57303862, Percent Complete: 0.699
[2021-12-13 01:54:31.465] [info] [trainer.cpp:94] Total Edges Processed: 59423862, Percent Complete: 0.798
[2021-12-13 01:54:35.505] [info] [trainer.cpp:94] Total Edges Processed: 61543862, Percent Complete: 0.898
[2021-12-13 01:54:39.482] [info] [trainer.cpp:94] Total Edges Processed: 63663862, Percent Complete: 0.998
[2021-12-13 01:54:39.547] [info] [trainer.cpp:99] ################ Finished training epoch 3 ################
[2021-12-13 01:54:39.547] [info] [trainer.cpp:104] Epoch Runtime (Before shuffle/sync): 40698ms
[2021-12-13 01:54:39.547] [info] [trainer.cpp:105] Edges per Second (Before shuffle/sync): 521694.72
[2021-12-13 01:54:40.847] [info] [dataset.cpp:761] Edges Shuffled
[2021-12-13 01:54:40.847] [info] [trainer.cpp:113] Epoch Runtime (Including shuffle/sync): 41998ms
[2021-12-13 01:54:40.847] [info] [trainer.cpp:114] Edges per Second (Including shuffle/sync): 505546.25
[2021-12-13 01:54:58.952] [info] [evaluator.cpp:95] Num Eval Edges: 6062562
[2021-12-13 01:54:58.952] [info] [evaluator.cpp:96] Num Eval Batches: 0
[2021-12-13 01:54:58.952] [info] [evaluator.cpp:97] Auc: 0.992, Avg Ranks: 2.925, MRR: 0.991, Hits@1: 0.990, Hits@5: 0.991, Hits@10: 0.991, Hits@20: 0.992, Hits@50: 0.993, Hits@100: 0.995
Environment
List your operating system, and dependency versions
Python 3.7.10
pytorch 1.7.1 (py3.7_cuda10.1.243_cudnn7.6.3_0)
gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)
cmake version 3.16.3
GNU Make 4.2.1