marius-team / marius

Large scale graph learning on a single machine.
https://marius-project.org
Apache License 2.0

[Question] Large GPU Memory Usage & Early Exit of MariusGNN-Eurosys23 #129

Closed. CSLabor closed this issue 1 year ago.

CSLabor commented 1 year ago

Hi, thank you for this excellent work!

I am trying to reproduce some of the results on a 2080Ti (11GB) but seem to have run into a GPU memory issue. Specifically, when I ran python3 experiment_manager/run_experiment.py --experiment papers100m with the default papers100M config, training exited abnormally fast, with no error:

==== ogbn_papers100m already preprocessed =====
=========================================
Running: marius 
Configuration: experiment_manager/system_comparisons/configs/ogbn_papers100m/marius_gs.yaml
Saving results to: results/ogbn_papers100m/marius_gs
[2022-12-22 16:26:25.906] [info] [marius.cpp:29] Start initialization
[2022-12-22 16:31:01.955] [info] [marius.cpp:66] Initialization Complete: 276.048s
[2022-12-22 16:32:21.671] [info] [trainer.cpp:41] ################ Starting training epoch 1 ################
Complete. Total runtime: 366.0947s

But after I modified the config to use a smaller hidden dimension (16 instead of 256) and a smaller training batch size (600 instead of 1000), the system ran normally:

==== ogbn_papers100m already preprocessed =====
=========================================
Running: marius 
Configuration: experiment_manager/system_comparisons/configs/ogbn_papers100m/marius_gs.yaml
Saving results to: results/ogbn_papers100m/marius_gs
Overwriting previous experiment.
[2022-12-22 16:22:29.642] [info] [marius.cpp:29] Start initialization
[2022-12-22 16:27:13.260] [info] [marius.cpp:66] Initialization Complete: 283.617s
[2022-12-22 16:28:12.311] [info] [trainer.cpp:41] ################ Starting training epoch 1 ################
[2022-12-22 16:28:23.558] [info] [reporting.cpp:167] Nodes processed: [121200/1207179], 10.039936%
[2022-12-22 16:28:34.565] [info] [reporting.cpp:167] Nodes processed: [242400/1207179], 20.079872%
[2022-12-22 16:28:43.379] [info] [reporting.cpp:167] Nodes processed: [363600/1207179], 30.119808%
[2022-12-22 16:28:51.657] [info] [reporting.cpp:167] Nodes processed: [484800/1207179], 40.159744%
[2022-12-22 16:28:58.793] [info] [reporting.cpp:167] Nodes processed: [606000/1207179], 50.199680%
....

So I suspect that the abnormally early Complete message actually indicates a GPU OOM here?

If so, MariusGNN seems to use significantly more GPU memory than DGL, since in DGL I can easily scale the batch size to over 8000 with the same fanouts, hidden dimension, and GPU. Is this observation consistent with MariusGNN's internal design?

I would be very grateful if you could help explain. Thank you!

rogerwaleffe commented 1 year ago

Hi,

Thanks for posting this issue!

I agree with you that the abnormally fast exit is likely due to a GPU OOM issue. In the past I've seen error messages when this occurs, but given that you were able to get things to run by modifying a few config parameters, I think we can assume that an OOM was the cause.

A few things to note. First, we ran the default papers100M config on a V100 with 16GB of memory (and I think it was using about 15GB of the 16GB), so it is indeed possible that it would throw an OOM on an 11GB 2080Ti.

However, I do not think it is necessary to reduce the hidden dimension or the batch size. If I recall correctly, synchronous training of the default config (i.e., setting sync: true) uses about 3-4GB of GPU memory, so it should be possible to train the default config in your setup. I'm guessing the issue has to do with the async pipeline config. Specifically, batch_device_queue_size: 64 in the config means that up to 64 batches can be waiting for processing in GPU memory at any given time. This may not have been an issue on the V100 because its compute can process batches nearly as quickly as they arrive (so the GPU queue never holds more than a few), but it could be an issue on the 2080Ti.

I would suggest starting with sync training and verifying that it works, then moving to async training with queue sizes of 2 and increasing from there. You can set the staleness bound to the sum of the two queue sizes.
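
For concreteness, the pipeline settings above live in the experiment YAML (experiment_manager/system_comparisons/configs/ogbn_papers100m/marius_gs.yaml). A rough sketch of the relevant block is below; only sync and batch_device_queue_size are mentioned in this thread, so treat the nesting and the remaining key names as assumptions to verify against your config file:

training:
  pipeline:
    sync: true                        # step 1: confirm synchronous training fits in 11GB
    # For async training afterwards, start with small device queues and grow them:
    # sync: false
    # batch_device_queue_size: 2      # default config uses 64
    # gradients_device_queue_size: 2  # assumed name of the second device-side queue
    # staleness_bound: 4              # set to the sum of the two device queue sizes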

Hopefully this resolves the memory issues you are having. I would expect sync MariusGNN training to have GPU memory usage similar to DGL's. Let me know what you find given the above.

CSLabor commented 1 year ago

Hi @rogerwaleffe, thanks for your reply!

The GPU memory issue indeed came from the async setting. After switching to sync training, the memory consumption of MariusGNN is similar to that of DGL. Thank you for your help!

And I have a small follow-up question: after training all epochs, I see that peak_valid_acc is reported in the results summary. Do you report the peak validation accuracy (averaged over several runs) in the paper?

rogerwaleffe commented 1 year ago

For node classification, yes, we reported peak validation accuracy averaged over three runs. The issue with the node classification accuracy is that there can be a lot of variation between epochs, even after the model is close to convergence. In a real setting, it would be reasonable to monitor the validation accuracy and checkpoint the model which achieved the best performance.

CSLabor commented 1 year ago

Hi @rogerwaleffe! Got it, thank you for your help!