lightaime / deep_gcns_torch

PyTorch repo for DeepGCNs (ICCV'2019 Oral, TPAMI'2021), DeeperGCN (arXiv'2020) and GNN1000 (ICML'2021): https://www.deepgcns.org

RuntimeError: [enforce fail at CPUAllocator.cpp:71] #92

Closed: xo28 closed this issue 2 years ago

xo28 commented 2 years ago

Hi!

When I try to run main.py on ogbn-products, I get this error: RuntimeError: [enforce fail at CPUAllocator.cpp:71] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 64597662208 bytes. Error code 12 (Cannot allocate memory)

This error occurs in the test function at model.to(cpu). Why do we need to run the test on CPU? And why does it require about 60 GB of memory? Thanks!

lightaime commented 2 years ago

@xo28 Sorry for the late reply. The current implementation of full-batch testing on ogbn-products is not memory efficient. It takes about 405 GB of RAM to run inference on the whole graph.

An alternative way to evaluate the model is multi-mini-batch inference, as in the ogbn_proteins example, which averages the predictions over multiple different random partitionings: https://github.com/lightaime/deep_gcns_torch/blob/7885181484978fbf3839bf0e929fb1c2484d0a7d/examples/ogb_eff/ogbn_proteins/test.py#L145

NVIDIA AMP (Automatic Mixed Precision) is also recommended to reduce inference memory: https://github.com/lightaime/deep_gcns_torch/blob/7885181484978fbf3839bf0e929fb1c2484d0a7d/examples/ogb_eff/ogbn_proteins/test.py#L144
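For concreteness, here is a minimal sketch of that idea (not the repo's exact code): randomly partition the nodes several times, run the model on each induced subgraph under AMP, and average the softmax predictions. The `model(x, edge_index)` call signature, the `num_classes` argument, and the uniform random partitioning are assumptions; test.py uses its own partitioning utilities.

```python
import torch
from torch_geometric.utils import subgraph

@torch.no_grad()
def multi_partition_inference(model, data, num_classes,
                              num_parts=10, num_runs=5, device='cuda'):
    """Average softmax predictions over several random node partitionings."""
    model.eval().to(device)
    probs = torch.zeros(data.num_nodes, num_classes)  # accumulate on CPU
    for _ in range(num_runs):
        perm = torch.randperm(data.num_nodes)  # one random partitioning per run
        for part in perm.chunk(num_parts):
            part, _ = part.sort()  # sorted indices keep relabeling unambiguous
            # Induced subgraph on this node subset, relabeled to 0..len(part)-1.
            sub_edge_index, _ = subgraph(part, data.edge_index,
                                         relabel_nodes=True,
                                         num_nodes=data.num_nodes)
            with torch.cuda.amp.autocast():  # AMP to reduce GPU memory
                out = model(data.x[part].to(device), sub_edge_index.to(device))
            probs[part] += out.float().softmax(dim=-1).cpu()
    return probs.argmax(dim=-1)  # per-node prediction averaged over runs
```

Each node appears in exactly one partition per run, so every run contributes one softmax vector per node; averaging over runs is what gives the test-time-augmentation effect.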

Hope this helps.

xo28 commented 2 years ago

Thank you for your response! Have you tried testing ogbn-products the way it is done for ogbn-proteins before?

lightaime commented 2 years ago

Single mini-batch inference is used during training on ogbn-products, but sorry, I have not tried multiple mini-batch inference on ogbn-products. We should expect multiple mini-batch inference to yield better results, since multiple inferences can be seen as a form of test-time data augmentation.

xo28 commented 2 years ago

Got it, I'll give it a try. Thanks!

xo28 commented 2 years ago

Hi Guohao,

I added graph partitioning to the model, but the accuracy over 10 subgraphs is about 72%, which is much lower than 80.98%:

09/19 12:36:28 AM {'highest_valid': 0.9017877578007781, 'final_train': 0, 'final_test': 0.7260320520032841, 'highest_train': 0}

Are you using the same training settings as reported in the repo? Thanks!

lightaime commented 2 years ago

Hi @xo28. How many inference runs did you do to get the predictions? As mentioned, the graph should be partitioned multiple times and the predictions averaged.

It would be better if you could evaluate your trained model on a CPU with enough RAM. Alternatively, you could send me the checkpoint and I can look into it.

lightaime commented 2 years ago

Hi @xo28. I tried multi-inference with mini-batch sampling on ogbn-products. The performance gap compared to full-batch inference is significant. I tested one trained model with full-batch inference on CPU and got a 'final_test' accuracy of 0.8114, while mini-batch sampling on GPU with 5 partitions and 10 partitions gave 0.7845 and 0.7561 accuracy respectively. Therefore, to obtain the best performance, please conduct the evaluation on CPU with enough RAM (405+ GB).
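For reference, a minimal sketch of the full-batch CPU evaluation, assuming `model`, a `data` object with x/y/edge_index, `split_idx` from the OGB dataset's get_idx_split(), and the standard ogbn-products Evaluator; the `model(x, edge_index)` signature is an assumption:

```python
import torch
from ogb.nodeproppred import Evaluator

evaluator = Evaluator(name='ogbn-products')
model = model.cpu().eval()

with torch.no_grad():
    out = model(data.x, data.edge_index)   # one pass over the full graph (~405 GB RAM)
    y_pred = out.argmax(dim=-1, keepdim=True)

test_idx = split_idx['test']
acc = evaluator.eval({
    'y_true': data.y[test_idx],            # OGB expects shape (num_samples, 1)
    'y_pred': y_pred[test_idx],
})['acc']
print(f"final_test: {acc:.4f}")
```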

lightaime commented 2 years ago

A memory-efficient implementation is now available in PyG: https://github.com/pyg-team/pytorch_geometric/blob/master/examples/rev_gnn.py. With it, you can do full-batch testing on a GPU with more than 20 GB of memory. Closing this issue. Let me know if you have any further questions.
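The memory savings in that example come from grouped reversible residual connections, which recompute intermediate activations during the backward pass instead of caching them. A minimal sketch of the pattern, assuming a recent PyG version that ships GroupAddRev; the layer sizes are illustrative, not the example's exact configuration:

```python
import torch
from torch.nn import LayerNorm
from torch_geometric.nn import GroupAddRev, SAGEConv

class RevBlock(torch.nn.Module):
    """Sub-module wrapped by GroupAddRev; its activations are recomputed
    on the backward pass rather than stored during the forward pass."""
    def __init__(self, channels):
        super().__init__()
        self.norm = LayerNorm(channels, elementwise_affine=True)
        self.conv = SAGEConv(channels, channels)

    def reset_parameters(self):
        self.norm.reset_parameters()
        self.conv.reset_parameters()

    def forward(self, x, edge_index):
        return self.conv(self.norm(x).relu(), edge_index)

hidden_channels, num_groups, num_layers = 160, 2, 7  # illustrative sizes
convs = torch.nn.ModuleList(
    # Channels are split into num_groups slices; each reversible layer
    # operates on hidden_channels // num_groups of them.
    GroupAddRev(RevBlock(hidden_channels // num_groups), num_groups=num_groups)
    for _ in range(num_layers)
)
```

Because activation memory no longer grows with depth, deep models that previously needed full-batch CPU evaluation can fit full-batch testing on a single large-memory GPU.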