Closed xo28 closed 2 years ago
@xo28 Sorry for the late reply. The current implementation of full-batch testing on ogbn-products is not memory efficient: it takes about 405 GB of RAM to run inference on the whole graph.
An alternative way to evaluate the model is multi-mini-batch inference, as in the ogbn-proteins example, which averages the predictions over multiple different random partitionings: https://github.com/lightaime/deep_gcns_torch/blob/7885181484978fbf3839bf0e929fb1c2484d0a7d/examples/ogb_eff/ogbn_proteins/test.py#L145. NVIDIA AMP (Automatic Mixed Precision) is also recommended to reduce inference memory: https://github.com/lightaime/deep_gcns_torch/blob/7885181484978fbf3839bf0e929fb1c2484d0a7d/examples/ogb_eff/ogbn_proteins/test.py#L144
Hope this helps.
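To make the suggestion above concrete, here is a minimal sketch of the averaging idea: run inference several times, each time over a fresh random partitioning of the nodes, and average the per-node predictions. The function `predict_part` is a hypothetical callback (not from the repo) that runs the model on one partition's node ids and returns their class probabilities; NumPy stands in for the actual PyTorch pipeline.

```python
import numpy as np

def multi_partition_inference(predict_part, num_nodes, num_classes,
                              num_partitions=5, num_rounds=3, seed=0):
    """Average predictions over several random partitionings of the graph.

    predict_part: hypothetical callback taking an array of node ids (one
    partition) and returning an array of per-node class probabilities.
    """
    rng = np.random.default_rng(seed)
    avg = np.zeros((num_nodes, num_classes))
    for _ in range(num_rounds):
        # A fresh random partitioning each round acts like
        # test-time data augmentation.
        perm = rng.permutation(num_nodes)
        for part in np.array_split(perm, num_partitions):
            avg[part] += predict_part(part)
    return avg / num_rounds
```

Each node is scored exactly once per round, so dividing by the number of rounds gives a proper average of the per-round predictions.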
Thank you for your response! Have you tried doing the testing the same way as ogbn-proteins before?
Single mini-batch inference is used during training on ogbn-products, but sorry, I did not try multiple mini-batch inference on ogbn-products. We should expect multiple mini-batch inference to yield better results, since averaging over multiple inferences can be seen as a form of test-time data augmentation.
Got it, I'll give it a try. Thanks!
Hi Guohao,
I added graph partitions to the model, but the accuracy over 10 subgraphs is 72%, which is much lower than 80.98%:

09/19 12:36:28 AM {'highest_valid': 0.9017877578007781, 'final_train': 0, 'final_test': 0.7260320520032841, 'highest_train': 0}

Am I using the same training setting reported in the repo? Thanks!!
Hi @xo28. How many rounds of inference did you run to get the predictions? As mentioned, the graph should be partitioned multiple times to obtain averaged predictions.
It would be better if you could evaluate your trained model on CPUs with enough RAM. Alternatively, you could send me the checkpoint and I can look into it.
Hi @xo28. I tried multi-inference with mini-batch sampling on ogbn-products. The performance gap compared to full-batch inference is significant: I tested one trained model with full-batch inference on CPU and got a 'final_test' accuracy of 0.8114, while mini-batch sampling on GPU with 5 partitions and 10 partitions gave 0.7845 and 0.7561 respectively. Therefore, to obtain the best performance, please run the evaluation on a CPU with enough RAM (405+ GB).
A memory-efficient implementation is available in PyG: https://github.com/pyg-team/pytorch_geometric/blob/master/examples/rev_gnn.py. Now you can do full-batch testing on a GPU with more than 20 GB of memory. Closing this issue. Let me know if you have any further questions.
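The memory saving in the PyG example comes from reversible residual connections: because each block's inputs can be recomputed exactly from its outputs, activations do not need to be stored for every layer. Here is a minimal NumPy sketch of that invertibility property, with `F` and `G` as stand-in functions for the actual GNN sub-layers (they are placeholders, not the repo's layers):

```python
import numpy as np

# Placeholder sub-functions standing in for GNN message-passing layers.
def F(x):
    return np.tanh(x)

def G(x):
    return 0.5 * x

def rev_forward(x1, x2):
    # Reversible residual block: the outputs alone suffice to recover the
    # inputs, so intermediate activations need not be kept in memory.
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def rev_inverse(y1, y2):
    # Exact reconstruction of the inputs from the outputs.
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2
```

During backpropagation (or layer-by-layer inference), each block's inputs are reconstructed on the fly instead of being cached, which is what brings full-batch memory down to a single GPU.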
Hi!
When I try to run main.py in ogbn-products, I get this error:
RuntimeError: [enforce fail at CPUAllocator.cpp:71] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 64597662208 bytes. Error code 12 (Cannot allocate memory)
This error occurs in the test function at `model.to(cpu)`. Why do we need to run the test on CPU, especially when it requires about 60 GB of memory? Thanks!
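As a rough back-of-envelope for where an allocation of that size can come from, consider ogbn-products' full graph (2,449,029 nodes, 61,859,140 edges) with float32 activations. The hidden dimension of 256 below is an assumption for illustration, not taken from the repo's config:

```python
# Back-of-envelope memory estimate for full-batch inference on ogbn-products.
NUM_NODES = 2_449_029      # ogbn-products node count
NUM_EDGES = 61_859_140     # ogbn-products edge count
HIDDEN = 256               # assumed hidden dimension (illustrative)
BYTES_PER_FLOAT = 4        # float32

# Per-layer node embeddings and per-layer edge messages, in GiB.
node_feats_gb = NUM_NODES * HIDDEN * BYTES_PER_FLOAT / 2**30
edge_msgs_gb = NUM_EDGES * HIDDEN * BYTES_PER_FLOAT / 2**30
```

Under these assumptions, a single layer's edge-level messages alone are already on the order of tens of GiB, i.e. the same ballpark as the failed 64,597,662,208-byte allocation, which is why full-batch testing is moved to CPU where more RAM is typically available.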