pyg-team / pytorch_geometric

Graph Neural Network Library for PyTorch
https://pyg.org
MIT License

torch_geometric.compile does not speed up training #8779

Open mario-linov opened 8 months ago

mario-linov commented 8 months ago

🐛 Describe the bug

I have run the scripts in [examples/compile](https://github.com/pyg-team/pytorch_geometric/tree/master/examples/compile) to compare the performance of the compiled and non-compiled models. The improvement with the compiled model is only 4%. I am using PyTorch 2.1.2 and PyG 2.4.0 on an RTX 3090.
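For reference, here is a minimal sketch of how the compiled run is set up in those examples (the model definition and sizes are illustrative, not the exact example code):

```python
import torch
import torch.nn.functional as F
import torch_geometric
from torch_geometric.datasets import Planetoid
from torch_geometric.nn import GCNConv

# Illustrative 2-layer GCN, roughly mirroring examples/gcn.py.
class GCN(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels):
        super().__init__()
        self.conv1 = GCNConv(in_channels, hidden_channels)
        self.conv2 = GCNConv(hidden_channels, out_channels)

    def forward(self, x, edge_index):
        x = F.relu(self.conv1(x, edge_index))
        return self.conv2(x, edge_index)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
dataset = Planetoid(root='data/Planetoid', name='Cora')
data = dataset[0].to(device)

model = GCN(dataset.num_features, 16, dataset.num_classes).to(device)
# torch_geometric.compile wraps torch.compile with GNN-friendly defaults (PyG 2.4).
model = torch_geometric.compile(model)
```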

Versions

PyTorch 2.1.2 and PyG 2.4.0

akihironitta commented 8 months ago

Thanks for reporting this issue :) I see very sparse GPU compute usage in my environment, and so I feel like these examples may be too small to see the benefits from compile.

To possibly see some benefits, here are a few things you could try:

rusty1s commented 8 months ago

On my machine, I see the following times:

Running examples/gcn.py: Median time per epoch: 0.0057s
Running examples/compile/gcn.py: Median time per epoch: 0.0029s

mario-linov commented 8 months ago

Thank you for your answers.

On my RTX 3090 I get:

Running examples/gcn.py: Median time per epoch: 0.0024s
Running examples/compile/gcn.py: Median time per epoch: 0.0029s

When I run it on the CPU, I get:

Running examples/gcn.py: Median time per epoch: 0.0112s
Running examples/compile/gcn.py: Median time per epoch: 0.0109s

I tried increasing the batch size and the size of the model, and I do not see any improvement.

I am getting this warning message. Could it be the reason?

/miniconda3/envs/pyg/lib/python3.11/site-packages/torch/_inductor/compile_fx.py:135: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting torch.set_float32_matmul_precision('high') for better performance.

When I do as indicated in the warning, the runtime does not change.

I do not know what I could be missing. I thought it was supposed to be straightforward.

akihironitta commented 8 months ago

If your script has a bottleneck in compute or memory bandwidth, setting torch.set_float32_matmul_precision('high') should improve the throughput (on GPUs that support TF32). torch.compile should also improve the throughput if the throughput is bounded by memory bandwidth.
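For reference, a minimal sketch of where that setting would go (its exact placement is an assumption; it only needs to run before the float32 matmuls are executed):

```python
import torch

# Let float32 matmuls use TF32 tensor cores (Ampere+ GPUs such as the RTX 3090).
torch.set_float32_matmul_precision('high')

# ... then build, compile, and train the model as before.
```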

However, since neither of these introduces any improvement on your hardware, I suspect that the script is too small to benchmark and that the performance is bounded by some code outside the compiled model or by some overhead. Maybe @rusty1s has some other ideas.

One thing you could try is to profile your script with and without torch.compile using https://pytorch.org/docs/main/profiler.html and compare their results to see what might be the cause. Here's an example using torch.profiler.profile: https://github.com/akihironitta/gist/blob/8ae6a01ecdb3471d2848c77aeeeb95b5a8288323/torch_contiguous/main.py#L38-L42
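A condensed sketch of that approach (`train_step`, `model`, `optimizer`, and `data` are placeholders for the objects in the example script):

```python
import torch
import torch.nn.functional as F
from torch.profiler import ProfilerActivity, profile

def train_step(model, optimizer, data):
    # Placeholder training step; substitute the loop from the example script.
    optimizer.zero_grad()
    out = model(data.x, data.edge_index)
    loss = F.cross_entropy(out[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()

# Warm up first so compilation time is not attributed to the profiled steps.
for _ in range(5):
    train_step(model, optimizer, data)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(10):
        train_step(model, optimizer, data)

# Compare this table between the eager and the compiled runs.
print(prof.key_averages().table(sort_by='cuda_time_total', row_limit=20))
```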

rusty1s commented 8 months ago

I agree that the GCN example might not be a perfect example for measuring speed-ups from torch.compile. With 2 layers and a feature dimension of 16, you probably won't see big improvements. What happens if you increase the feature size to something like 128 or 256?
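For example, assuming the illustrative GCN class sketched earlier, the change would look roughly like this (256 is just a suggestion):

```python
# A wider hidden layer gives the GPU more work per kernel, which makes any
# speed-up from compilation easier to measure.
model = GCN(dataset.num_features, 256, dataset.num_classes).to(device)
model = torch_geometric.compile(model)
```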