mario-linov opened this issue 9 months ago
Thanks for reporting this issue :) I see very sparse GPU compute usage in my environment, and so I feel like these examples may be too small to see the benefits from compile.
To possibly see some benefits, you could try increasing the size of the model or of the input.
On my machine, I see the following times:
Running `examples/gcn.py`: Median time per epoch: 0.0057s
Running `examples/compile/gcn.py`: Median time per epoch: 0.0029s
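For context, here's a minimal sketch of how such per-epoch medians can be measured, with CUDA synchronization so the GPU work is actually counted (the model, data, optimizer, and loss below are placeholders, not the exact example code):

```python
import statistics
import time

import torch

def median_epoch_time(model, data, num_epochs: int = 50) -> float:
    # Placeholder optimizer/loss; the actual examples run a full GCN training loop.
    optimizer = torch.optim.Adam(model.parameters())
    times = []
    for _ in range(num_epochs):
        if torch.cuda.is_available():
            torch.cuda.synchronize()  # don't start the clock with kernels still queued
        start = time.perf_counter()
        optimizer.zero_grad()
        out = model(data.x, data.edge_index)
        out.sum().backward()  # dummy loss, just to exercise the backward pass
        optimizer.step()
        if torch.cuda.is_available():
            torch.cuda.synchronize()  # count the epoch's full GPU work
        times.append(time.perf_counter() - start)
    return statistics.median(times)
```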
Thank you for your answers.
On my RTX 3090 I get:

Running `examples/gcn.py`: Median time per epoch: 0.0024s
Running `examples/compile/gcn.py`: Median time per epoch: 0.0029s
When I run it on the CPU I get:

Running `examples/gcn.py`: Median time per epoch: 0.0112s
Running `examples/compile/gcn.py`: Median time per epoch: 0.0109s
I tried increasing the batch size and the size of the model, and I do not see any improvement.
I am getting this warning message; could it be the reason?

`/miniconda3/envs/pyg/lib/python3.11/site-packages/torch/_inductor/compile_fx.py:135: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting torch.set_float32_matmul_precision('high') for better performance.`
Doing as indicated in this warning does not change the runtime. I do not know what I could be missing; I thought it was supposed to be straightforward.
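For reference, this is how I understand the setting is meant to be applied (a minimal sketch; placing it before the model is built and compiled is my assumption):

```python
import torch

# Enable TF32 tensor cores for float32 matmuls, as the warning suggests.
# This should run once, before the model is built and compiled.
torch.set_float32_matmul_precision('high')

# model = torch.compile(model)  # compilation happens afterwards
```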
If your script has a bottleneck in compute or memory bandwidth, setting `torch.set_float32_matmul_precision('high')` should improve the throughput (on GPUs that support TF32). `torch.compile` should also improve the throughput if it is bounded by memory bandwidth. However, since neither of these brings any improvement on your hardware, I suspect that the script is too small to benchmark and that the performance is bounded by some code outside the compiled model or by some other overhead, but maybe @rusty1s has other ideas.
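If per-kernel launch overhead is indeed the suspect, one thing worth trying (my suggestion, not something the examples currently do) is compile's overhead-reduction mode:

```python
import torch

# "reduce-overhead" uses CUDA graphs to cut Python and kernel-launch
# overhead, which helps most when the model itself is tiny.
model = torch.compile(model, mode="reduce-overhead")
```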
One thing you could try is to profile your script with and without `torch.compile` using https://pytorch.org/docs/main/profiler.html and compare the results to see what might be the cause. Here's an example using `torch.profiler.profile`: https://github.com/akihironitta/gist/blob/8ae6a01ecdb3471d2848c77aeeeb95b5a8288323/torch_contiguous/main.py#L38-L42
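Something along these lines should work for the comparison (a sketch based on the general `torch.profiler` docs; `model`, `x`, and `edge_index` are placeholders from your own script):

```python
import torch
from torch.profiler import profile, ProfilerActivity

def run_profile(model, x, edge_index, label):
    # Warm up first so torch.compile's one-time compilation cost
    # does not show up in the profile.
    for _ in range(3):
        model(x, edge_index)
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        for _ in range(10):
            model(x, edge_index)
    print(label)
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

# Hypothetical usage, once with the eager model and once compiled:
# run_profile(model, x, edge_index, "eager")
# run_profile(torch.compile(model), x, edge_index, "compiled")
```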
I agree that the GCN example might not be a perfect one for measuring speed-ups from `torch.compile`. With 2 layers and a feature dimension of 16, you probably won't see big improvements. What happens if you increase the feature size to something like 128 or 256?
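For example, a scaled-up variant along these lines (a sketch using PyG's `GCNConv`; the sizes are illustrative) gives the compiler matmuls that are large enough to be worth optimizing:

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class WiderGCN(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels=256, out_channels=7):
        super().__init__()
        self.conv1 = GCNConv(in_channels, hidden_channels)
        self.conv2 = GCNConv(hidden_channels, out_channels)

    def forward(self, x, edge_index):
        x = F.relu(self.conv1(x, edge_index))
        return self.conv2(x, edge_index)

# 1433 input features / 7 classes match the Cora dataset used by the example.
model = WiderGCN(in_channels=1433).cuda()
model = torch.compile(model)
```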
🐛 Describe the bug
I have run the scripts in [examples/compile](https://github.com/pyg-team/pytorch_geometric/tree/master/examples/compile) to compare the performance of the compiled vs. the non-compiled models. The improvement with the compiled model is only 4%. I am using PyTorch 2.1.2 and PyG 2.4.0 on an RTX 3090.
Versions
PyTorch 2.1.2 and PyG 2.4.0