roywei opened this issue 9 months ago
Scripts and full logs: vit_benchmark.zip
@danthe3rd I think you found that building the flash/mem_eff sources with O3 on CUDA > 12 caused a ~10% perf regression, while with O2 it did not. Is that right?
I only saw the regression with the mem_eff one, not with flash. We use O2
in ptxas as a workaround:
https://github.com/facebookresearch/xformers/blob/main/setup.py#L288
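A minimal sketch of what that workaround looks like (the exact flag list in xformers' setup.py may differ; this only illustrates the idea of keeping `-O3` for the nvcc frontend while capping ptxas at `-O2`):

```python
# Hypothetical sketch: compile CUDA sources with -O3 overall, but tell
# ptxas (the PTX assembler) to stay at -O2 to avoid the regression.
nvcc_flags = [
    "-O3",             # host/frontend optimization level
    "-Xptxas", "-O2",  # pass -O2 through to ptxas specifically
]
print(" ".join(nvcc_flags))
```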
Hi @drisspg, not really. We see that CUDA 12 helps with models that use flash/mem_eff implementations (Megatron/OPT) or models that use SDPA kernels, but we are not expecting a regression for self-implemented attention modules like this one: https://github.com/lucidrains/vit-pytorch/blob/main/vit_pytorch/deepvit.py#L22
Ohhh, I misinterpreted this. I am not sure why this is; could you try profiling the two runs, one for CUDA 11.8 and one for CUDA 12.1, to see which ops are slower?
Here is the documentation on the profiler: https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html
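A minimal sketch of the kind of comparison that recipe enables (the model and shapes here are placeholders, not the actual benchmark): run the same forward/backward under each wheel and diff the op-level tables.

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Placeholder workload; substitute the actual deep ViT step being benchmarked.
model = torch.nn.Sequential(torch.nn.Linear(256, 256), torch.nn.GELU())
x = torch.randn(32, 256)

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)
    model, x = model.cuda(), x.cuda()

with profile(activities=activities) as prof:
    for _ in range(10):
        model(x).sum().backward()

# Compare this table between the cu118 and cu121 builds to spot slow ops.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
```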
We will keep looking into it; putting this out here in case someone has similar issues. Interested to see what it takes to get the best out of H100s.
CC @eqy could you try to reproduce it, please?
@roywei Looking at the code, I couldn't find a sync before the timing functions (e.g., `torch.cuda.synchronize()`). Are you doing synchronization somewhere else? If not, could you check whether adding these syncs changes the benchmarking results?
A quick update on this, looks like it is due to a GammaBetaBackward kernel being slower. I'll work on a smaller reproducer and forward appropriately if necessary.
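For context, GammaBetaBackward is the kernel that accumulates gradients for LayerNorm's affine parameters (gamma/beta). A hypothetical minimal reproducer, assuming ViT-like shapes, would isolate just a LayerNorm forward and backward under each CUDA build:

```python
import torch

# Hypothetical shapes (batch, tokens, hidden) roughly matching a ViT block;
# the actual reproducer mentioned above may differ.
device = "cuda" if torch.cuda.is_available() else "cpu"
ln = torch.nn.LayerNorm(1024).to(device)
x = torch.randn(64, 197, 1024, device=device, requires_grad=True)

out = ln(x)
out.sum().backward()  # on CUDA this exercises the gamma/beta gradient kernel

print(ln.weight.grad.shape, ln.bias.grad.shape)
```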
Considering the perf regression, I propose we keep 11.8 as the older CUDA version for the 2.2 release.
🐛 Describe the bug
Hi, we are benchmarking deep ViT models and found that the CUDA 12.1 binaries are actually slower than CUDA 11.8 for the PyTorch 2.1.0 release. This is not expected, as we see better perf on other LLMs like Megatron (GPT) and OPT. One potential reason is that those repos use flash-attention or xformers, which benefit from CUDA 12 and Transformer Engine on H100 GPUs, while deep ViT uses a vanilla attention implementation: https://github.com/lucidrains/vit-pytorch/blob/main/vit_pytorch/deepvit.py. But we were not expecting a regression. Any ideas? Thanks!
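For reference, "vanilla" here means attention built from explicit matmul + softmax ops, with no SDPA/flash fused kernels. A simplified sketch of that pattern (not the deepvit code itself, which also adds its re-attention step):

```python
import torch

def vanilla_attention(q, q_scale=None, *, k=None, v=None):
    """Explicit (unfused) scaled dot-product attention."""
    scale = q_scale if q_scale is not None else q.shape[-1] ** -0.5
    attn = (q @ k.transpose(-2, -1)) * scale  # (B, H, N, N) score matrix
    attn = attn.softmax(dim=-1)
    return attn @ v                           # (B, H, N, head_dim)

# Toy shapes: batch=2, heads=8, tokens=16, head_dim=64.
q = k = v = torch.randn(2, 8, 16, 64)
out = vanilla_attention(q, k=k, v=v)
print(out.shape)  # torch.Size([2, 8, 16, 64])
```

Because every intermediate (including the full N×N score matrix) is materialized, each op is dispatched separately, so this path is more sensitive to per-kernel regressions than the fused flash/mem_eff paths.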
Here is the output showing a ~10% regression on a single node (PT 2.1.0 + CUDA 11.8):
PT 2.1.0 + CUDA 12.1:
Attached: training script and reproducible steps.
Versions
PyTorch 2.1.0, CUDA 11.8 vs CUDA 12.1, tested on AWS p5.48xlarge
cc @ezyang @gchanan @zou3519 @kadeng @ptrblck