pytorch / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
https://pytorch.org

Regression on CUDA 12.1 for vanilla transformer layer #111168

Open roywei opened 9 months ago

roywei commented 9 months ago

🐛 Describe the bug

Hi, we are benchmarking deep ViT models and found that the CUDA 12.1 binaries are actually slower than the CUDA 11.8 binaries for the PyTorch 2.1.0 release. This is unexpected, as we see better perf on other LLMs such as Megatron (GPT) and OPT. One potential reason is that those repos use flash-attention or xformers, which benefit from CUDA 12 and Transformer Engine on H100 GPUs, while Deep ViT uses a vanilla attention implementation: https://github.com/lucidrains/vit-pytorch/blob/main/vit_pytorch/deepvit.py. Still, we were not expecting a regression. Any ideas? Thanks!

Here is the output showing a ~10% regression on a single node.

PT 2.1.0 + CUDA 11.8

[1,0]<stdout>:time taken for the last 1 steps is 2.420312336000279, the throughput is 13.221434078579314 images/s, time taken for forward pass is 0.261821381000118, time taken for backward pass is 1.758593597000072
[1,0]<stdout>:Average elapsed time is 2.3975933989263996, throughput is 13.34671675953439, peak allocated memory is 25.845419008, Average forward time is 0.24408430756846333, Average backward time is 1.75420702246313

PT 2.1.0 + CUDA 12.1

[1,0]<stdout>:time taken for the last 1 steps is 2.7207883900000525, the throughput is 11.76129687910032 images/s, time taken for forward pass is 0.26708757999995214, time taken for backward pass is 1.9990846610003246
[1,0]<stdout>:Average elapsed time is 2.72146025645262, throughput is 11.75839328321167, peak allocated memory is 25.845419008, Average forward time is 0.26492700765261123, Average backward time is 2.001053710136823

Attached are the training script and steps to reproduce:

pip install vit-pytorch
pip install packaging
torchrun --nnodes=1 --nproc_per_node=8 train.py
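
For anyone without a multi-GPU setup, here is a minimal single-GPU timing sketch along the same lines; the DeepViT hyperparameters, batch size, and step count below are illustrative, not the exact settings from the attached train.py.

# Minimal single-GPU sketch of the benchmark loop; hyperparameters are
# illustrative placeholders, not the settings used in train.py.
import time
import torch
from vit_pytorch.deepvit import DeepViT

model = DeepViT(
    image_size=256, patch_size=32, num_classes=1000,
    dim=1024, depth=6, heads=16, mlp_dim=2048,
    dropout=0.1, emb_dropout=0.1,
).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

images = torch.randn(32, 3, 256, 256, device="cuda")
labels = torch.randint(0, 1000, (32,), device="cuda")

for step in range(10):
    torch.cuda.synchronize()
    start = time.perf_counter()
    loss = torch.nn.functional.cross_entropy(model(images), labels)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    torch.cuda.synchronize()  # make sure all CUDA work is done before reading the clock
    print(f"step {step}: {time.perf_counter() - start:.4f} s")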

Versions

PyTorch 2.1.0 with CUDA 11.8 vs. CUDA 12.1, tested on an AWS p5.48xlarge instance.

cc @ezyang @gchanan @zou3519 @kadeng @ptrblck

roywei commented 9 months ago

Scripts and full logs are attached: vit_benchmark.zip

drisspg commented 9 months ago

@danthe3rd I think you found that building the flash/mem_eff sources with -O3 on CUDA > 12 caused a ~10% perf regression, while with -O2 it did not. Is that right?

danthe3rd commented 9 months ago

I only saw the regression with the mem_eff one, not with flash. We use -O2 for ptxas as a workaround: https://github.com/facebookresearch/xformers/blob/main/setup.py#L288
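
For readers unfamiliar with that workaround, here is a rough sketch of how -O2 can be passed to ptxas when building a CUDA extension with PyTorch's cpp_extension; the extension name and source file are hypothetical placeholders, and this is not the exact xformers invocation linked above.

# Sketch: forwarding -O2 to ptxas while building a CUDA extension.
from torch.utils.cpp_extension import CUDAExtension

ext = CUDAExtension(
    name="my_attention_ext",             # hypothetical extension name
    sources=["my_attention_kernel.cu"],  # hypothetical source file
    # -Xptxas -O2 passes the optimization level to ptxas while nvcc itself
    # still compiles with -O3.
    extra_compile_args={"nvcc": ["-O3", "-Xptxas", "-O2"]},
)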

roywei commented 9 months ago

Hi @drisspg, not really. We see that CUDA 12 helps with models that use flash/mem_eff implementations (Megatron/OPT) or the SDPA kernels, but we were not expecting a regression for a self-implemented attention module like this one: https://github.com/lucidrains/vit-pytorch/blob/main/vit_pytorch/deepvit.py#L22
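
For context, a minimal sketch contrasting a hand-written softmax attention (the style used in deepvit.py, minus its re-attention step) with the fused SDPA path; the shapes and dtype are arbitrary examples.

# Illustrative contrast between hand-written attention and the fused SDPA path.
import torch
import torch.nn.functional as F

b, h, n, d = 8, 16, 1024, 64
q, k, v = (torch.randn(b, h, n, d, device="cuda", dtype=torch.float16) for _ in range(3))

# Vanilla attention: explicit matmul + softmax + matmul, launched as generic kernels.
scale = d ** -0.5
attn = (q @ k.transpose(-2, -1)) * scale
out_vanilla = attn.softmax(dim=-1) @ v

# Fused path: dispatches to flash / memory-efficient kernels where supported.
out_sdpa = F.scaled_dot_product_attention(q, k, v)

# Sanity check that the two paths agree numerically (up to fp16 tolerance).
print(torch.allclose(out_vanilla, out_sdpa, atol=1e-2))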

drisspg commented 9 months ago

Ohhh, I misinterpreted this. I am not sure why that would be; could you try profiling the two runs, one for CUDA 11.8 and one for CUDA 12.1, to see which ops are slower?

Here is the documentation on the profiler: https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html
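
A minimal sketch of what that profiling run could look like; the tiny MLP below is just a stand-in for the DeepViT model from the repro script.

# Sketch: profile one forward/backward pass under each CUDA build and compare the tables.
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
).cuda()
x = torch.randn(64, 1024, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    model(x).sum().backward()

# Sort by total CUDA time to see which kernels differ between the 11.8 and 12.1 builds.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))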

roywei commented 9 months ago

We will keep looking into it; putting this out here in case someone runs into similar issues. We're interested to see what it takes to get the best out of H100s.

ptrblck commented 9 months ago

CC @eqy, could you try to reproduce it, please?

eqy commented 9 months ago

@roywei Looking at the code, I couldn't find a sync before the timing functions (e.g., torch.cuda.synchronize()). Are you doing synchronization somewhere else? If not, could you check whether adding these syncs changes the benchmarking results?
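
For reference, a sketch of the synchronization pattern being suggested; timed_step is a hypothetical helper, and without the syncs the host clock mostly measures kernel launch time because CUDA ops run asynchronously.

# Hypothetical helper showing where the syncs go around a timed region.
import time
import torch

def timed_step(fn):
    torch.cuda.synchronize()   # drain any previously queued GPU work
    start = time.perf_counter()
    out = fn()
    torch.cuda.synchronize()   # wait for all kernels launched by fn() to finish
    return out, time.perf_counter() - start

# usage: out, elapsed = timed_step(lambda: model(images))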

eqy commented 9 months ago

A quick update on this: it looks like it is due to a GammaBetaBackward kernel being slower. I'll work on a smaller reproducer and forward it appropriately if necessary.
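
If it helps, here is a rough micro-benchmark sketch that isolates LayerNorm backward (where, as far as I understand, the GammaBetaBackward kernel computes the weight/bias gradients); the activation shape is a guess at something ViT-like, not taken from the actual model.

# Rough sketch: time LayerNorm forward + backward in isolation.
import time
import torch

ln = torch.nn.LayerNorm(1024).cuda()
x = torch.randn(64, 257, 1024, device="cuda", requires_grad=True)

for _ in range(3):                      # warmup
    ln(x).sum().backward()
torch.cuda.synchronize()

start = time.perf_counter()
for _ in range(100):
    ln(x).sum().backward()
torch.cuda.synchronize()
print(f"{(time.perf_counter() - start) / 100 * 1e3:.3f} ms per fwd+bwd")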

malfet commented 8 months ago

Considering the perf regression, I propose we keep 11.8 as the older CUDA version for the 2.2 release.