googlecolab / colabtools

Python libraries for Google Colaboratory

Cannot train HF-model with mixed precision when using A100 and L4 GPUs #4948

Open bialczykk opened 2 weeks ago

bialczykk commented 2 weeks ago

Describe the current behavior
I just purchased Colab Pro+ because I need faster GPUs for fine-tuning an LLM for causal language modeling with Hugging Face's Trainer API. Everything works fine on the T4 runtime with mixed precision (fp16 passed as a training argument), which significantly speeds up training. However, when I run the same code with exactly the same configuration on the A100 or L4 runtimes, the Trainer starts, but the Train and Validation loss are not computed during the evaluation steps, whereas on the T4 they are computed normally. When I remove the mixed-precision arguments, training runs normally on the A100 and L4, but without any significant speed-up, or even slower than on the T4.
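For reference, a minimal sketch of the kind of training configuration I mean, using the standard TrainingArguments API (the output directory, batch size, and step counts below are placeholders, not my exact values):

```python
from transformers import TrainingArguments

# Placeholder values; not my exact configuration.
training_args = TrainingArguments(
    output_dir="finetune-out",
    per_device_train_batch_size=4,
    num_train_epochs=3,
    eval_strategy="steps",   # named "evaluation_strategy" on older transformers releases
    eval_steps=100,
    logging_steps=100,
    fp16=True,  # the mixed-precision flag that works on the T4 but not on the A100/L4
    # bf16=True is the Ampere/Ada-native alternative supported by the A100 and L4;
    # I have not established whether switching to it changes the behavior.
)
```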

Describe the expected behavior
I expect the Trainer API to run normally on the A100 runtime with mixed precision enabled, i.e. the training and validation loss should be computed across epochs, just as on the T4.

What web browser you are using
Chrome

Additional context
The following warning also appears when trainer.train() starts:

/usr/local/lib/python3.10/dist-packages/torch/autograd/graph.py:825: UserWarning: cuDNN SDPA backward got grad_output.strides() != output.strides(), attempting to materialize a grad_output with matching strides... (Triggered internally at ../aten/src/ATen/native/cudnn/MHA.cpp:674.)
  return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass

(screenshot attached)
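Since the warning points at the cuDNN SDPA backward kernel, here is a sketch of how the attention backend could be switched to rule that path out (standard PyTorch/Transformers options; the checkpoint name is a placeholder, and I have not verified this as a fix):

```python
import torch
from transformers import AutoModelForCausalLM

# Option 1: ask Transformers for a non-SDPA attention implementation.
# "my-base-model" is a placeholder for the actual checkpoint being fine-tuned.
model = AutoModelForCausalLM.from_pretrained(
    "my-base-model",
    torch_dtype=torch.float16,
    attn_implementation="eager",  # instead of the default "sdpa"
)

# Option 2: keep SDPA but disable its cuDNN backend specifically,
# if the installed PyTorch version exposes this toggle.
if hasattr(torch.backends.cuda, "enable_cudnn_sdp"):
    torch.backends.cuda.enable_cudnn_sdp(False)
```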

Link to a minimal, public, self-contained notebook that reproduces this issue.
To be added...

cperry-goog commented 2 weeks ago

I don't know that this is a Colab issue; I'd wager a lot of it has to do with NVIDIA support for the modules you're using. Do you have data suggesting this is not a hardware issue?