googlecolab / colabtools

Python libraries for Google Colaboratory

Cannot train HF-model with mixed precision when using A100 and L4 GPUs #4948

Open bialczykk opened 2 weeks ago

bialczykk commented 2 weeks ago

Describe the current behavior
I just purchased Colab Pro+ because I need faster GPUs for fine-tuning an LLM for causal language modeling with Hugging Face's Trainer API. Everything works fine on the T4 runtime with mixed precision (fp16 passed as a training argument), which significantly speeds up training. However, when I run the same code with exactly the same configuration on the A100 or L4 runtimes, the Trainer starts, but the Train and Validation loss are not computed during the evaluation steps, whereas on the T4 they are computed normally. When I remove the mixed-precision arguments, training runs normally on the A100 and L4, but without any significant speed-up, or even slower than on the T4.
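For reference, a minimal sketch of the kind of training configuration I mean, using the standard TrainingArguments API (the output directory, batch size, and step counts below are placeholders, not my exact values):

```python
from transformers import TrainingArguments

# Placeholder values; not my exact configuration.
training_args = TrainingArguments(
    output_dir="finetune-out",
    per_device_train_batch_size=4,
    num_train_epochs=3,
    eval_strategy="steps",   # named "evaluation_strategy" on older transformers releases
    eval_steps=100,
    logging_steps=100,
    fp16=True,  # the mixed-precision flag that works on the T4 but not on the A100/L4
    # bf16=True is the Ampere/Ada-native alternative supported by the A100 and L4;
    # I have not established whether switching to it changes the behavior.
)
```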

Describe the expected behavior
I expect the Trainer API to run normally on the A100 runtime with mixed precision enabled, i.e. the training and validation loss should be computed across epochs, just as on the T4.

What web browser you are using
Chrome

Additional context
The following warning also appears when trainer.train() starts:

/usr/local/lib/python3.10/dist-packages/torch/autograd/graph.py:825: UserWarning: cuDNN SDPA backward got grad_output.strides() != output.strides(), attempting to materialize a grad_output with matching strides... (Triggered internally at ../aten/src/ATen/native/cudnn/MHA.cpp:674.)
  return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass

(screenshot attached)
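Since the warning points at the cuDNN SDPA backward kernel, here is a sketch of how the attention backend could be switched to rule that path out (standard PyTorch/Transformers options; the checkpoint name is a placeholder, and I have not verified this as a fix):

```python
import torch
from transformers import AutoModelForCausalLM

# Option 1: ask Transformers for a non-SDPA attention implementation.
# "my-base-model" is a placeholder for the actual checkpoint being fine-tuned.
model = AutoModelForCausalLM.from_pretrained(
    "my-base-model",
    torch_dtype=torch.float16,
    attn_implementation="eager",  # instead of the default "sdpa"
)

# Option 2: keep SDPA but disable its cuDNN backend specifically,
# if the installed PyTorch version exposes this toggle.
if hasattr(torch.backends.cuda, "enable_cudnn_sdp"):
    torch.backends.cuda.enable_cudnn_sdp(False)
```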

Link to a minimal, public, self-contained notebook that reproduces this issue.
To be added...

cperry-goog commented 2 weeks ago

I don't know that this is a Colab issue; I'd wager a lot of it has to do with NVIDIA support for the modules you're using. Do you have data suggesting this is not a hardware issue?