Describe the current behavior
I just purchased Colab Pro+, as I need faster GPUs for fine-tuning an LLM for causal language modeling with HuggingFace's Trainer API. Everything works fine on the T4 GPU with mixed precision (fp16 passed as a training argument), which significantly speeds up training. However, when I run the same code on the A100 or L4 runtimes, the Trainer (with exactly the same config) starts, but it does not report the train & validation loss at the evaluation steps, which it does on the T4. When I remove the mixed-precision arguments, it runs normally on the A100 and L4, but without a significant speed-up, or even slower than on the T4.
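For reference, a minimal sketch of the kind of setup I'm running (the model, dataset variables, and hyperparameters here are placeholders, not my actual config):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

# Placeholder model; the real run uses my own checkpoint and fine-tuning data.
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

args = TrainingArguments(
    output_dir="out",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    eval_strategy="epoch",      # called "evaluation_strategy" on older transformers versions
    logging_strategy="epoch",
    fp16=True,                  # the flag that triggers the problem on A100/L4
    # bf16=True,                # A100/L4 also support bfloat16; worth trying instead of fp16
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,     # placeholder dataset variables
    eval_dataset=val_ds,
)
trainer.train()
```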
Describe the expected behavior
I want the Trainer API to run normally on the A100 runtime with mixed precision enabled, i.e. the train and validation loss are calculated and reported over the epochs.
What web browser you are using
Chrome
Additional context
There is also this warning message that appears when trainer.train() starts:
/usr/local/lib/python3.10/dist-packages/torch/autograd/graph.py:825: UserWarning: cuDNN SDPA backward got grad_output.strides() != output.strides(), attempting to materialize a grad_output with matching strides... (Triggered internally at ../aten/src/ATen/native/cudnn/MHA.cpp:674.) return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
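In case it is related: the warning points at the cuDNN SDPA backward kernel, so one experiment I can try (a sketch; enable_cudnn_sdp requires a recent PyTorch, and attn_implementation a recent transformers) is to steer the model away from that backend and see whether the losses reappear:

```python
import torch
from transformers import AutoModelForCausalLM

# Option 1: disable the cuDNN SDPA backend globally (PyTorch >= 2.4 exposes this toggle);
# scaled_dot_product_attention then falls back to the flash / memory-efficient / math kernels.
torch.backends.cuda.enable_cudnn_sdp(False)

# Option 2: load the model with a different attention implementation so the SDPA
# path (and its cuDNN kernel) is not used at all. "gpt2" is a placeholder model name.
model = AutoModelForCausalLM.from_pretrained(
    "gpt2",
    attn_implementation="eager",  # instead of the default "sdpa"
)
```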
Link to a minimal, public, self-contained notebook that reproduces this issue.
To be added...
I don't know that this is a Colab issue - I'd wager it has a lot to do with NVIDIA support for the modules you're using. Do you have data to suggest this is not an issue with the hardware?
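One way to gather that data, independent of Colab: run a single forward/backward pass under each autocast dtype on the same runtime and check whether the loss is finite in bf16 but NaN/inf in fp16. A sketch with a placeholder model, not the actual fine-tuning code:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; substitute the actual model being fine-tuned
model = AutoModelForCausalLM.from_pretrained(model_name).cuda()
tokenizer = AutoTokenizer.from_pretrained(model_name)

batch = tokenizer("Hello world", return_tensors="pt").to("cuda")
batch["labels"] = batch["input_ids"].clone()  # causal-LM loss needs labels

for dtype in (torch.float16, torch.bfloat16):
    model.zero_grad()
    with torch.autocast("cuda", dtype=dtype):
        loss = model(**batch).loss
    loss.backward()
    # If the fp16 loss is NaN/inf while bf16 is finite, the problem is
    # numerics/kernels rather than anything Colab-specific.
    print(dtype, loss.item(), torch.isfinite(loss).item())
```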