huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

Training is slower after using generate on unwrapped model #2846

Open ylacombe opened 3 weeks ago

ylacombe commented 3 weeks ago

System Info

- `Accelerate` version: 0.30.1
- Platform: Linux-5.15.0-1048-aws-x86_64-with-glibc2.31
- `accelerate` bash location: /fsx/yoach/env_stable_speech/bin/accelerate
- Python version: 3.9.16
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.1.2+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- System RAM: 1999.99 GB
- GPU type: NVIDIA H100 80GB HBM3
- `Accelerate` default config:
    Not found

Reproduction

  1. Use the following script
  2. Run `accelerate launch run.py --mixed_precision "bf16"`

I'm using 2 GPUs here.

Expected behavior

I'm currently using accelerate to train my own LLMs! When generating during evaluation to check the model quality, I've observed much slower training after having generated once.

As you can see in the logs (here's an example), training is much slower after `generate_step` has been used: [screenshot of training logs]

When I remove the generation part, training is as fast as expected.

cc @SunMarc and @muellerzr !

SunMarc commented 1 week ago

Issue solved offline! Could you share the answer, @ylacombe, when you have a bit of time?