huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

Training is slower after using generate on unwrapped model #2846

Open ylacombe opened 3 weeks ago

ylacombe commented 3 weeks ago

System Info

- `Accelerate` version: 0.30.1
- Platform: Linux-5.15.0-1048-aws-x86_64-with-glibc2.31
- `accelerate` bash location: /fsx/yoach/env_stable_speech/bin/accelerate
- Python version: 3.9.16
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.1.2+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- System RAM: 1999.99 GB
- GPU type: NVIDIA H100 80GB HBM3
- `Accelerate` default config:
    Not found

Reproduction

  1. Use the following script
  2. Run `accelerate launch run.py --mixed_precision "bf16"`

I'm using 2 GPUs here.

Expected behavior

I'm currently using accelerate to train my own LLMs! When generating during evaluation to check the model quality, I've observed much slower training after having generated once.

As you can see in the logs (here's an example), training is much slower after `generate_step` has been used: [screenshot of training logs]

When I remove the generation part, training is as fast as expected.

cc @SunMarc and @muellerzr !

SunMarc commented 1 week ago

Issue solved offline! Could you share the answer, @ylacombe, when you have a bit of time?