huggingface / trl

Train transformer language models with reinforcement learning.
http://hf.co/docs/trl
Apache License 2.0

CPU/CUDA device error with `supervised_finetuning.py` #338

Closed: kl2004 closed this issue 1 year ago

kl2004 commented 1 year ago

Hi all, I tried the latest version of supervised_finetuning.py and ran into this error:

ValueError: DistributedDataParallel's input module must be on the same type of devices, but input module 
parameters locate in {'cpu', 'cuda'}. 
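For context, DistributedDataParallel raises this error when a model's parameters are spread across both CPU and GPU. A minimal sketch (not from the thread; the model here is hypothetical) that reproduces the mixed-device state it is complaining about:

```python
import torch
import torch.nn as nn

# Hypothetical two-layer model, used only to illustrate the failure mode.
model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 4))

# Move only the first layer to the GPU (if one is present), leaving the
# second layer on the CPU -- the split that DDP rejects.
if torch.cuda.is_available():
    model[0].to("cuda")

# DDP requires this set to contain exactly one device type.
device_types = {p.device.type for p in model.parameters()}
print(device_types)  # {'cpu', 'cuda'} on a GPU machine, {'cpu'} otherwise
```

Common causes of ending up in this state include a partially applied `device_map` or an `accelerate`/`transformers` version mismatch, which is why upgrading resolves it below.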

The full command is:

torchrun --nnodes 1  --nproc_per_node 1 examples/stack_llama/scripts/supervised_finetuning.py \
--model_path gpt2 --streaming --no_gradient_checkpointing --learning_rate 1e-5 \
--max_steps 5000 --output_dir gpt2-se

CUDA is available on the machine:

>>> import torch
>>> torch.cuda.is_available()
True

Installed packages:

bitsandbytes             0.38.1
torch                    2.0.0
transformers             4.28.1
trl                      0.4.2.dev0

younesbelkada commented 1 year ago

Hi @kl2004, thanks for the issue. I managed to run the command you provided successfully. Can you make sure you upgrade your transformers version, for example by installing it from source?

pip install git+https://github.com/huggingface/transformers.git
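
As a quick sanity check after reinstalling (my addition, not from the thread), you can print the versions that are actually resolved in the active environment; a stale install in another environment is a common reason an upstream fix does not seem to take effect:

```python
import importlib.metadata as md

# Packages relevant to this issue; adjust the tuple as needed.
packages = ("torch", "transformers", "accelerate", "trl")

versions = {}
for pkg in packages:
    try:
        versions[pkg] = md.version(pkg)
    except md.PackageNotFoundError:
        versions[pkg] = None  # not installed in this environment

for pkg, ver in versions.items():
    print(f"{pkg}: {ver or 'not installed'}")
```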
kl2004 commented 1 year ago

Hi @younesbelkada, I've reinstalled transformers, accelerate, torch, and trl, and I can fine-tune gpt2 now. Thanks for your help!

For reference, these are the versions that work for me:

accelerate==0.19.0
torch==1.13.1
transformers @ git+https://github.com/huggingface/transformers.git@273f5ba0266b223c1d611bd00d4a4b2d58771a33
-e git+https://github.com/lvwerra/trl@31cc361d1749bb385e205b211f0c2f1f51e7bd26#egg=trl