huggingface / trl

Train transformer language models with reinforcement learning.
http://hf.co/docs/trl
Apache License 2.0

CPU/CUDA device error with `supervised_finetuning.py` #338

Closed: kl2004 closed this issue 1 year ago

kl2004 commented 1 year ago

Hi all, I tried the latest version of supervised_finetuning.py and ran into this error:

ValueError: DistributedDataParallel's input module must be on the same type of devices, but input module 
parameters locate in {'cpu', 'cuda'}. 
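For context, DistributedDataParallel raises this error when a model's parameters are spread across both CPU and GPU. A minimal sketch (not from the thread; the model here is hypothetical) that reproduces the mixed-device state it is complaining about:

```python
import torch
import torch.nn as nn

# Hypothetical two-layer model, used only to illustrate the failure mode.
model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 4))

# Move only the first layer to the GPU (if one is present), leaving the
# second layer on the CPU -- the split that DDP rejects.
if torch.cuda.is_available():
    model[0].to("cuda")

# DDP requires this set to contain exactly one device type.
device_types = {p.device.type for p in model.parameters()}
print(device_types)  # {'cpu', 'cuda'} on a GPU machine, {'cpu'} otherwise
```

Common causes of ending up in this state include a partially applied `device_map` or an `accelerate`/`transformers` version mismatch, which is why upgrading resolves it below.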

The full command is:

torchrun --nnodes 1  --nproc_per_node 1 examples/stack_llama/scripts/supervised_finetuning.py \
--model_path gpt2 --streaming --no_gradient_checkpointing --learning_rate 1e-5 \
--max_steps 5000 --output_dir gpt2-se

CUDA is available on the machine:

>>> import torch
>>> torch.cuda.is_available()
True

Installed packages:

bitsandbytes             0.38.1
torch                    2.0.0
transformers             4.28.1
trl                      0.4.2.dev0

younesbelkada commented 1 year ago

Hi @kl2004, thanks for the issue. I managed to run the command you provided successfully. Can you make sure you upgrade your transformers version, for example by installing it from source?

pip install git+https://github.com/huggingface/transformers.git
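
As a quick sanity check after reinstalling (my addition, not from the thread), you can print the versions that are actually resolved in the active environment; a stale install in another environment is a common reason an upstream fix does not seem to take effect:

```python
import importlib.metadata as md

# Packages relevant to this issue; adjust the tuple as needed.
packages = ("torch", "transformers", "accelerate", "trl")

versions = {}
for pkg in packages:
    try:
        versions[pkg] = md.version(pkg)
    except md.PackageNotFoundError:
        versions[pkg] = None  # not installed in this environment

for pkg, ver in versions.items():
    print(f"{pkg}: {ver or 'not installed'}")
```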
kl2004 commented 1 year ago

Hi @younesbelkada, I've reinstalled transformers, accelerate, torch, and trl, and I can fine-tune gpt2 now. Thanks for your help!

For reference, these are the versions that work for me:

accelerate==0.19.0
torch==1.13.1
transformers @ git+https://github.com/huggingface/transformers.git@273f5ba0266b223c1d611bd00d4a4b2d58771a33
-e git+https://github.com/lvwerra/trl@31cc361d1749bb385e205b211f0c2f1f51e7bd26#egg=trl