Open qy1026 opened 5 days ago
I had the same problem using a slightly different config:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch \
--config_file ./examples/accelerate/fsdp_config.yaml \
./src/train_bash.py \
--stage dpo \
--do_train \
--model_name_or_path my_model \
--dataset my_dataset \
--template maritalk \
--split train \
--finetuning_type full \
--dpo_beta=0.1 \
--cutoff_len 2048 \
--max_length 2048 \
--max_new_tokens 2048 \
--output_dir ~/my_output \
--overwrite_cache \
--overwrite_output_dir \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 1 \
--lr_scheduler_type cosine \
--logging_steps 10 \
--save_steps 500 \
--learning_rate 1e-5 \
--warmup_ratio=0.1 \
--num_train_epochs 3.0 \
--plot_loss \
--fp16 \
--use_fast_tokenizer=True
System Info
pass
Reproduction
Error message:
[rank1]: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:1! (when checking argument for argument index in method wrapper_CUDA__index_select)
[rank0]: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)
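One way to narrow down an error like this is to list which tensors live on which device just before the failing forward pass. The sketch below is a minimal diagnostic, assuming plain PyTorch; the helper name `summarize_devices` is hypothetical and not part of the training script, and the `meta` device only stands in for the mismatched device so the example runs without GPUs. It does not identify the root cause here, and under FSDP the parameter names will reflect the wrapped modules.

```python
from collections import defaultdict

import torch
from torch import nn


def summarize_devices(model: nn.Module, batch: dict) -> dict:
    """Group parameter, buffer, and batch tensor names by the device they live on."""
    placement = defaultdict(list)
    for name, param in model.named_parameters():
        placement[str(param.device)].append(f"param:{name}")
    for name, buf in model.named_buffers():
        placement[str(buf.device)].append(f"buffer:{name}")
    for key, value in batch.items():
        if isinstance(value, torch.Tensor):
            placement[str(value.device)].append(f"batch:{key}")
    return dict(placement)


# Stand-in example: the embedding weight is on "meta", the input ids on CPU.
embedding = nn.Embedding(10, 4, device="meta")
batch = {"input_ids": torch.tensor([[1, 2, 3]])}  # created on CPU by default
report = summarize_devices(embedding, batch)
for device, names in report.items():
    print(device, names)
```

If the report shows more than one device for the model that receives the batch (for example, weights on `cpu` and inputs on `cuda:0`), that is the pair of tensors to look at first.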
Expected behavior
No response
Others
llama3_full_dpo_fsdp.yaml