hiyouga / LLaMA-Factory

Unified Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

LLaVA_dpo won't run #5812

Open zsworld6 opened 1 month ago

zsworld6 commented 1 month ago

Reminder

System Info

Reproduction

llamafactory-cli train \
  --stage dpo \
  --do_train True \
  --model_name_or_path \
  --preprocessing_num_workers 16 \
  --finetuning_type full \
  --template llava \
  --flash_attn auto \
  --dataset_dir data \
  --dataset \
  --cutoff_len 1024 \
  --learning_rate 5e-07 \
  --num_train_epochs 3.0 \
  --max_samples 100000 \
  --per_device_train_batch_size 2 \
  --gradient_accumulation_steps 8 \
  --lr_scheduler_type cosine \
  --max_grad_norm 1.0 \
  --logging_steps 5 \
  --save_steps 100 \
  --warmup_steps 0 \
  --optim adamw_torch \
  --packing False \
  --report_to wandb \
  --output_dir \
  --bf16 True \
  --plot_loss True \
  --ddp_timeout 180000000 \
  --include_num_input_tokens_seen True \
  --lora_rank 8 \
  --lora_alpha 16 \
  --lora_dropout 0 \
  --lora_target all \
  --pref_beta 0.1 \
  --pref_ftx 0 \
  --pref_loss sigmoid \
  --deepspeed cache/ds_z3_config.json

Expected behavior

No response

Others

0%| | 0/840 [00:00<?, ?it/s]

Training hangs here and never makes progress, and the run eventually gets interrupted.

NathanaelTamirat commented 1 month ago

@zsworld6 did you solve this issue?

zsworld6 commented 1 month ago

> @zsworld6 did you solve this issue?

Not yet

alexlai2860 commented 1 month ago

Has this been solved yet?

zsworld6 commented 1 month ago

> Has this been solved yet?

No.

zuojie2024 commented 6 days ago

Try changing “--deepspeed cache/ds_z3_config.json” to “--deepspeed cache/ds_z0_config.json”.
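
For anyone hitting the same hang, here is a minimal sketch of what such a ZeRO-0 config could look like. It is modeled on the ds_z0_config.json example that LLaMA-Factory ships under examples/deepspeed/, not copied from it, so check your local copy of the repo for the canonical file:

{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 0
  }
}

Save it as cache/ds_z0_config.json (matching the path used in the command above) and pass it via --deepspeed. The trade-off: ZeRO-3 shards parameters, gradients, and optimizer states across GPUs, while ZeRO-0 keeps a full copy on every GPU, so this workaround needs more per-GPU memory but sidesteps the ZeRO-3 hang.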