Reminder
System Info
model
model_name_or_path: /mnt/nas/shanzhi/eval_models/Qwen2-7B
model_name_or_path: /mnt/nas/liyadong/sft_models/checkpoint-3945
enable_liger_kernel: true
use_unsloth_gc: true
method
stage: dpo
do_train: true
finetuning_type: full
pref_beta: 0.01
dpo_label_smoothing: 0.05
pref_loss: sigmoid  # choices: [sigmoid (dpo), orpo, simpo]
optim: paged_adamw_32bit
dataset
dataset: ultrafeedback_binarized_train_dpo,multilingual_ultrafeedback_binarized_train_dpo
template: qwen
cutoff_len: 11008
packing: true
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 128
output
output_dir: /mnt/nas/liyadong/sft_models/qwen2_72b_dpo_ct_sft_dpo
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true
train
per_device_train_batch_size: 1
gradient_accumulation_steps: 1
learning_rate: 5.0e-7
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.03
flash_attn: fa2
bf16: true
repetition_penalty: 1.2
neftune_noise_alpha: 5
ddp
deepspeed: examples/deepspeed/ds_z3_config.json
ddp_timeout: 180000000
eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500
Reproduction
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 8.76 GiB. GPU 6 has a total capacity of 79.35 GiB of which 2.19 MiB is free. Process 1438 has 78.69 GiB memory in use. Of the allocated memory 59.53 GiB is allocated by PyTorch, and 17.67 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
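As a first step, the error message itself suggests trying the expandable-segments allocator setting to reduce fragmentation (17.67 GiB here is reserved but unallocated). This is a sketch of how that variable can be exported before launching the training command, not a guaranteed fix:

```shell
# Allocator hint quoted in the OOM message: let the CUDA caching allocator
# grow segments instead of reserving fixed-size blocks, which can reduce
# fragmentation of reserved-but-unallocated memory.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
echo "$PYTORCH_CUDA_ALLOC_CONF"
```

The variable must be set in the environment of the training process (e.g. before the launcher command), since PyTorch reads it at allocator initialization.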
Expected behavior
paged_adamw_32bit, use_unsloth_gc, and enable_liger_kernel are already enabled, and the batch size is already at the minimum (per_device_train_batch_size: 1), yet training still runs out of memory.
Others
No response