Reminder
System Info
model
model_name_or_path: /mnt/nas/shanzhi/eval_models/Qwen2-7B
model_name_or_path: /mnt/nas/liyadong/sft_models/checkpoint-3945
enable_liger_kernel: true
use_unsloth_gc: true
method
stage: dpo
do_train: true
finetuning_type: full
pref_beta: 0.01
dpo_label_smoothing: 0.05
pref_loss: sigmoid  # choices: [sigmoid (dpo), orpo, simpo]
optim: paged_adamw_32bit
dataset
dataset: ultrafeedback_binarized_train_dpo,multilingual_ultrafeedback_binarized_train_dpo
template: qwen
cutoff_len: 11008
packing: true
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 128
output
output_dir: /mnt/nas/liyadong/sft_models/qwen2_72b_dpo_ct_sft_dpo
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true
train
per_device_train_batch_size: 1
gradient_accumulation_steps: 1
learning_rate: 5.0e-7
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.03
flash_attn: fa2
bf16: true
repetition_penalty: 1.2
neftune_noise_alpha: 5
ddp
deepspeed: examples/deepspeed/ds_z3_config.json
ddp_timeout: 180000000
eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500
Reproduction
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 8.76 GiB. GPU 6 has a total capacity of 79.35 GiB of which 2.19 MiB is free. Process 1438 has 78.69 GiB memory in use. Of the allocated memory 59.53 GiB is allocated by PyTorch, and 17.67 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
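As a first step, the error message itself suggests trying the expandable-segments allocator setting to reduce fragmentation (17.67 GiB here is reserved but unallocated). This is a sketch of how that variable can be exported before launching the training command, not a guaranteed fix:

```shell
# Allocator hint quoted in the OOM message: let the CUDA caching allocator
# grow segments instead of reserving fixed-size blocks, which can reduce
# fragmentation of reserved-but-unallocated memory.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
echo "$PYTORCH_CUDA_ALLOC_CONF"
```

The variable must be set in the environment of the training process (e.g. before the launcher command), since PyTorch reads it at allocator initialization.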
Expected behavior
paged_adamw_32bit, use_unsloth_gc, and enable_liger_kernel are already enabled, and the batch size is already at the minimum (per_device_train_batch_size: 1), yet training still runs out of memory.
Others
No response