Segmentation fault when Lora DPO with Phi-3-mini-128k-instruct

Reminder

[X] I have read the README and searched the existing issues.

Reproduction

CUDA_VISIBLE_DEVICES=4,5,6,7 llamafactory-cli train phi-3-mini-128k-dpo-0518.yaml

### model
model_name_or_path: microsoft/Phi-3-mini-128k-instruct

### method
stage: dpo
do_train: true
finetuning_type: lora
lora_target: all #qkv_proj
lora_rank: 16
lora_alpha: 16
dpo_ftx: 1.0

### dataset
dataset: orca_pairs
template: phi
cutoff_len: 4096
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: ~/workspace/saves/Phi-3-mini-128k-instruct/lora/sft/dpo-0518
logging_steps: 1
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 0.000005
num_train_epochs: 4.0
lr_scheduler_type: cosine
warmup_steps: 0.1
fp16: true

### eval
val_size: 0.1
per_device_eval_batch_size: 1
evaluation_strategy: steps
eval_steps: 500

### report
report_to: wandb
run_name: phi-3-mini-128k-dpo-0518

Expected behavior

No response

System Info

Copy-and-paste the text below in your GitHub issue and FILL OUT the two last points.

transformers version: 4.41.0
Platform: Linux-6.2.0-37-generic-x86_64-with-glibc2.35
Python version: 3.10.14
Huggingface_hub version: 0.23.0
Safetensors version: 0.4.3
Accelerate version: 0.30.1
Accelerate config: not found
PyTorch version (GPU?): 2.3.0+cu121 (True)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using GPU in script?:
Using distributed or parallel set-up in script?:

Others

  0%|                                                                                                         | 0/112 [00:00<?, ?it/s]
You are not running the flash-attention implementation, expect numerical differences.
~/miniconda3/envs/llama-factory/lib/python3.10/site-packages/torch/utils/checkpoint.py:91: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn(
Segmentation fault (core dumped)

hiyouga / LLaMA-Factory

Segmentation fault when Lora DPO with Phi-3-mini-128k-instruct #3801

Reminder

Reproduction

Expected behavior

System Info

Others