Reminder
System Info
Training environment: 2 × 910B NPUs, 32 GB each
Launch script:
export MASTER_HOST="$VC_WORKER_HOSTS"
export MASTER_ADDR="${VC_WORKER_HOSTS%%,*}"
export MASTER_ADDR=$(ping "$MASTER_ADDR" -c 1 | sed '1{s/[^(]*(//;s/).*//;q}')
export NNODES="$MA_NUM_HOSTS"
export RANK="$VC_TASK_INDEX"
export NODE_RANK="$VC_TASK_INDEX"
export MASTER_PORT=30008
export NGPUS_PER_NODE="$MA_NUM_GPUS"
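For reference, the %%,* expansion keeps only the first entry of the comma-separated host list, which is what the MASTER_ADDR line relies on. A minimal shell check (the hostnames are made up for illustration):

VC_WORKER_HOSTS="worker-0.example,worker-1.example"
echo "${VC_WORKER_HOSTS%%,*}"   # prints: worker-0.example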
llamafactory-cli env
llamafactory-cli train -h
llamafactory-cli train examples/lora_multi_npu/llama3_lora_sft_ds_8_wph.yaml
Error log:
new_value = value.to(device)
RuntimeError: NPU out of memory. Tried to allocate 130.00 MiB (NPU 0; 29.50 GiB total capacity; 28.45 GiB already allocated; 28.45 GiB current active; 31.40 MiB free; 28.67 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
RuntimeError: NPU out of memory. Tried to allocate 130.00 MiB (NPU 2; 29.50 GiB total capacity; 28.45 GiB already allocated; 28.45 GiB current active; 39.18 MiB free; 28.67 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
RuntimeError: NPU out of memory. Tried to allocate 130.00 MiB (NPU 3; 29.50 GiB total capacity; 28.45 GiB already allocated; 28.45 GiB current active; 35.10 MiB free; 28.67 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
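The allocator hint at the end of each message can be applied on Ascend via an environment variable; a minimal sketch, assuming torch_npu honors PYTORCH_NPU_ALLOC_CONF as the analogue of PyTorch's PYTORCH_CUDA_ALLOC_CONF (the value 128 is illustrative, not tuned):

# Assumption: torch_npu reads PYTORCH_NPU_ALLOC_CONF the way stock PyTorch reads PYTORCH_CUDA_ALLOC_CONF
export PYTORCH_NPU_ALLOC_CONF="max_split_size_mb:128"

Note, however, that the log shows only 31-39 MiB free against 29.50 GiB of capacity on every card, so this setting can only mitigate fragmentation; it cannot make a model fit that already consumes nearly all device memory.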
Reproduction
Launch script (YAML configuration):
### model
model_name_or_path: /model_dir/Qwen1.5-72B-Chat

### method
stage: sft
do_train: true
finetuning_type: lora
lora_target: all

### ddp
ddp_timeout: 180000000
deepspeed: examples/deepspeed/ds_z0_config.json

### dataset
dataset: identity,alpaca_en_demo
template: qwen
cutoff_len: 4096
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: /dataset/data_dir/qwen2_sft/test
logging_steps: 1
save_steps: 200
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true

### eval
do_eval: true
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500
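For context on the OOM: ds_z0_config.json is ZeRO stage 0, so every NPU holds a full replica of the model, and Qwen1.5-72B in bf16 already needs roughly 72B parameters × 2 bytes ≈ 144 GB for the weights alone, far beyond the 32 GB per card reported in the log. A hedged alternative for the ddp section, assuming the ZeRO-3 example config that LLaMA-Factory ships under examples/deepspeed/ (path not verified against this checkout):

### ddp
ddp_timeout: 180000000
# Assumption: the stock ZeRO-3 example, which partitions parameters, gradients,
# and optimizer states across ranks instead of replicating the full model
deepspeed: examples/deepspeed/ds_z3_config.json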
Expected behavior
Fine-tuning is expected to complete successfully.
Others
No response