hiyouga / LLaMA-Factory

Unified Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

910B Qwen-72B LoRA fine-tuning OOM #4277

Closed wphtrying closed 4 months ago

wphtrying commented 5 months ago

Reminder

System Info

Training environment: 2 * 910B 32 GB NPU cards

Launch script:

export MASTER_HOST="$VC_WORKER_HOSTS"
export MASTER_ADDR="${VC_WORKER_HOSTS%%,*}"
export MASTER_ADDR=$(ping "$MASTER_ADDR" -c 1 | sed '1{s/[^(]*(//;s/).*//;q}')
export NNODES="$MA_NUM_HOSTS"
export RANK="$VC_TASK_INDEX"
export NODE_RANK="$VC_TASK_INDEX"
export MASTER_PORT=30008
export NGPUS_PER_NODE="$MA_NUM_GPUS"

llamafactory-cli env
llamafactory-cli train -h
llamafactory-cli train examples/lora_multi_npu/llama3_lora_sft_ds_8_wph.yaml

Error log:

RuntimeError: NPU out of memory. Tried to allocate 130.00 MiB (NPU 0; 29.50 GiB total capacity; 28.45 GiB already allocated; 28.45 GiB current active; 31.40 MiB free; 28.67 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
    new_value = value.to(device)
    new_value = value.to(device)
RuntimeError: NPU out of memory. Tried to allocate 130.00 MiB (NPU 2; 29.50 GiB total capacity; 28.45 GiB already allocated; 28.45 GiB current active; 39.18 MiB free; 28.67 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
RuntimeError: NPU out of memory. Tried to allocate 130.00 MiB (NPU 3; 29.50 GiB total capacity; 28.45 GiB already allocated; 28.45 GiB current active; 35.10 MiB free; 28.67 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
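The traceback's own suggestion is the allocator's max_split_size_mb setting, which only mitigates fragmentation and cannot make room for weights that do not fit at all; a minimal sketch of trying that hint anyway, assuming this torch_npu build reads the PYTORCH_NPU_ALLOC_CONF environment variable (an assumption, not verified here):

# assumption: torch_npu's caching allocator honors PYTORCH_NPU_ALLOC_CONF
# the same way the CUDA allocator honors PYTORCH_CUDA_ALLOC_CONF
export PYTORCH_NPU_ALLOC_CONF="max_split_size_mb:128"
llamafactory-cli train examples/lora_multi_npu/llama3_lora_sft_ds_8_wph.yaml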

Reproduction

YAML configuration used by the launch script:

### model
model_name_or_path: /model_dir/Qwen1.5-72B-Chat

### method
stage: sft
do_train: true
finetuning_type: lora
lora_target: all

### ddp
ddp_timeout: 180000000
deepspeed: examples/deepspeed/ds_z0_config.json

### dataset
dataset: identity,alpaca_en_demo
template: qwen
cutoff_len: 4096
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: /dataset/data_dir/qwen2_sft/test
logging_steps: 1
save_steps: 200
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true

### eval
do_eval: true
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500
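One note on the configuration above: deepspeed points at ds_z0_config.json, i.e. ZeRO stage 0, which keeps a full replica of the frozen 72B weights on every 32 GB card. Partitioning the weights with ZeRO-3 (optionally with CPU offload) spreads that cost across all cards instead; a minimal sketch, assuming the stock examples/deepspeed/ds_z3_offload_config.json shipped with the repo is present in your checkout:

# swap the ZeRO-0 config for ZeRO-3 with CPU offload in the training YAML, then relaunch
sed -i 's#examples/deepspeed/ds_z0_config.json#examples/deepspeed/ds_z3_offload_config.json#' \
    examples/lora_multi_npu/llama3_lora_sft_ds_8_wph.yaml
llamafactory-cli train examples/lora_multi_npu/llama3_lora_sft_ds_8_wph.yaml

Even fully partitioned, bf16 weights for a 72B model are roughly 144 GB, so whether they fit still depends on how many 32 GB cards are actually visible to the job.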

Expected behavior

The fine-tuning run is expected to complete successfully.

Others

No response

hiyouga commented 4 months ago

Your hardware is not enough to fine-tune a 72B model.
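For scale: the frozen Qwen1.5-72B weights alone take about 72e9 parameters × 2 bytes ≈ 144 GB in bf16, and the ZeRO-0 setup above requires a full copy on each 29.5 GiB device, so every card runs out while the checkpoint is still being moved onto it (the repeated new_value = value.to(device) frames in the log), before any LoRA computation starts.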

wphtrying commented 4 months ago

Your hardware is not enough to fine-tune a 72B model.

How many nodes would be needed to run the fine-tuning?