hiyouga / LLaMA-Factory

Efficiently Fine-Tune 100+ LLMs in WebUI (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

910B 280T, 8 machines with 64 cards, running Qwen2-7B with cutoff_len: 8192, constantly OOM #4572

Closed. wphtrying closed this issue 2 months ago.

wphtrying commented 2 months ago

Reminder

System Info

CANN | cann_8.0.rc1
-- | --
PyTorch | pytorch_2.1.0
PyTorch_npu | 2.1.0.post3-20240413

Reproduction

model

model_name_or_path: ${model_dir}/model/
cache_dir: /cache
flash_attn: sdpa
gradient_checkpointing: true
logging_dir: ${output_dir}/log

method

stage: sft
finetuning_type: full

ddp

ddp_timeout: 180000000
deepspeed: examples/deepspeed/ds_z3_offload_config.json

dataset

dataset_dir: ${data_dir}/gaokao/
dataset: long
template: qwen
cutoff_len: 8192
packing: true
max_samples: 1000  # comment this out when training on the full dataset
overwrite_cache: true
preprocessing_num_workers: 16

output

save_strategy: epoch  # save once per epoch
save_steps: 0.5  # only takes effect when save_strategy is steps; a value between 0 and 1 is a fraction of total steps, >1 is an absolute step count
output_dir: ${data_dir}/
logging_steps: 1
plot_loss: true
overwrite_output_dir: true
save_only_model: false  # if true, the optimizer state is not saved and training cannot be resumed

train

do_train: true
resume_from_checkpoint: false
per_device_train_batch_size: 1
gradient_accumulation_steps: 1
learning_rate: 2.0e-5
num_train_epochs: 4.0
lr_scheduler_type: cosine
bf16: true
warmup_ratio: 0.05  # learning-rate warmup over 5% of the total steps
report_to: tensorboard

eval

eval_strategy: steps
val_size: 0.01  # fraction of the data used for evaluation
per_device_eval_batch_size: 1
eval_steps: 0.01  # evaluate every 1% of total steps, 100 points in total

RuntimeError: NPU out of memory. Tried to allocate 32.00 GiB (NPU 4; 29.50 GiB total capacity; 16.53 GiB already allocated; 16.53 GiB current active; 11.74 GiB free; 16.99 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
RuntimeError: NPU out of memory. Tried to allocate 32.00 GiB (NPU 6; 29.50 GiB total capacity; 16.53 GiB already allocated; 16.53 GiB current active; 11.74 GiB free; 16.99 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.

Expected behavior

No response

Others

No response

hiyouga commented 2 months ago

Use DeepSpeed ZeRO-3 and reduce the cutoff len.
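A minimal sketch of that adjustment against the config above, keeping the ZeRO-3 offload file already in use and only lowering the sequence length; the value 4096 is an assumption, not a figure given by the maintainer:

# ddp (unchanged, already ZeRO-3 with CPU offload)
deepspeed: examples/deepspeed/ds_z3_offload_config.json

# dataset
cutoff_len: 4096  # assumption: halved from 8192; lower further if OOM persists
packing: true

With packing enabled, every training sample is filled up to cutoff_len tokens, so activation memory scales directly with that value. Note also that the max_split_size_mb hint in the traceback only mitigates allocator fragmentation and cannot help here, since a single 32 GiB allocation already exceeds the 29.5 GiB capacity of one card.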