Reminder
System Info
| CANN | cann_8.0.rc1 |
| -- | -- |
| PyTorch | pytorch_2.1.0 |
| PyTorch_npu | 2.1.0.post3-20240413 |

Reproduction
```yaml
### model
model_name_or_path: ${model_dir}/model/
cache_dir: /cache
flash_attn: sdpa
gradient_checkpointing: true
logging_dir: ${output_dir}/log

### method
stage: sft
finetuning_type: full

### ddp
ddp_timeout: 180000000
deepspeed: examples/deepspeed/ds_z3_offload_config.json

### dataset
dataset_dir: ${data_dir}/gaokao/
dataset: long
template: qwen
cutoff_len: 8192
packing: true
max_samples: 1000  # comment out to train on the full dataset
overwrite_cache: true
preprocessing_num_workers: 16

### output
save_strategy: epoch  # save once per epoch
save_steps: 0.5  # takes effect when save_strategy is steps; a value in (0, 1) is a fraction of total steps, a value > 1 is an absolute step count
output_dir: ${data_dir}/
logging_steps: 1
plot_loss: true
overwrite_output_dir: true
save_only_model: false  # if true, the optimizer state is not saved and training cannot be resumed

### train
do_train: true
resume_from_checkpoint: false
per_device_train_batch_size: 1
gradient_accumulation_steps: 1
learning_rate: 2.0e-5
num_train_epochs: 4.0
lr_scheduler_type: cosine
bf16: true
warmup_ratio: 0.05  # learning-rate warmup over 5% of total steps
report_to: tensorboard

### eval
eval_strategy: steps
val_size: 0.01  # fraction of the dataset held out for evaluation
per_device_eval_batch_size: 1
eval_steps: 0.01  # evaluate every 1% of total steps, 100 evaluations in total
```
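For reference, the fractional `save_steps`/`eval_steps` values described in the comments are resolved against the total optimizer step count (values in (0, 1) become a ratio of total steps). A minimal sketch of that conversion; the 10,000-step total below is a made-up example, not taken from this run:

```python
import math

def resolve_steps(value: float, max_steps: int) -> int:
    """Interpret a step argument: values in (0, 1) are a ratio of
    max_steps, values >= 1 are taken literally."""
    if 0 < value < 1:
        return math.ceil(value * max_steps)
    return int(value)

# With a hypothetical run of 10,000 optimizer steps:
print(resolve_steps(0.01, 10_000))  # eval_steps: 0.01 -> evaluate every 100 steps
print(resolve_steps(0.5, 10_000))   # save_steps: 0.5  -> save at step 5000
```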
```
RuntimeError: NPU out of memory. Tried to allocate 32.00 GiB (NPU 4; 29.50 GiB total capacity; 16.53 GiB already allocated; 16.53 GiB current active; 11.74 GiB free; 16.99 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
RuntimeError: NPU out of memory. Tried to allocate 32.00 GiB (NPU 6; 29.50 GiB total capacity; 16.53 GiB already allocated; 16.53 GiB current active; 11.74 GiB free; 16.99 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
```
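Note that the failing allocation (32.00 GiB) is larger than the device's entire capacity (29.50 GiB), so the `max_split_size_mb` hint in the message cannot help here: allocator splitting only matters when enough free memory exists but is fragmented into small blocks. A quick sanity check of the figures from the traceback:

```python
# Figures taken from the OOM message above (all in GiB).
requested = 32.00
total_capacity = 29.50
free = 11.74

# The single requested block exceeds total device memory, so no
# allocator setting can satisfy it; the tensor itself must shrink
# (e.g. a smaller cutoff_len, packing disabled, or more aggressive
# ZeRO-3 offloading).
print(requested > total_capacity)  # True
print(requested > free)            # True
```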
Expected behavior
No response
Others
No response