Closed · 807660937 closed this issue 4 months ago

Reproduction
```bash
USE_MODELSCOPE_HUB=1 CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.run \
    --nproc_per_node 8 \
    --nnodes 1 \
    --standalone \
    src/train.py examples/water/0508_wa_llama3_8b_lora_sft.yaml
```
```yaml
# model
model_name_or_path: LLM-Research/Meta-Llama-3-8B-Instruct

# method
stage: sft
do_train: true
finetuning_type: lora
lora_target: q_proj,v_proj

# ddp
ddp_timeout: 180000000
deepspeed: examples/deepspeed/ds_z3_config.json

# dataset
dataset: identity_water,alpaca_gpt4_en,alpaca_gpt4_zh,lima,glaive_toolcall,oaast_sft_zh,ruozhiba,identity_water
template: llama3
cutoff_len: 8192
max_samples:
val_size: 0.01
overwrite_cache: true
preprocessing_num_workers: 32

# output
output_dir: saves/LLM-Research/Meta-Llama-3-8B-Instruct/lora/sft_wa_0508
logging_steps: 4
save_steps: 200
plot_loss: true
overwrite_output_dir: true

# train
per_device_train_batch_size: 6
gradient_accumulation_steps: 8
learning_rate: 0.0001
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_steps: 0.1
bf16: true

# eval
per_device_eval_batch_size: 1
evaluation_strategy: steps
eval_steps: 100
```
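For reference, a rough back-of-the-envelope reading of the config above (derived numbers, not quoted from the report):

```yaml
# effective batch size per optimizer step:
#   per_device_train_batch_size (6) × gradient_accumulation_steps (8) × 8 GPUs = 384 samples
# worst-case tokens per device per micro-batch, if every sample reaches cutoff_len:
#   6 × 8192 = 49152 tokens
```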
The OOM reproduces almost identically across two runs, and it looks as if GPU memory usage keeps growing over time:

```
15%|█▍ | 129/882 [40:17<3:41:05, 17.62s/it]Traceback (most recent call last):
 8%|▊ | 55/663 [24:56<5:05:23, 30.14s/it]Traceback (most recent call last):
```
Training reliably hits an out-of-memory (OOM) error after running for a while.
Expected behavior

No response

System Info

No response

Others

No response
Lower the batch size. GPU memory usage fluctuates because sequence lengths differ from batch to batch.
OK, I will try that. So cut_seqlen (i.e. cutoff_len) only truncates sequences that are too long? In other words, no padding is applied, right?
Padding would slow down training significantly.
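A minimal sketch of the suggested adjustment, applied to the train section of the config above; the exact values below are illustrative assumptions, chosen so that the effective batch size stays at 384:

```yaml
# train (illustrative values, not from the thread)
per_device_train_batch_size: 3   # halved from 6 to lower the peak memory of each micro-batch
gradient_accumulation_steps: 16  # doubled from 8, so 3 × 16 × 8 GPUs = 384 samples per step as before
```

Since cutoff_len only truncates over-long samples and, as the exchange above suggests, inputs are not padded out to a fixed length, the memory high-water mark is set by whichever micro-batches happen to contain the longest sequences; a smaller per-device batch lowers that worst case.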