hiyouga / LLaMA-Factory

Efficiently Fine-Tune 100+ LLMs in WebUI (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

OOM error with 128× A800 80G, Qwen2 7B, cut_off 8192 #4805

Open BobTsang1995 opened 2 months ago

BobTsang1995 commented 2 months ago

Reminder

System Info

```yaml
### model
model_name_or_path: /mnt/nas/shanzhi/eval_models/Qwen2-72B

### method
stage: sft
do_train: true
finetuning_type: full

### ddp
ddp_timeout: 180000000
deepspeed: examples/deepspeed/ds_z3_offload_config.json

### dataset
dataset: model_filing_toxicity,tagengo_train_formatted,google_ift_data_v1,google_ift_data_v2,google_ift_data_v3,self_cognition_aib_multilingual,ultra_chat_200k_train_sft,glaive-code,MetaMathQA,MathInstruct,mh_org_CoT_collection_fr_remove_keywords,mh_org_CoT_collection_ja_remove_keywords,mh_org_CoT_collection_ko_remove_keywords,mh_org_CoT_collection_ru_remove_keywords,mh_org_CoT_collection_zh2_remove_keywords,mh_org_ar_remove_keywords,mh_org_bn_remove_keywords,mh_org_de_remove_keywords,mh_org_en_remove_keywords,mh_org_es_remove_keywords,mh_org_fr_remove_keywords,mh_org_he_remove_keywords,mh_org_id_remove_keywords,mh_org_ja_remove_keywords,mh_org_ko_remove_keywords,mh_org_my_remove_keywords,mh_org_nl_remove_keywords,mh_org_pl_remove_keywords,mh_org_pt_remove_keywords,mh_org_ru_remove_keywords,mh_org_ta_remove_keywords,mh_org_te_remove_keywords,mh_org_th_remove_keywords,mh_org_tr_remove_keywords,mh_org_ur_remove_keywords,mh_org_vi_remove_keywords,mh_org_zh_remove_keywords,mh_org_orca_remove_keywords_1,mh_org_orca_remove_keywords_2,openqa_dedup
template: qwen
cutoff_len: 8192
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 128

### output
output_dir: /mnt/nas/liyadong/sft_models/Qwen2-7B-alldata-packing-bs1024-lr4e-6-5epoch-32k
logging_steps: 10
save_steps: 500
save_strategy: epoch
plot_loss: true
overwrite_output_dir: true

### train
flash_attn: fa2
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 0.000004
num_train_epochs: 5.0
lr_scheduler_type: cosine
warmup_steps: 0.1
bf16: true
neftune_noise_alpha: 5
packing: true

### eval
val_size: 0.1
per_device_eval_batch_size: 1
evaluation_strategy: steps
eval_steps: 500
```

Reproduction

```
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 128.00 GiB. GPU 0 has a total capacty of 79.35 GiB of which 64.00 GiB is free. Process 1844 has 15.28 GiB memory in use. Of the allocated memory 9.63 GiB is allocated by PyTorch, and 4.61 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```

Expected behavior

No response

Others

No response

codemayq commented 2 months ago

First try a 2K cutoff_len to see whether the problem is that the samples are too long or that ZeRO-3 is not taking effect.
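
A minimal sketch of that diagnostic run, assuming everything else in the posted config stays as-is and only the cutoff is lowered:

```yaml
### dataset
template: qwen
cutoff_len: 2048   # lowered from 8192 just for the diagnostic run; restore afterwards
```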

Rocky77JHxu commented 2 months ago

(Just my personal understanding, since my A100 training also OOMs sometimes.) With per_device_train_batch_size=1, each GPU seems to draw one batch per dataset, so with this many dataset instances the effective batch size is actually quite high, and every GPU runs that same batch size, which can lead to OOM. You could try merging the many datasets into one and then raising per_device_train_batch_size, or leaving it unchanged.

For reference only; please point out anything I got wrong!

Rocky77JHxu commented 2 months ago

For example, when I had only a single dataset instance, I could pretrain Qwen1.5-32B with a batch size of 8 or even 10, also at an 8192 context. But after initializing three dataset instances, the batch size could only go up to 3; at 4 it OOMs right away, and that was with a cutoff of only 4096.

Rocky77JHxu commented 2 months ago

One more suggestion: with this much data, set eval_steps higher. I don't know whether it affects the training results, but evaluating every 500 steps means each eval takes a whole night... I can't take it.
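
As a sketch against the posted eval section (the 5000 below is only an illustrative value, not a recommendation from this thread):

```yaml
### eval
val_size: 0.1
per_device_eval_batch_size: 1
evaluation_strategy: steps
eval_steps: 5000   # illustrative: raised from 500 so evaluation runs far less often
```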

Syno8 commented 2 months ago

With this many GPUs, can ZeRO-3 really not train a 7B model at this sequence length?

oubeichen commented 1 month ago

> model_name_or_path: /mnt/nas/shanzhi/eval_models/Qwen2-72B

Why does your config say 72B here?

ShadowTeamCN commented 1 week ago

> model_name_or_path: /mnt/nas/shanzhi/eval_models/Qwen2-72B
>
> Why does your config say 72B here?

With a short 2k context, 3 machines with 8×80G GPUs each are actually enough to train it; something must be wrong somewhere in this 128-GPU, 16-machine setup.

oubeichen commented 1 week ago

> model_name_or_path: /mnt/nas/shanzhi/eval_models/Qwen2-72B
> Why does your config say 72B here?
>
> With a short 2k context, 3 machines with 8×80G GPUs each are actually enough to train it; something must be wrong somewhere in this 128-GPU, 16-machine setup.

I suspect he mixed things up: he says 7B, but the config says 72B, so he may actually have been training 72B. And 128 A800s should be massive overkill for a model of this size anyway; even 72B shouldn't fail to run.
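
If that mix-up is indeed the cause, the fix would just be pointing the config at the 7B checkpoint; a sketch, where the 7B path is hypothetical:

```yaml
### model
model_name_or_path: /mnt/nas/shanzhi/eval_models/Qwen2-7B   # hypothetical path; the posted config points at Qwen2-72B
```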

ShadowTeamCN commented 1 week ago

> model_name_or_path: /mnt/nas/shanzhi/eval_models/Qwen2-72B
> Why does your config say 72B here?
>
> With a short 2k context, 3 machines with 8×80G GPUs each are actually enough to train it; something must be wrong somewhere in this 128-GPU, 16-machine setup.
>
> I suspect he mixed things up: he says 7B, but the config says 72B, so he may actually have been training 72B. And 128 A800s should be massive overkill for a model of this size anyway; even 72B shouldn't fail to run.

Yep, 3 machines with 8 A800s each are already enough to run Qwen 72B.

DJinsis commented 1 week ago

@ShadowTeamCN I'd like to ask: how should the launch script for two machines with 8 GPUs each be written? Is this enough?

```bash
# on node 0
FORCE_TORCHRUN=1 NNODES=2 RANK=0 MASTER_ADDR=192.168.0.1 MASTER_PORT=29500 llamafactory-cli train examples/train_full/llama3_full_sft_ds3.yaml
# on node 1
FORCE_TORCHRUN=1 NNODES=2 RANK=1 MASTER_ADDR=192.168.0.1 MASTER_PORT=29500 llamafactory-cli train examples/train_full/llama3_full_sft_ds3.yaml
```

ShadowTeamCN commented 4 days ago

> @ShadowTeamCN I'd like to ask: how should the launch script for two machines with 8 GPUs each be written? Is this enough?
>
> FORCE_TORCHRUN=1 NNODES=2 RANK=0 MASTER_ADDR=192.168.0.1 MASTER_PORT=29500 llamafactory-cli train examples/train_full/llama3_full_sft_ds3.yaml
> FORCE_TORCHRUN=1 NNODES=2 RANK=1 MASTER_ADDR=192.168.0.1 MASTER_PORT=29500 llamafactory-cli train examples/train_full/llama3_full_sft_ds3.yaml

That looks fine. Just make sure the IP address is reachable and the firewall port is open. Note that 192.168.0.1 is usually a router's gateway address, so run ifconfig on your own machine to confirm it is actually that machine's LAN address.