hiyouga / LLaMA-Factory

Unified Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

Single-node multi-GPU full fine-tuning fails with a timeout error #662

Closed acadaiaca closed 1 year ago

acadaiaca commented 1 year ago

Single-node multi-GPU full fine-tuning with DeepSpeed stage 3: every run hits a timeout error at around the 30-minute mark: timeout: 1692844090428

The cause turned out to be the dataset size: tokenization takes a very long time, while the default ddp_timeout in pytorch TORCH.DISTRIBUTED.DISTRIBUTED_C10D is 1800 s. I tried changing --ddp_timeout, but it has no effect; according to issues/17106, a torch bug makes this parameter ineffective. The remaining option seems to be to finish the tokenizer process on a single GPU first, save the result to a cache, and have the multi-GPU run load that cache, but I am not sure how to make that change. Has anyone else run into this, and is there a good way to handle it?
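
For reference, a minimal sketch of where that timeout setting would normally go, using the train_bash.py entry point and DeepSpeed config that appear later in this thread (model path and output dir are placeholders). As the linked pytorch issue notes, the larger value may simply be ignored on affected torch versions:

# Attempted workaround: raise the collective timeout (in seconds) well above
# the 1800 s default so the other ranks keep waiting while the main process
# tokenizes the dataset.
deepspeed src/train_bash.py \
    --deepspeed examples/deepspeed/ds_z3_config.json \
    --stage sft \
    --do_train \
    --finetuning_type full \
    --model_name_or_path path/to/model \
    --dataset alpaca_zh \
    --output_dir saves/full-sft \
    --ddp_timeout 21600 \
    --fp16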

hiyouga commented 1 year ago

Use dataset streaming.
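
A hedged sketch of the streaming switch; --streaming and --max_steps here are assumptions based on how an iterable (length-less) dataset is usually driven, so check the project docs for the exact flags:

# Tokenize on the fly instead of in one long preprocessing pass before
# training. An iterable dataset has no known length, so the number of
# training steps has to be given explicitly.
deepspeed src/train_bash.py \
    --deepspeed examples/deepspeed/ds_z3_config.json \
    --stage sft \
    --do_train \
    --finetuning_type full \
    --model_name_or_path path/to/model \
    --dataset alpaca_zh \
    --output_dir saves/full-sft \
    --streaming \
    --max_steps 10000 \
    --fp16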

acadaiaca commented 1 year ago

Many thanks to the project author — both the reply and the fix were incredibly fast! After switching to dataset streaming I got RuntimeError: Sizes of tensors must match except in dimension 0. Following issues/463, I added --dispatch_batches False, but transformers then complained that this argument does not exist. It turned out the missing argument was a transformers issue; after manually building and installing the latest 4.33 release of transformers, all of the problems above were resolved. That said, full fine-tuning with streaming feels very slow.
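
A sketch of the two changes described above, keeping the rest of the arguments as in the streaming run; --dispatch_batches is only recognized by sufficiently new transformers releases (around v4.33), which is why the older install rejected it. Today a plain upgrade should be enough instead of building from source:

pip install "transformers>=4.33.0"

# With streaming plus multiple GPUs, let every rank draw its own batches
# instead of having one process dispatch slices of a single batch, which is
# reportedly what triggers the tensor-size mismatch with variable-length
# sequences.
deepspeed src/train_bash.py \
    --deepspeed examples/deepspeed/ds_z3_config.json \
    --stage sft \
    --do_train \
    --finetuning_type full \
    --model_name_or_path path/to/model \
    --dataset alpaca_zh \
    --output_dir saves/full-sft \
    --streaming \
    --max_steps 10000 \
    --dispatch_batches False \
    --fp16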

JunZhan2000 commented 10 months ago

> [...] That said, full fine-tuning with streaming feels very slow.

Hey, did you set NCCL_P2P_DISABLE=1 by any chance? That is what makes training extremely slow; it is not the streaming.

hiyouga commented 10 months ago

Update: it is now supported to preprocess the dataset on a single GPU and save it with --cache_path first, then load it via --cache_path in the multi-GPU run.
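
A rough sketch of the two-phase flow this describes, with placeholder paths and the flag names as given in this thread. Depending on the version, the first run may stop on its own once the dataset is saved, or it may continue into single-GPU training and can simply be interrupted after the cache is written:

# Phase 1: one process tokenizes the dataset and saves it to --cache_path.
CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage sft \
    --do_train \
    --finetuning_type full \
    --model_name_or_path path/to/model \
    --dataset alpaca_zh \
    --cache_path saves/cache/alpaca_zh_sft \
    --output_dir saves/full-sft

# Phase 2: the multi-GPU run points at the same --cache_path and loads the
# preprocessed dataset instead of tokenizing again.
deepspeed src/train_bash.py \
    --deepspeed examples/deepspeed/ds_z3_config.json \
    --stage sft \
    --do_train \
    --finetuning_type full \
    --model_name_or_path path/to/model \
    --dataset alpaca_zh \
    --cache_path saves/cache/alpaca_zh_sft \
    --output_dir saves/full-sft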

linchen111 commented 10 months ago

> Update: it is now supported to preprocess the dataset on a single GPU and save it with --cache_path first, then load it via --cache_path in the multi-GPU run.

Hi, so for the multi-GPU run after the single-GPU pass, overwrite_cache should not be set, right?

hiyouga commented 10 months ago

> Hi, so for the multi-GPU run after the single-GPU pass, overwrite_cache should not be set, right?

It makes no difference either way.

linchen111 commented 10 months ago

> It makes no difference either way.

Thanks for the reply. Then what does "Overwrite the cached training and evaluation sets." actually mean?

EasonXiao-888 commented 8 months ago

> tokenizer

Hi, could you explain how exactly to process the dataset on a single GPU and store it in the cache first, then run on multiple GPUs? Are processing and training executed as two separate runs?

hiyouga commented 8 months ago

@EasonXiao-888 You launch the script the same way as in the readme; only the arguments differ.

shepherd233 commented 5 months ago

> @EasonXiao-888 You launch the script the same way as in the readme; only the arguments differ.

Hi, a question: I have around 100B of data and want to preprocess it on a single GPU into the cache, then train on multiple GPUs, but every time preprocessing reaches about 400 million examples it fails with: one of the subprocesses has abruptly died during map operation. Streaming mode is about twice as slow as reading from the cache. Is there a good way around this? Thanks.

seanzhang-zhichen commented 5 months ago

> Hi, a question: I have around 100B of data [...] every time preprocessing reaches about 400 million examples it fails with: one of the subprocesses has abruptly died during map operation [...]

I am running into the same problem.

hiyouga commented 5 months ago

Preprocess the data on the CPU first: https://github.com/hiyouga/LLaMA-Factory/blob/main/examples/lora_single_gpu/prepare.sh
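
Roughly what the linked prepare.sh amounts to, assuming the --cache_path flag discussed earlier in this thread (the script itself is the authoritative reference): hide every GPU so the whole preprocessing pass runs on CPU, then launch the multi-GPU job against the saved dataset as in the sketch above:

# An empty CUDA_VISIBLE_DEVICES makes torch see no GPUs, so this run only
# tokenizes the dataset and writes it to --cache_path.
CUDA_VISIBLE_DEVICES= python src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path path/to/model \
    --dataset alpaca_zh \
    --preprocessing_num_workers 16 \
    --cache_path saves/cache/alpaca_zh_sft \
    --output_dir saves/full-sft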

JeffRody commented 5 months ago

Hi, I am running into the same problem:

Traceback (most recent call last):
  File "/public/home/wanglch/project/LLaMA-Factory/src/train_bash.py", line 14, in <module>
    main()
  File "/public/home/wanglch/project/LLaMA-Factory/src/train_bash.py", line 5, in main
    run_exp()
  File "/public/home/wanglch/project/LLaMA-Factory/src/llmtuner/train/tuner.py", line 33, in run_exp
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "/public/home/wanglch/project/LLaMA-Factory/src/llmtuner/train/sft/workflow.py", line 33, in run_sft
    dataset = get_dataset(model_args, data_args, training_args, stage="sft", **tokenizer_module)
  File "/public/home/wanglch/project/LLaMA-Factory/src/llmtuner/data/loader.py", line 164, in get_dataset
    dataset = dataset.map(preprocess_func, batched=True, remove_columns=column_names, **kwargs)
  File "/public/home/wanglch/anaconda3/envs/factory/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 593, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/public/home/wanglch/anaconda3/envs/factory/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 558, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/public/home/wanglch/anaconda3/envs/factory/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3197, in map
    for rank, done, content in iflatmap_unordered(
  File "/public/home/wanglch/anaconda3/envs/factory/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 656, in iflatmap_unordered
    raise RuntimeError(
RuntimeError: One of the subprocesses has abruptly died during map operation. To debug the error, disable multiprocessing.

JeffRody commented 5 months ago

deepspeed --hostfile=./hostfiles/hostfile-dl-$SLURM_JOB_ID /public/home/wanglch/project/LLaMA-Factory/src/train_bash.py \
    --deepspeed /public/home/wanglch/project/LLaMA-Factory/examples/deepspeed/ds_z3_config.json \
    --stage sft \
    --do_train \
    --model_name_or_path /public/home/wanglch/project/FinGPT/FinGPT_chatglm2-6b \
    --dataset alpaca_zh \
    --dataset_dir /public/home/wanglch/project/LLaMA-Factory/data \
    --template chatglm3 \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir /public/home/wanglch/project/FinGPT/saves/FinGPT_ChatGLM2-6b/lora/sft \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 1024 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 2 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --warmup_steps 20 \
    --save_steps 100 \
    --eval_steps 100 \
    --evaluation_strategy steps \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --max_samples 3000 \
    --val_size 0.1 \
    --ddp_timeout 180000000 \
    --plot_loss \
    --fp16

JeffRody commented 5 months ago

Hello, I am still getting this error. Could you advise how to resolve it?

RuntimeError: One of the subprocesses has abruptly died during map operation. To debug the error, disable multiprocessing.

EVEREST-dlk commented 3 months ago

> Hello, I am still getting this error. Could you advise how to resolve it?
>
> RuntimeError: One of the subprocesses has abruptly died during map operation. To debug the error, disable multiprocessing.

Hi, I ran into this problem as well. How did you end up solving it?