modelscope / ms-swift

Use PEFT or Full-parameter to finetune 350+ LLMs or 90+ MLLMs. (Qwen2.5, GLM4v, Internlm2.5, Yi, Llama3.1, Llava-Video, Internvl2, MiniCPM-V-2.6, Deepseek, Baichuan2, Gemma2, Phi3-Vision, ...)
https://swift.readthedocs.io/zh-cn/latest/Instruction/index.html
Apache License 2.0
3.49k stars 299 forks

Training qwen vl fails with RuntimeError: Default process group has not been initialized, please make sure to call init_process_group. #663

Closed sunyrain closed 5 months ago

sunyrain commented 5 months ago

The script is as follows:

CUDA_VISIBLE_DEVICES=0,1 swift sft --model_type qwen-vl-chat --custom_train_dataset_path bar_100k.json --lora_target_modules ALL --train_dataset_sample -1 --num_train_epochs 3

The machine has two 4090 GPUs. Inference works, but fine-tuning does not. The error output is as follows:

root@autodl-container-33ab4c8e84-313ed797:~/autodl-tmp# CUDA_VISIBLE_DEVICES=0 swift sft --model_type qwen-vl-chat --custom_train_dataset_path bar_100k.json
run sh: python /root/miniconda3/lib/python3.10/site-packages/swift/cli/sft.py --model_type qwen-vl-chat --custom_train_dataset_path bar_100k.json
2024-04-06 20:47:11,258 - modelscope - INFO - PyTorch version 2.1.2+cu121 Found.
2024-04-06 20:47:11,258 - modelscope - INFO - Loading ast index from /root/.cache/modelscope/ast_indexer
2024-04-06 20:47:11,296 - modelscope - INFO - Loading done! Current index file version is 1.13.3, with md5 6dcfe7733d34883a96f5ce26ad8d6d2e and a total number of 972 components indexed
[INFO:swift] Start time of running main: 2024-04-06 20:47:11.834846
[INFO:swift] Setting template_type: qwen
[INFO:swift] Setting args.lazy_tokenize: False
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.10/site-packages/swift/cli/sft.py", line 5, in <module>
    sft_main()
  File "/root/miniconda3/lib/python3.10/site-packages/swift/utils/run_utils.py", line 25, in x_main
    args, remaining_argv = parse_args(args_class, argv)
  File "/root/miniconda3/lib/python3.10/site-packages/swift/utils/utils.py", line 98, in parse_args
    args, remaining_args = parser.parse_args_into_dataclasses(
  File "/root/miniconda3/lib/python3.10/site-packages/transformers/hf_argparser.py", line 338, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 134, in __init__
  File "/root/miniconda3/lib/python3.10/site-packages/swift/llm/utils/argument.py", line 447, in __post_init__
    self._init_training_args()
  File "/root/miniconda3/lib/python3.10/site-packages/swift/llm/utils/argument.py", line 472, in _init_training_args
    training_args = Seq2SeqTrainingArguments(
  File "<string>", line 132, in __init__
  File "/root/miniconda3/lib/python3.10/site-packages/swift/trainers/arguments.py", line 44, in __post_init__
    super().__post_init__()
  File "/root/miniconda3/lib/python3.10/site-packages/transformers/training_args.py", line 1528, in __post_init__
    and (self.device.type != "cuda")
  File "/root/miniconda3/lib/python3.10/site-packages/transformers/training_args.py", line 1995, in device
    return self._setup_devices
  File "/root/miniconda3/lib/python3.10/site-packages/transformers/utils/generic.py", line 56, in __get__
    cached = self.fget(obj)
  File "/root/miniconda3/lib/python3.10/site-packages/transformers/training_args.py", line 1931, in _setup_devices
    self.distributed_state = PartialState(
  File "/root/miniconda3/lib/python3.10/site-packages/accelerate/state.py", line 274, in __init__
    self.num_processes = torch.distributed.get_world_size()
  File "/root/miniconda3/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1492, in get_world_size
    return _get_group_size(group)
  File "/root/miniconda3/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 785, in _get_group_size
    default_pg = _get_default_group()
  File "/root/miniconda3/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 940, in _get_default_group
    raise RuntimeError(
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

BUJIDAOVS commented 5 months ago

I have the same error:

run sh: python /data/bujidao/Documents/Projects/swift/swift/swift-main/swift/cli/sft.py --model_type qwen1half-32b-chat --model_dir /home/liuyuyan/Documents/Projects/hf-download/Qwen1.5-32B-Chat --sft_type lora --tuner_backend swift --dtype AUTO --output_dir /home/liuyuyan/Documents/Projects/swift/qwen32b/output --custom_train_dataset_path /home/liuyuyan/Documents/Projects/swift/qwen32b/dataset/bujidao.jsonl /home/liuyuyan/Documents/Projects/swift/qwen32b/dataset/ruozhiba.jsonl --num_train_epochs 1 --max_length 4096 --check_dataset_strategy warning --lora_rank 8 --lora_alpha 32 --lora_dropout_p 0.05 --lora_target_modules DEFAULT --gradient_checkpointing true --batch_size 1 --weight_decay 0.1 --learning_rate 1e-4 --gradient_accumulation_steps 16 --max_grad_norm 0.5 --warmup_ratio 0.03 --eval_steps 100 --save_steps 100 --save_total_limit 2 --logging_steps 10 --use_flash_attn true
2024-04-06 23:45:35,790 - modelscope - INFO - PyTorch version 2.1.2 Found.
2024-04-06 23:45:35,791 - modelscope - INFO - Loading ast index from /home/liuyuyan/.cache/modelscope/ast_indexer
2024-04-06 23:45:35,821 - modelscope - INFO - Loading done! Current index file version is 1.13.3, with md5 1f502bd6f5e5b6e8d012ff16cac25ba4 and a total number of 972 components indexed
[INFO:swift] Start time of running main: 2024-04-06 23:45:36.242233
[INFO:swift] Setting template_type: qwen
[INFO:swift] Setting args.lazy_tokenize: False
Traceback (most recent call last):
  File "/data/bujidao/Documents/Projects/swift/swift/swift-main/swift/cli/sft.py", line 5, in <module>
    sft_main()
  File "/data/bujidao/Documents/Projects/swift/swift/swift-main/swift/utils/run_utils.py", line 25, in x_main
    args, remaining_argv = parse_args(args_class, argv)
  File "/data/bujidao/Documents/Projects/swift/swift/swift-main/swift/utils/utils.py", line 98, in parse_args
    args, remaining_args = parser.parse_args_into_dataclasses(
  File "/home/liuyuyan/miniconda3/envs/swift/lib/python3.10/site-packages/transformers/hf_argparser.py", line 338, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 137, in __init__
  File "/data/bujidao/Documents/Projects/swift/swift/swift-main/swift/llm/utils/argument.py", line 460, in __post_init__
    self._init_training_args()
  File "/data/bujidao/Documents/Projects/swift/swift/swift-main/swift/llm/utils/argument.py", line 485, in _init_training_args
    training_args = Seq2SeqTrainingArguments(
  File "<string>", line 132, in __init__
  File "/data/bujidao/Documents/Projects/swift/swift/swift-main/swift/trainers/arguments.py", line 44, in __post_init__
    super().__post_init__()
  File "/home/liuyuyan/miniconda3/envs/swift/lib/python3.10/site-packages/transformers/training_args.py", line 1528, in __post_init__
    and (self.device.type != "cuda")
  File "/home/liuyuyan/miniconda3/envs/swift/lib/python3.10/site-packages/transformers/training_args.py", line 1995, in device
    return self._setup_devices
  File "/home/liuyuyan/miniconda3/envs/swift/lib/python3.10/site-packages/transformers/utils/generic.py", line 56, in __get__
    cached = self.fget(obj)
  File "/home/liuyuyan/miniconda3/envs/swift/lib/python3.10/site-packages/transformers/training_args.py", line 1931, in _setup_devices
    self.distributed_state = PartialState(
  File "/home/liuyuyan/miniconda3/envs/swift/lib/python3.10/site-packages/accelerate/state.py", line 274, in __init__
    self.num_processes = torch.distributed.get_world_size()
  File "/home/liuyuyan/miniconda3/envs/swift/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1492, in get_world_size
    return _get_group_size(group)
  File "/home/liuyuyan/miniconda3/envs/swift/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 785, in _get_group_size
    default_pg = _get_default_group()
  File "/home/liuyuyan/miniconda3/envs/swift/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 940, in _get_default_group
    raise RuntimeError(
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group

sunyrain commented 5 months ago

I looked into it: training runs when the index file version is 1.13.1, but the version downloaded now is 1.13.3, and with that one the error appears.

BUJIDAOVS commented 5 months ago

> I looked into it: training runs when the index file version is 1.13.1, but the version downloaded now is 1.13.3, and with that one the error appears.

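How do I roll back? The copy I installed directly with pip has not been updated for qwen32b yet, and the one I cloned with git and built from source hits this error.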

sunyrain commented 5 months ago

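I don't have a fix at the moment either; I just happened to have a machine still running from before. Let's wait for an official reply. It feels like something along the lines of a local rank problem.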

robinzhang1210 commented 5 months ago

pip install accelerate==0.28.0
pip install altair==5.2.0
pip install contourpy==1.2.0
pip install fastapi==0.110.0
pip install fonttools==4.50.0
pip install gradio==4.24.0
pip install gradio_client==0.14.0
pip install httpcore==1.0.4
pip install huggingface-hub==0.22.1
pip install matplotlib==3.8.3
pip install modelscope==1.13.3
pip install networkx==3.2.1
pip install nvidia-nvjitlink-cu12==12.4.99
pip install orjson==3.9.15
pip install pillow==10.2.0
pip install protobuf==5.26.0
pip install pycparser==2.21
pip install ruff==0.3.4
pip install scipy==1.12.0
pip install starlette==0.36.3
pip install transformers==4.39.2
pip install typer==0.10.0
pip install typing_extensions==4.10.0
pip install tyro==0.7.3
pip install Werkzeug==3.0.1

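This is the list I got by comparing environments. After updating to these versions it worked, though I don't know which package actually made the difference.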
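A comparison like the one above can be automated. The following is a minimal sketch, assuming you have saved `pip freeze` output from a working and a failing environment; the function names and the file names `working_requirements.txt` and `broken_requirements.txt` are hypothetical and not part of swift.

```python
import sys


def parse_freeze(text):
    """Parse `pip freeze` output into {package_name: version}."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and entries that are not pinned with "==".
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def diff_pins(working, broken):
    """Return {name: (working_version, broken_version)} for every mismatch."""
    changed = {}
    for name in sorted(set(working) | set(broken)):
        w, b = working.get(name), broken.get(name)
        if w != b:
            changed[name] = (w, b)
    return changed


def format_pins(changed):
    """Format the working-side versions as requirement lines."""
    return [f"{name}=={w}" for name, (w, _) in changed.items() if w is not None]


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage:
    #   python diff_freeze.py working_requirements.txt broken_requirements.txt
    with open(sys.argv[1]) as f_working, open(sys.argv[2]) as f_broken:
        mismatches = diff_pins(parse_freeze(f_working.read()),
                               parse_freeze(f_broken.read()))
    for requirement in format_pins(mismatches):
        print(requirement)
```

The output only lists packages whose pinned versions differ, so it narrows the search but does not by itself identify which package causes the error.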

sunyrain commented 5 months ago

Thanks!

trunks023 commented 5 months ago

@robinzhang1210 Thanks, that solved the problem for now.

sunyrain commented 5 months ago

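It seems to have broken again this afternoon: I installed the libraries listed above on a freshly started machine and still got the same error.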

sunyrain commented 5 months ago

Oh, it works again. It looks like an accelerate problem: going from 0.29 back to 0.28 seems to have fixed it.
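To check which side of that boundary an environment is on before launching a run, something like the following could be used. It is a rough sketch: the helper names are hypothetical, and the 0.28.0 threshold comes only from the reports in this thread, not from any documented compatibility table.

```python
from importlib import metadata


def version_tuple(version):
    """Turn '0.29.3' into (0, 29, 3), stopping at suffixes such as 'post1'."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def is_newer_than_known_good(installed, known_good="0.28.0"):
    """True if `installed` is a later release than the version reported to work."""
    return version_tuple(installed) > version_tuple(known_good)


def check_accelerate():
    """Describe the installed accelerate version relative to 0.28.0."""
    try:
        installed = metadata.version("accelerate")
    except metadata.PackageNotFoundError:
        return "accelerate is not installed"
    if is_newer_than_known_good(installed):
        return (f"accelerate {installed} is newer than 0.28.0; "
                "if the process-group error appears, try pinning accelerate==0.28.0")
    return f"accelerate {installed} is at or below 0.28.0"


if __name__ == "__main__":
    print(check_accelerate())
```

This only reports the installed version; whether a given accelerate release works depends on the swift version in use.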

robinzhang1210 commented 5 months ago

I'd suggest first saving the dependency information from a working environment: pip freeze > lcc_requirements.txt

tastelikefeet commented 5 months ago

We had a problem supporting the latest version of accelerate. It was fixed today; please pull the latest code and try again.

zqpossible commented 5 months ago

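> I looked into it: training runs when the index file version is 1.13.1, but the version downloaded now is 1.13.3, and with that one the error appears.
>
> How do I roll back? The copy I installed directly with pip has not been updated for qwen32b yet, and the one I cloned with git and built from source hits this error.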

A question: has qwen32b support been updated now? Either the swift platform or the sft code would be fine.

geekinglcq commented 3 months ago

> We had a problem supporting the latest version of accelerate. It was fixed today; please pull the latest code and try again.

The problem still occurs: LoRA, single node with 8 GPUs, Qwen-110B. Versions: ms-swift 2.0.5.post1, accelerate 0.30.1