Closed: zhangfan-algo closed this issue 2 weeks ago
DeepSpeed is not compatible with MP. n_gpu: 7, local_world_size: 1.
device map and DeepSpeed cannot be used together.
So how should I set this up in the launch script?
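One way to approach this, sketched under assumptions: launch one process per GPU, so that each process sees a single device and no device map is spread across several GPUs. The sketch assumes the `swift` CLI reads an `NPROC_PER_NODE` environment variable to start distributed training, which is worth confirming for your installed version. `run` is a hypothetical helper that only prints the command unless `DRY_RUN=0` is set, and only a few flags from the report are shown.

```shell
# Hypothetical dry-run wrapper: prints the command by default,
# executes it only when DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$@"
  else
    "$@"
  fi
}

# Assumption: swift starts one worker per GPU when NPROC_PER_NODE is set,
# so DeepSpeed runs data-parallel instead of colliding with a device map.
run env NPROC_PER_NODE=8 CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
  swift rlhf \
  --rlhf_type dpo \
  --model_type qwen2-7b-instruct \
  --sft_type lora \
  --deepspeed ds_z2_offload_config.json
```

With eight processes and eight visible GPUs, each process should end up with exactly one GPU, which is the layout the error message asks for.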
Is it the same for multi-node, multi-GPU training?
num_nodes=8
num_gpu_per_node=8
node_rank=0
master_addr=0.0.0.0
master_port=6009

torchrun --nproc_per_node ${num_gpu_per_node} --master_port ${master_port} --master_addr ${master_addr} --node_rank ${node_rank} python examples/pytorch/llm/llm_rlhf.py \
    --model_cache_dir Qwen2-7B-Instruct \
    --model_type qwen2-7b-instruct \
    --rlhf_type dpo \
    --beta 0.1 \
    --sft_beta 0.1 \
    --sft_type lora \
    --lora_target_modules ALL \
    --init_lora_weights pissa \
    --tuner_backend swift \
    --template_type AUTO \
    --ddp_backend nccl \
    --custom_train_dataset_path dpo_zh_demo_format.jsonl \
    --output_dir test_dpo2 \
    --preprocess_num_proc 60 \
    --dataloader_num_workers 60 \
    --train_dataset_sample -1 \
    --evaluation_strategy steps \
    --eval_steps 200 \
    --eval_batch_size 1 \
    --dataset_test_ratio 0.01 \
    --max_length 1024 \
    --lr_scheduler_type cosine \
    --num_train_epochs 5 \
    --save_total_limit 5 \
    --save_strategy epoch \
    --logging_steps 10 \
    --batch_size 2 \
    --check_dataset_strategy warning \
    --gradient_checkpointing true \
    --gradient_accumulation_steps 1 \
    --weight_decay 0.01 \
    --learning_rate 1e-5 \
    --max_grad_norm 0.5 \
    --warmup_ratio 0.03 \
    --use_flash_attn true \
    --push_to_hub false \
    --lazy_tokenize true \
    --save_only_model false \
    --save_on_each_node false \
    --neftune_noise_alpha 5 \
    --dtype AUTO
W0816 09:11:15.782000 140379222689600 torch/distributed/run.py:779]
W0816 09:11:15.782000 140379222689600 torch/distributed/run.py:779]
W0816 09:11:15.782000 140379222689600 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0816 09:11:15.782000 140379222689600 torch/distributed/run.py:779]
/apps1/zhangfan/anaconda3/envs/new_swift/bin/python: can't open file '/mnt/pfs/zhangfan/study_info/swift_0812/python': [Errno 2] No such file or directory
/apps1/zhangfan/anaconda3/envs/new_swift/bin/python: can't open file '/mnt/pfs/zhangfan/study_info/swift_0812/python': [Errno 2] No such file or directory
/apps1/zhangfan/anaconda3/envs/new_swift/bin/python: can't open file '/mnt/pfs/zhangfan/study_info/swift_0812/python': [Errno 2] No such file or directory
/apps1/zhangfan/anaconda3/envs/new_swift/bin/python: can't open file '/mnt/pfs/zhangfan/study_info/swift_0812/python': [Errno 2] No such file or directory
/apps1/zhangfan/anaconda3/envs/new_swift/bin/python: can't open file '/mnt/pfs/zhangfan/study_info/swift_0812/python': [Errno 2] No such file or directory
/apps1/zhangfan/anaconda3/envs/new_swift/bin/python: can't open file '/mnt/pfs/zhangfan/study_info/swift_0812/python': [Errno 2] No such file or directory
/apps1/zhangfan/anaconda3/envs/new_swift/bin/python: can't open file '/mnt/pfs/zhangfan/study_info/swift_0812/python': [Errno 2] No such file or directory
/apps1/zhangfan/anaconda3/envs/new_swift/bin/python: can't open file '/mnt/pfs/zhangfan/study_info/swift_0812/python': [Errno 2] No such file or directory
E0816 09:11:15.897000 140379222689600 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 2) local_rank: 0 (pid: 2043206) of binary: /apps1/zhangfan/anaconda3/envs/new_swift/bin/python
Traceback (most recent call last):
File "/apps1/zhangfan/anaconda3/envs/new_swift/bin/torchrun", line 33, in
It reports that python cannot be found.
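The log suggests `torchrun` treated the word `python` as the script path: `torchrun` expects the training script directly after its own options and starts it with the interpreter itself. A hedged sketch of a corrected launch line follows. `torchrun_cmd` is a hypothetical helper that only prints the command, and `192.0.2.10` is a placeholder for the rank-0 node's reachable address. The sketch also passes `--nnodes`, since the command above defines `num_nodes` but never uses it.

```shell
# Hypothetical helper: prints a torchrun command line instead of running it.
# Arguments: nnodes nproc_per_node node_rank master_addr master_port script [args...]
torchrun_cmd() {
  nnodes="$1"; nproc="$2"; rank="$3"; addr="$4"; port="$5"
  shift 5
  # The script path follows the torchrun options directly, with no "python" in between.
  echo torchrun \
    --nnodes "$nnodes" \
    --nproc_per_node "$nproc" \
    --node_rank "$rank" \
    --master_addr "$addr" \
    --master_port "$port" \
    "$@"
}

# Placeholder values; node_rank must differ on each node (0 on the master node).
torchrun_cmd 8 8 0 192.0.2.10 6009 \
  examples/pytorch/llm/llm_rlhf.py --rlhf_type dpo --model_type qwen2-7b-instruct
```

The remaining training flags from the report would follow the script path unchanged.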
Describe the bug
Traceback (most recent call last):
File "/apps1/zhangfan/anaconda3/envs/new_swift/lib/python3.10/site-packages/swift/cli/rlhf.py", line 5, in
rlhf_main()
File "/apps1/zhangfan/anaconda3/envs/new_swift/lib/python3.10/site-packages/swift/utils/run_utils.py", line 22, in x_main
args, remaining_argv = parse_args(args_class, argv)
File "/apps1/zhangfan/anaconda3/envs/new_swift/lib/python3.10/site-packages/swift/utils/utils.py", line 131, in parse_args
args, remaining_args = parser.parse_args_into_dataclasses(argv, return_remaining_strings=True)
File "/apps1/zhangfan/anaconda3/envs/new_swift/lib/python3.10/site-packages/transformers/hf_argparser.py", line 339, in parse_args_into_dataclasses
obj = dtype(**inputs)
File "", line 212, in __init__
File "/apps1/zhangfan/anaconda3/envs/new_swift/lib/python3.10/site-packages/swift/llm/utils/argument.py", line 1680, in __post_init__
super().__post_init__()
File "/apps1/zhangfan/anaconda3/envs/new_swift/lib/python3.10/site-packages/swift/llm/utils/argument.py", line 1031, in __post_init__
raise ValueError('DeepSpeed is not compatible with MP. '
ValueError: DeepSpeed is not compatible with MP. n_gpu: 7, local_world_size: 1.
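The numbers in the message hint at the cause: a single process (`local_world_size: 1`) sees several GPUs, so the model would be split across them (MP), which DeepSpeed does not accept. A rough pre-flight check in the launch script could catch this before training starts. This is only a sketch, assuming the rule is "more than one visible GPU per process means MP"; `gpus_per_process` and `check_deepspeed_layout` are hypothetical helpers, not part of swift.

```shell
# Hypothetical helper: visible GPUs divided by processes on this node.
gpus_per_process() {
  visible="$1"   # e.g. the value of CUDA_VISIBLE_DEVICES
  nproc="$2"     # number of processes launched on this node
  n_gpu=$(printf '%s\n' "$visible" | tr ',' '\n' | grep -c .)
  echo $(( n_gpu / nproc ))
}

# Assumed rule: a ratio above 1 means each process would hold several GPUs (MP).
check_deepspeed_layout() {
  ratio=$(gpus_per_process "$1" "$2")
  if [ "$ratio" -gt 1 ]; then
    echo "MP layout: $ratio GPUs per process, do not combine with --deepspeed"
    return 1
  fi
  echo "OK: $ratio GPU per process"
}

# Eight visible GPUs with eight processes passes the check.
check_deepspeed_layout 0,1,2,3,4,5,6,7 8
# Eight visible GPUs with one process is flagged.
check_deepspeed_layout 0,1,2,3,4,5,6,7 1 || true
```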
Your hardware and system info
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 swift rlhf \
    --model_cache_dir Qwen2-7B-Instruct \
    --model_type qwen2-7b-instruct \
    --rlhf_type dpo \
    --beta 0.1 \
    --sft_beta 0.1 \
    --sft_type lora \
    --lora_target_modules ALL \
    --init_lora_weights pissa \
    --tuner_backend swift \
    --template_type AUTO \
    --ddp_backend nccl \
    --custom_train_dataset_path dpo_zh_demo_format.jsonl \
    --output_dir test_dpo2 \
    --preprocess_num_proc 60 \
    --dataloader_num_workers 60 \
    --train_dataset_sample -1 \
    --evaluation_strategy steps \
    --eval_steps 200 \
    --eval_batch_size 1 \
    --dataset_test_ratio 0.01 \
    --max_length 1024 \
    --lr_scheduler_type cosine \
    --num_train_epochs 5 \
    --save_total_limit 5 \
    --save_strategy epoch \
    --logging_steps 10 \
    --batch_size 2 \
    --check_dataset_strategy warning \
    --gradient_checkpointing true \
    --gradient_accumulation_steps 1 \
    --weight_decay 0.01 \
    --learning_rate 1e-5 \
    --deepspeed ds_z2_offload_config.json \
    --max_grad_norm 0.5 \
    --warmup_ratio 0.03 \
    --use_flash_attn true \
    --push_to_hub false \
    --lazy_tokenize true \
    --save_only_model false \
    --save_on_each_node false \
    --neftune_noise_alpha 5 \
    --dtype AUTO