Qwen2 7B DPO error on a single machine with 8 GPUs

zhangfan-algo commented 4 weeks ago

Describe the bug

Traceback (most recent call last):
  File "/apps1/zhangfan/anaconda3/envs/new_swift/lib/python3.10/site-packages/swift/cli/rlhf.py", line 5, in
    rlhf_main()
  File "/apps1/zhangfan/anaconda3/envs/new_swift/lib/python3.10/site-packages/swift/utils/run_utils.py", line 22, in x_main
    args, remaining_argv = parse_args(args_class, argv)
  File "/apps1/zhangfan/anaconda3/envs/new_swift/lib/python3.10/site-packages/swift/utils/utils.py", line 131, in parse_args
    args, remaining_args = parser.parse_args_into_dataclasses(argv, return_remaining_strings=True)
  File "/apps1/zhangfan/anaconda3/envs/new_swift/lib/python3.10/site-packages/transformers/hf_argparser.py", line 339, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "", line 212, in __init__
  File "/apps1/zhangfan/anaconda3/envs/new_swift/lib/python3.10/site-packages/swift/llm/utils/argument.py", line 1680, in __post_init__
    super().__post_init__()
  File "/apps1/zhangfan/anaconda3/envs/new_swift/lib/python3.10/site-packages/swift/llm/utils/argument.py", line 1031, in __post_init__
    raise ValueError('DeepSpeed is not compatible with MP. '
ValueError: DeepSpeed is not compatible with MP. n_gpu: 7, local_world_size: 1.

Your hardware and system info

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 swift rlhf \
    --model_cache_dir Qwen2-7B-Instruct \
    --model_type qwen2-7b-instruct \
    --rlhf_type dpo \
    --beta 0.1 \
    --sft_beta 0.1 \
    --sft_type lora \
    --lora_target_modules ALL \
    --init_lora_weights pissa \
    --tuner_backend swift \
    --template_type AUTO \
    --ddp_backend nccl \
    --custom_train_dataset_path dpo_zh_demo_format.jsonl \
    --output_dir test_dpo2 \
    --preprocess_num_proc 60 \
    --dataloader_num_workers 60 \
    --train_dataset_sample -1 \
    --evaluation_strategy steps \
    --eval_steps 200 \
    --eval_batch_size 1 \
    --dataset_test_ratio 0.01 \
    --max_length 1024 \
    --lr_scheduler_type cosine \
    --num_train_epochs 5 \
    --save_total_limit 5 \
    --save_strategy epoch \
    --logging_steps 10 \
    --batch_size 2 \
    --check_dataset_strategy warning \
    --gradient_checkpointing true \
    --gradient_accumulation_steps 1 \
    --weight_decay 0.01 \
    --learning_rate 1e-5 \
    --deepspeed ds_z2_offload_config.json \
    --max_grad_norm 0.5 \
    --warmup_ratio 0.03 \
    --use_flash_attn true \
    --push_to_hub false \
    --lazy_tokenize true \
    --save_only_model false \
    --save_on_each_node false \
    --neftune_noise_alpha 5 \
    --dtype AUTO

hjh0119 commented 4 weeks ago

DeepSpeed is not compatible with MP. n_gpu: 7, local_world_size: 1.

device map and DeepSpeed cannot be used together.

zhangfan-algo commented 4 weeks ago

So how should I set this up in the launch script?

hjh0119 commented 4 weeks ago

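To avoid device map: set NPROC_PER_NODE equal to the number of GPUs so that training runs with data parallelism, or use a single GPU.
To avoid DeepSpeed: remove the deepspeed argument.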
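
For the first option, a minimal sketch of how the launch environment could be prepared. It assumes the same GPU list as the original command and derives NPROC_PER_NODE from it so the two cannot drift apart; deriving the count this way is an illustration, not something the thread prescribes. The training command itself is left as a comment because it depends on the local setup.

```shell
# GPUs to use on this machine (same list as in the original command).
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

# Count the comma-separated entries so NPROC_PER_NODE equals the number of GPUs.
NPROC_PER_NODE=$(printf '%s\n' "$CUDA_VISIBLE_DEVICES" | tr ',' '\n' | wc -l | tr -d ' ')
export NPROC_PER_NODE
echo "NPROC_PER_NODE=${NPROC_PER_NODE}"

# With both variables exported, launch as before and keep --deepspeed, e.g.:
#   swift rlhf --rlhf_type dpo --model_type qwen2-7b-instruct \
#       --deepspeed ds_z2_offload_config.json ...
```

With one process per GPU, each process sees a single device, so no device map is needed and the DeepSpeed check should no longer trigger.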

zhangfan-algo commented 4 weeks ago

Is it the same for multi-node, multi-GPU training?

zhangfan-algo commented 4 weeks ago

num_nodes=8
num_gpu_per_node=8
node_rank=0
master_addr=0.0.0.0
master_port=6009

torchrun --nproc_per_node ${num_gpu_per_node} --master_port ${master_port} --master_addr ${master_addr} --node_rank ${node_rank} python examples/pytorch/llm/llm_rlhf.py \
    --model_cache_dir Qwen2-7B-Instruct \
    --model_type qwen2-7b-instruct \
    --rlhf_type dpo \
    --beta 0.1 \
    --sft_beta 0.1 \
    --sft_type lora \
    --lora_target_modules ALL \
    --init_lora_weights pissa \
    --tuner_backend swift \
    --template_type AUTO \
    --ddp_backend nccl \
    --custom_train_dataset_path dpo_zh_demo_format.jsonl \
    --output_dir test_dpo2 \
    --preprocess_num_proc 60 \
    --dataloader_num_workers 60 \
    --train_dataset_sample -1 \
    --evaluation_strategy steps \
    --eval_steps 200 \
    --eval_batch_size 1 \
    --dataset_test_ratio 0.01 \
    --max_length 1024 \
    --lr_scheduler_type cosine \
    --num_train_epochs 5 \
    --save_total_limit 5 \
    --save_strategy epoch \
    --logging_steps 10 \
    --batch_size 2 \
    --check_dataset_strategy warning \
    --gradient_checkpointing true \
    --gradient_accumulation_steps 1 \
    --weight_decay 0.01 \
    --learning_rate 1e-5 \
    --max_grad_norm 0.5 \
    --warmup_ratio 0.03 \
    --use_flash_attn true \
    --push_to_hub false \
    --lazy_tokenize true \
    --save_only_model false \
    --save_on_each_node false \
    --neftune_noise_alpha 5 \
    --dtype AUTO

zhangfan-algo commented 4 weeks ago

W0816 09:11:15.782000 140379222689600 torch/distributed/run.py:779]
W0816 09:11:15.782000 140379222689600 torch/distributed/run.py:779]
W0816 09:11:15.782000 140379222689600 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0816 09:11:15.782000 140379222689600 torch/distributed/run.py:779]
/apps1/zhangfan/anaconda3/envs/new_swift/bin/python: can't open file '/mnt/pfs/zhangfan/study_info/swift_0812/python': [Errno 2] No such file or directory
/apps1/zhangfan/anaconda3/envs/new_swift/bin/python: can't open file '/mnt/pfs/zhangfan/study_info/swift_0812/python': [Errno 2] No such file or directory
/apps1/zhangfan/anaconda3/envs/new_swift/bin/python: can't open file '/mnt/pfs/zhangfan/study_info/swift_0812/python': [Errno 2] No such file or directory
/apps1/zhangfan/anaconda3/envs/new_swift/bin/python: can't open file '/mnt/pfs/zhangfan/study_info/swift_0812/python': [Errno 2] No such file or directory
/apps1/zhangfan/anaconda3/envs/new_swift/bin/python: can't open file '/mnt/pfs/zhangfan/study_info/swift_0812/python': [Errno 2] No such file or directory
/apps1/zhangfan/anaconda3/envs/new_swift/bin/python: can't open file '/mnt/pfs/zhangfan/study_info/swift_0812/python': [Errno 2] No such file or directory
/apps1/zhangfan/anaconda3/envs/new_swift/bin/python: can't open file '/mnt/pfs/zhangfan/study_info/swift_0812/python': [Errno 2] No such file or directory
/apps1/zhangfan/anaconda3/envs/new_swift/bin/python: can't open file '/mnt/pfs/zhangfan/study_info/swift_0812/python': [Errno 2] No such file or directory
E0816 09:11:15.897000 140379222689600 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 2) local_rank: 0 (pid: 2043206) of binary: /apps1/zhangfan/anaconda3/envs/new_swift/bin/python
Traceback (most recent call last):
  File "/apps1/zhangfan/anaconda3/envs/new_swift/bin/torchrun", line 33, in
    sys.exit(load_entry_point('torch==2.4.0', 'console_scripts', 'torchrun')())
  File "/apps1/zhangfan/anaconda3/envs/new_swift/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/apps1/zhangfan/anaconda3/envs/new_swift/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/apps1/zhangfan/anaconda3/envs/new_swift/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/apps1/zhangfan/anaconda3/envs/new_swift/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/apps1/zhangfan/anaconda3/envs/new_swift/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

zhangfan-algo commented 4 weeks ago

It reports that python cannot be found.
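
The "can't open file '.../python'" lines suggest that torchrun treated the word `python` as the script to run: torchrun starts the Python interpreter itself, so its first positional argument should be the training script. Below is a minimal sketch of a corrected launch line, reusing the variables from the earlier comment. Passing `--nnodes` is an addition that is not in the original command, and `master_addr` is kept at the thread's placeholder value; for a real multi-node run it would normally be an address of the rank-0 node that the other nodes can reach.

```shell
num_nodes=8
num_gpu_per_node=8
node_rank=0            # differs per machine: 0 .. num_nodes-1
master_addr=0.0.0.0    # placeholder from the thread, see the note above
master_port=6009

# No "python" before the script path: torchrun invokes the interpreter itself.
launch_cmd="torchrun --nnodes ${num_nodes} --nproc_per_node ${num_gpu_per_node} --node_rank ${node_rank} --master_addr ${master_addr} --master_port ${master_port} examples/pytorch/llm/llm_rlhf.py"
echo "${launch_cmd}"

# Run it with the original training flags appended, e.g.:
#   ${launch_cmd} --rlhf_type dpo --model_type qwen2-7b-instruct ...
```

The same command would be started on every node, changing only `node_rank`.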

modelscope / ms-swift

Qwen2 7B DPO error on a single machine with 8 GPUs #1715