"AssertionError: DeepSpeed is not compatible with MP." while trying to train on 8 gpus with deepspeed.

wenmingwei commented 6 months ago

I modified the training script examples/pytorch/llm/scripts/qwen_7b_chat/lora_ddp_ds/sft.sh in the examples folder to expose more GPUs through CUDA_VISIBLE_DEVICES:

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

After this modification, the function is_mp in swift/utils/torch_utils.py returned True, which triggered this AssertionError.

def is_mp() -> bool:
    n_gpu = torch.cuda.device_count()
    local_world_size = get_dist_setting()[3]
    assert n_gpu % local_world_size == 0
    if n_gpu // local_world_size >= 2:
        return True
    return False

assert not is_mp(), 'DeepSpeed is not compatible with MP.'
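
To make the arithmetic concrete, here is a minimal sketch of the same check with both inputs passed in explicitly, so it runs without torch or a GPU. The helper name is_mp_for is hypothetical and not part of swift; it only mirrors the logic quoted above.

```python
def is_mp_for(n_gpu: int, local_world_size: int) -> bool:
    """Same arithmetic as is_mp above, with both inputs passed in explicitly.

    is_mp_for is a hypothetical helper for illustration, not a swift function.
    """
    assert n_gpu % local_world_size == 0
    return n_gpu // local_world_size >= 2


# 8 visible GPUs but only 1 local process: 8 // 1 = 8, so this counts as MP.
print(is_mp_for(8, 1))  # True

# 8 visible GPUs and 8 local processes: 8 // 8 = 1, so this is not MP.
print(is_mp_for(8, 8))  # False
```

In other words, the assertion fires whenever each local process would see two or more GPUs.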

I tried to override local_world_size with the environment variable 'local_world_size', but DeepSpeed resets local_world_size to 1.

My question is: how can I fine-tune a model with swift and DeepSpeed on more than 2 GPUs?

Jintao-Huang commented 6 months ago

Great question! I have organized the scripts for you. You can take a look here～ https://github.com/modelscope/swift?tab=readme-ov-file#training-scripts

bupengju commented 5 months ago

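Hello, following the example scripts above I still get the error: assert not is_mp(), 'DeepSpeed is not compatible with MP.' My swift version is already the latest. Do I need to set any other environment variables?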

bupengju commented 5 months ago

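Hello, following the example scripts above I still get the error: assert not is_mp(), 'DeepSpeed is not compatible with MP.' My swift version is already the latest. Do I need to set any other environment variables?

This is on a single machine with multiple GPUs.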

hunter-xue commented 5 months ago

I ran into the same problem. Temporarily disabling DeepSpeed ZeRO lets SFT proceed normally.

Jintao-Huang commented 5 months ago

The number of processes must match the number of GPUs. For example, with 8 GPUs you need to set NPROC_PER_NODE to 8.
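
As a rough sketch of that rule, the snippet below derives NPROC_PER_NODE from a CUDA_VISIBLE_DEVICES string so the two cannot drift apart. The helpers count_visible_gpus and launch_env are hypothetical names used only for illustration; they are not part of swift, and the snippet only builds the environment without launching anything.

```python
import os


def count_visible_gpus(cuda_visible_devices: str) -> int:
    """Count the device IDs in a CUDA_VISIBLE_DEVICES-style string."""
    return len([d for d in cuda_visible_devices.split(",") if d.strip()])


def launch_env(cuda_visible_devices: str) -> dict:
    """Return a copy of the environment with one process per visible GPU."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = cuda_visible_devices
    # One process per GPU, as recommended in the comment above.
    env["NPROC_PER_NODE"] = str(count_visible_gpus(cuda_visible_devices))
    return env


env = launch_env("0,1,2,3,4,5,6,7")
print(env["NPROC_PER_NODE"])  # 8
```

The resulting dictionary could then be passed as the env argument of subprocess.run when starting the training script.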

Jintao-Huang commented 5 months ago

See the scripts here: https://github.com/modelscope/swift/blob/main/docs/source/LLM/LLM%E5%BE%AE%E8%B0%83%E6%96%87%E6%A1%A3.md#%E5%BE%AE%E8%B0%83


hunter-xue commented 5 months ago

Setting NPROC_PER_NODE=$nproc_per_node causes the error below on a machine with 4 A10 GPUs. The swift version is 1.7.3.

(/ebs-data/venv) ubuntu@vm10-10-3-33:/ebs-data$ ./start-sft.sh
run sh:torchrun --nproc_per_node 4 --master_port 29500 /ebs-data/swift/swift/cli/sft.py --model_id_or_path /ebs-data/Qwen1.5/Qwen1.5-14B-Chat --model_type qwen1half-14b-chat --custom_train_dataset_path /ebs-data/training-data/train_data.csv --check_model_is_latest false --model_revision master --sft_type lora --tuner_backend swift --template_type qwen --dtype AUTO --output_dir sft_output --train_dataset_sample -1 --num_train_epochs 3 --max_length 2048 --check_dataset_strategy warning --lora_rank 8 --lora_alpha 32 --lora_dropout_p 0.05 --lora_target_modules ALL --gradient_checkpointing true --batch_size 1 --weight_decay 0.1 --learning_rate 1e-4 --gradient_accumulation_steps 4 --max_grad_norm 0.5 --warmup_ratio 0.03 --eval_steps 100 --save_steps 100 --save_total_limit 2 --logging_steps 10 --use_flash_attn false --save_only_model true --ddp_backend nccl
[2024-03-24 20:45:53,602] torch.distributed.run: [WARNING]
[2024-03-24 20:45:53,602] torch.distributed.run: [WARNING] *****************************************
[2024-03-24 20:45:53,602] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-03-24 20:45:53,602] torch.distributed.run: [WARNING] *****************************************
Traceback (most recent call last):
  File "/ebs-data/swift/swift/cli/sft.py", line 2, in <module>
    from swift.llm import sft_main
ModuleNotFoundError: No module named 'swift'
Traceback (most recent call last):
  File "/ebs-data/swift/swift/cli/sft.py", line 2, in <module>
    from swift.llm import sft_main
ModuleNotFoundError: No module named 'swift'
Traceback (most recent call last):
  File "/ebs-data/swift/swift/cli/sft.py", line 2, in <module>
    from swift.llm import sft_main
ModuleNotFoundError: No module named 'swift'
Traceback (most recent call last):
  File "/ebs-data/swift/swift/cli/sft.py", line 2, in <module>
    from swift.llm import sft_main
ModuleNotFoundError: No module named 'swift'
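
The ModuleNotFoundError in this log indicates that the Python interpreter running the torchrun workers could not locate the swift package; the thread does not say why. One way to narrow it down is a minimal check using only the standard library, run with the same interpreter that torchrun uses. The helper is_importable is a hypothetical name for illustration and is not part of swift.

```python
import importlib.util
import sys


def is_importable(module_name: str) -> bool:
    """Return True if this interpreter can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


# Show which interpreter is running and whether it can see the package.
print("interpreter:", sys.executable)
print("swift importable:", is_importable("swift"))
```

If this reports that swift is not importable, the package is probably not installed in (or not on the path of) the environment that the workers run in, though that is an assumption to verify rather than a confirmed diagnosis.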

skylyj commented 3 months ago

Same problem here: 'DeepSpeed is not compatible with MP.'

leedewdew commented 3 weeks ago

NPROC_PER_NODE=4. After I added this, it worked.

modelscope / ms-swift

"AssertionError: DeepSpeed is not compatible with MP." while trying to train on 8 gpus with deepspeed. #479