microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

ValueError: Only able to place 0 replicas, but 2 replicas were requested. #5267

Open cooper12121 opened 6 months ago

cooper12121 commented 6 months ago

Describe the bug

I want to use DeepSpeed-FastGen for multi-node inference with Mixtral-Instruct 8x7B. My deployment is as follows:

import mii
client = mii.serve(
    "path of mixtral-instruct 8*7b",
    tensor_parallel=16,
    replica_num=2,
    hostfile="tmp/hostfile",
)
response = client.generate(["Deepspeed is", "Seattle is"], max_new_tokens=128)
print(response)

My hostfile looks like this:

ip1 slots=8
ip2 slots=8

My launch command looks like this:

deepspeed --hostfile ${TMP_DIR}/hostfile --master_addr ${MASTER_ADDR} --master_port=${MASTER_PORT}  ${WORKSPACE}/eval/eval1.py \
--deepspeed --deepspeed_config configs/ds_config_zero3.json

The same launch configuration works for training, but here I encountered the following error:

  • ValueError: Only able to place 0 replicas, but 2 replicas were requested.

How can I fix this error?

mrwyattii commented 6 months ago

Hi @cooper12121 you are trying to place 2 replicas that each use 16 GPUs with your current settings. Please try tensor_parallel=8, replica_num=2. The tensor_parallel value is per-model.
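
The arithmetic behind this suggestion can be sketched as follows. This is a simplified model, not MII's actual placement code: the helper names `parse_hostfile` and `placeable_replicas` are hypothetical, and the sketch assumes that each replica must fit entirely on one host, which is consistent with the "0 replicas" error above but is an assumption here.

```python
def parse_hostfile(text):
    """Parse 'host slots=N' lines into a {host: slots} dict."""
    pool = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        host, slots = line.split()
        key, _, value = slots.partition("=")
        if key != "slots":
            raise ValueError(f"unexpected hostfile entry: {line!r}")
        pool[host] = int(value)
    return pool


def placeable_replicas(pool, tensor_parallel):
    """Count replicas that fit, assuming a replica cannot span hosts."""
    return sum(slots // tensor_parallel for slots in pool.values())


# The hostfile from the original post: two hosts with 8 slots each.
pool = parse_hostfile("ip1 slots=8\nip2 slots=8\n")

print(placeable_replicas(pool, tensor_parallel=16))  # 0
print(placeable_replicas(pool, tensor_parallel=8))   # 2
```

Under that assumption, tensor_parallel=16 cannot be satisfied by any 8-slot host, while tensor_parallel=8 leaves room for exactly one replica per host, which matches replica_num=2.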

cooper12121 commented 6 months ago

Now another problem occurs:

11.215.50.158: Traceback (most recent call last):
11.215.50.158:   File "/usr/local/python/bin/deepspeed", line 6, in <module>
11.215.50.158:     main()
11.215.50.158:   File "/usr/local/python/lib/python3.8/site-packages/deepspeed/launcher/runner.py", line 426, in main
11.215.50.158:     active_resources = parse_inclusion_exclusion(resource_pool, args.include, args.exclude)
11.215.50.158:   File "/usr/local/python/lib/python3.8/site-packages/deepspeed/launcher/runner.py", line 350, in parse_inclusion_exclusion
11.215.50.158:     return parse_resource_filter(active_resources, include_str=inclusion, exclude_str=exclusion)
11.215.50.158:   File "/usr/local/python/lib/python3.8/site-packages/deepspeed/launcher/runner.py", line 299, in parse_resource_filter
11.215.50.158:     raise ValueError(f"Hostname '{hostname}' not found in hostfile")
11.215.50.158: ValueError: Hostname '11.215.50.158' not found in hostfile
11.218.13.123: [2024-03-15 15:35:40,817] [INFO] [server.py:65:_wait_until_server_is_live] waiting for server to start...
11.218.13.123: [2024-03-15 15:35:40,817] [INFO] [server.py:65:_wait_until_server_is_live] waiting for server to start...
11.215.50.158: [2024-03-15 15:38:51,109] [INFO] [server.py:65:_wait_until_server_is_live] waiting for server to start...
11.215.50.158: [2024-03-15 15:38:51,109] [INFO] [server.py:65:_wait_until_server_is_live] waiting for server to start...
11.218.13.123: [2024-03-15 15:35:40,863] [INFO] [server.py:65:_wait_until_server_is_live] waiting for server to start...
11.218.13.123: [2024-03-15 15:35:40,863] [INFO] [server.py:65:_wait_until_server_is_live] waiting for server to start...
11.215.50.158: [2024-03-15 15:38:51,148] [WARNING] [runner.py:202:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
11.215.50.158: Detected CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 but ignoring it because one or several of --include/--exclude/--num_gpus/--num_nodes cl args were used. If you want to use CUDA_VISIBLE_DEVICES don't pass any of these arguments to deepspeed.

And this problem as well:

11.215.50.158: [2024-03-15 15:50:01,875] [INFO] [server.py:65:_wait_until_server_is_live] waiting for server to start...
11.215.50.158: [2024-03-15 15:50:01,875] [INFO] [server.py:65:_wait_until_server_is_live] waiting for server to start...
11.215.50.158: Traceback (most recent call last):
11.215.50.158:   File "/apdcephfs_qy3/share_301372554/share_info/qianggao/Chinese-Mixtral/eval/eval1.py", line 421, in <module>
11.215.50.158:     deepspeed_fastgen_deployment()
11.215.50.158:   File "/apdcephfs_qy3/share_301372554/share_info/qianggao/Chinese-Mixtral/eval/eval1.py", line 318, in deepspeed_fastgen_deployment
11.215.50.158:     client = mii.serve("/apdcephfs_qy3/share_301372554/share_info/qianggao/Chinese-Mixtral/model_output/train_instruct___bs_64_maxlen_2048_pad_right_lr_1e-6_format_Neo_t_single_turn_03-11",
11.215.50.158:   File "/usr/local/python/lib/python3.8/site-packages/mii/api.py", line 155, in serve
11.215.50.158:     import_score_file(mii_config.deployment_name, DeploymentType.LOCAL).init()
11.215.50.158:   File "/tmp/mii_cache/train_instruct___bs_64_maxlen_2048_pad_right_lr_1e-6_format_Neo_t_single_turn_03-11-mii-deployment/score.py", line 33, in init
11.215.50.158:     mii.backend.MIIServer(mii_config)
11.215.50.158:   File "/usr/local/python/lib/python3.8/site-packages/mii/backend/server.py", line 47, in __init__
11.215.50.158:     self._wait_until_server_is_live(processes,
11.215.50.158:   File "/usr/local/python/lib/python3.8/site-packages/mii/backend/server.py", line 62, in _wait_until_server_is_live
11.215.50.158:     raise RuntimeError(
11.215.50.158: RuntimeError: server crashed for some reason, unable to proceed

It doesn't seem to detect the other node, but my hostfile configuration is correct (it works for training), and my environment also detects all GPUs:

image

Can you help me?
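
The logs above contain both "Unable to find hostfile" and "Hostname '11.215.50.158' not found in hostfile", while the Python call passes the relative path tmp/hostfile and the launch command uses ${TMP_DIR}/hostfile. One possible explanation, which is only an assumption, is that the relative path resolves to a different location on each node. A minimal check that could be run on every node is sketched below; `diagnose_hostfile` is a hypothetical helper, not part of DeepSpeed or MII.

```python
import os


def diagnose_hostfile(path, expected_hosts):
    """Return a list of problems found with the hostfile at `path`."""
    abs_path = os.path.abspath(path)  # show where a relative path really points
    if not os.path.isfile(abs_path):
        return [f"hostfile not found at {abs_path}"]
    with open(abs_path) as f:
        # The first token of each non-empty, non-comment line is the hostname.
        listed = {
            line.split()[0]
            for line in f
            if line.strip() and not line.lstrip().startswith("#")
        }
    return [
        f"host {host!r} is not listed in {abs_path}"
        for host in expected_hosts
        if host not in listed
    ]


# Hostnames taken from the logs above; an empty list means no problem was found.
print(diagnose_hostfile("tmp/hostfile", ["11.215.50.158", "11.218.13.123"]))
```

If the check reports a missing file on one of the nodes, passing an absolute hostfile path to mii.serve would be a reasonable thing to try next.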