Open cooper12121 opened 6 months ago
Hi @cooper12121, you are trying to place 2 replicas that each use 16 GPUs with your current settings. Please try tensor_parallel=8, replica_num=2. The tensor_parallel value is per-model.
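To make the arithmetic behind this suggestion concrete, here is a small sketch. It assumes the cluster has 16 GPUs in total (two nodes with 8 GPUs each, inferred from the two host addresses and the CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 line in the logs) and that the original settings were tensor_parallel=16, replica_num=2. The helper gpus_required is hypothetical and not part of MII or DeepSpeed.

```python
def gpus_required(tensor_parallel: int, replica_num: int) -> int:
    """Total GPUs needed when each replica is sharded over tensor_parallel GPUs."""
    return tensor_parallel * replica_num


# Assumption: 2 nodes x 8 GPUs each.
TOTAL_GPUS = 16

# (tensor_parallel, replica_num): the presumed original settings, then the suggested ones.
for tp, replicas in [(16, 2), (8, 2)]:
    needed = gpus_required(tp, replicas)
    verdict = "fits" if needed <= TOTAL_GPUS else "does not fit"
    print(f"tensor_parallel={tp}, replica_num={replicas}: needs {needed} GPUs, {verdict}")
```

Because tensor_parallel applies to each replica, the GPU count multiplies with replica_num, which is why halving tensor_parallel brings the request back within 16 GPUs.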
Another problem occurs:
11.215.50.158: Traceback (most recent call last):
11.215.50.158: File "/usr/local/python/bin/deepspeed", line 6, in <module>
11.215.50.158: main()
11.215.50.158: File "/usr/local/python/lib/python3.8/site-packages/deepspeed/launcher/runner.py", line 426, in main
11.215.50.158: active_resources = parse_inclusion_exclusion(resource_pool, args.include, args.exclude)
11.215.50.158: File "/usr/local/python/lib/python3.8/site-packages/deepspeed/launcher/runner.py", line 350, in parse_inclusion_exclusion
11.215.50.158: return parse_resource_filter(active_resources, include_str=inclusion, exclude_str=exclusion)
11.215.50.158: File "/usr/local/python/lib/python3.8/site-packages/deepspeed/launcher/runner.py", line 299, in parse_resource_filter
11.215.50.158: raise ValueError(f"Hostname '{hostname}' not found in hostfile")
11.215.50.158: ValueError: Hostname '11.215.50.158' not found in hostfile
11.218.13.123: [2024-03-15 15:35:40,817] [INFO] [server.py:65:_wait_until_server_is_live] waiting for server to start...
11.218.13.123: [2024-03-15 15:35:40,817] [INFO] [server.py:65:_wait_until_server_is_live] waiting for server to start...
11.215.50.158: [2024-03-15 15:38:51,109] [INFO] [server.py:65:_wait_until_server_is_live] waiting for server to start...
11.215.50.158: [2024-03-15 15:38:51,109] [INFO] [server.py:65:_wait_until_server_is_live] waiting for server to start...
11.218.13.123: [2024-03-15 15:35:40,863] [INFO] [server.py:65:_wait_until_server_is_live] waiting for server to start...
11.218.13.123: [2024-03-15 15:35:40,863] [INFO] [server.py:65:_wait_until_server_is_live] waiting for server to start...
11.215.50.158: [2024-03-15 15:38:51,148] [WARNING] [runner.py:202:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
11.215.50.158: Detected CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 but ignoring it because one or several of --include/--exclude/--num_gpus/--num_nodes cl args were used. If you want to use CUDA_VISIBLE_DEVICES don't pass any of these arguments to deepspeed.
And this problem:
11.215.50.158: [2024-03-15 15:50:01,875] [INFO] [server.py:65:_wait_until_server_is_live] waiting for server to start...
11.215.50.158: [2024-03-15 15:50:01,875] [INFO] [server.py:65:_wait_until_server_is_live] waiting for server to start...
11.215.50.158: Traceback (most recent call last):
11.215.50.158: File "/apdcephfs_qy3/share_301372554/share_info/qianggao/Chinese-Mixtral/eval/eval1.py", line 421, in <module>
11.215.50.158: deepspeed_fastgen_deployment()
11.215.50.158: File "/apdcephfs_qy3/share_301372554/share_info/qianggao/Chinese-Mixtral/eval/eval1.py", line 318, in deepspeed_fastgen_deployment
11.215.50.158: client = mii.serve("/apdcephfs_qy3/share_301372554/share_info/qianggao/Chinese-Mixtral/model_output/train_instruct___bs_64_maxlen_2048_pad_right_lr_1e-6_format_Neo_t_single_turn_03-11",
11.215.50.158: File "/usr/local/python/lib/python3.8/site-packages/mii/api.py", line 155, in serve
11.215.50.158: import_score_file(mii_config.deployment_name, DeploymentType.LOCAL).init()
11.215.50.158: File "/tmp/mii_cache/train_instruct___bs_64_maxlen_2048_pad_right_lr_1e-6_format_Neo_t_single_turn_03-11-mii-deployment/score.py", line 33, in init
11.215.50.158: mii.backend.MIIServer(mii_config)
11.215.50.158: File "/usr/local/python/lib/python3.8/site-packages/mii/backend/server.py", line 47, in __init__
11.215.50.158: self._wait_until_server_is_live(processes,
11.215.50.158: File "/usr/local/python/lib/python3.8/site-packages/mii/backend/server.py", line 62, in _wait_until_server_is_live
11.215.50.158: raise RuntimeError(
11.215.50.158: RuntimeError: server crashed for some reason, unable to proceed
It doesn't seem to detect the other node, but my hostfile configuration is correct (it works for training), and my environment also detects all GPUs.
Can you help me?
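One thing worth checking, given the "Unable to find hostfile" warning and the "Hostname '11.215.50.158' not found in hostfile" error above, is whether the hostfile that the launcher actually reads on each node lists both hosts. The sketch below parses hostfile text in the usual DeepSpeed format (one "hostname slots=N" entry per line) and reports which expected hosts are missing. The function names and the slots=8 values are assumptions for illustration only, and this is not how the DeepSpeed launcher itself is implemented.

```python
from typing import Dict, List


def parse_hostfile(text: str) -> Dict[str, int]:
    """Parse 'hostname slots=N' lines into {hostname: slots}; '#' starts a comment."""
    hosts: Dict[str, int] = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue  # skip blank and comment-only lines
        parts = line.split()
        if len(parts) != 2 or not parts[1].startswith("slots="):
            raise ValueError(f"malformed hostfile line: {raw!r}")
        slots = parts[1][len("slots="):]
        if not slots.isdigit():
            raise ValueError(f"malformed slot count in line: {raw!r}")
        hosts[parts[0]] = int(slots)
    return hosts


def missing_hosts(hosts: Dict[str, int], expected: List[str]) -> List[str]:
    """Return the expected hostnames that the hostfile does not list."""
    return [name for name in expected if name not in hosts]


# Assumption: both nodes from the logs, 8 GPUs each.
example = """
11.215.50.158 slots=8
11.218.13.123 slots=8
"""
hosts = parse_hostfile(example)
print(missing_hosts(hosts, ["11.215.50.158", "11.218.13.123"]))
```

Running the same check against the text of the real hostfile on both nodes would show whether the file is missing on one node or simply written with different hostnames than the ones the launcher passes via --include.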
Describe the bug
How can I fix these errors?