Bihan opened this issue 1 month ago
It is possible that issue 1 was caused by your system running out of file handles for the 4096 HTTP requests. Unless it has been configured otherwise, the default open-files limit on a Linux system is 1024, which is close to the 979 successful requests you saw. Could you confirm whether the failed requests report errors along the lines of `OSError: [Errno 24] Too many open files`, and check the open-files limit on your system (`ulimit -n`)? If possible, you could also try raising the limit to a higher value (e.g. `ulimit -n 8192`) and see whether that lets the remaining requests go through.
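For reference, here is a minimal sketch of checking (and raising) the limit from within Python using the standard `resource` module; the 8192 target is just an example value, and the change only affects the process that runs it:

```python
# Sketch: inspect and raise the per-process open-file limit (Linux only).
import errno
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open files: soft limit={soft}, hard limit={hard}")

# Errno 24 is EMFILE ("Too many open files"); this is the OSError to look for
# in the tracebacks of the failed requests.
print("errno 24 =", errno.errorcode[24])

# Raise the soft limit for this process, capped at the hard limit
# (the shell equivalent is `ulimit -n 8192`).
resource.setrlimit(resource.RLIMIT_NOFILE, (min(8192, hard), hard))
```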
For issue 2, running with TP=1, 2, 4, and 8 all works fine for me; I'm on 8x MI250 GCDs. Could you share the error logs of the server failure you saw?
@kliuae Regarding issue 2, below is the error log:
```
root@ENC1-CLS01-SVR13:/workflow/vllm# ROCR_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python3 -m vllm.entrypoints.openai.api_server --model=meta-llama/Llama-3.1-8B-Instruct --tensor-parallel-size=2 --dtype=float16 --disable-log-requests --disable-frontend-multiprocessing
WARNING 10-07 12:58:19 rocm.py:13] `fork` method is not supported by ROCm. VLLM_WORKER_MULTIPROC_METHOD is overridden to `spawn` instead.
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torchvision-0.16.1+fdea156-py3.10-linux-x86_64.egg/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torchvision-0.16.1+fdea156-py3.10-linux-x86_64.egg/torchvision/image.so: undefined symbol: _ZN3c1017RegisterOperatorsD1Ev'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
warn(
INFO 10-07 12:58:22 api_server.py:527] vLLM API server version 0.6.3.dev116+g151ef4ef
INFO 10-07 12:58:22 api_server.py:528] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=True, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='meta-llama/Llama-3.1-8B-Instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', config_format='auto', dtype='float16', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, scheduling_policy='fcfs', disable_log_requests=True, max_log_len=None, disable_fastapi_docs=False)
WARNING 10-07 12:58:22 config.py:1646] Casting torch.bfloat16 to torch.float16.
INFO 10-07 12:58:33 config.py:875] Defaulting to use mp for distributed inference
INFO 10-07 12:58:33 config.py:904] Disabled the custom all-reduce kernel because it is not supported on AMD GPUs.
WARNING 10-07 12:58:33 arg_utils.py:954] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
INFO 10-07 12:58:33 config.py:993] Chunked prefill is enabled with max_num_batched_tokens=512.
INFO 10-07 12:58:33 llm_engine.py:237] Initializing an LLM engine (v0.6.3.dev116+g151ef4ef) with config: model='meta-llama/Llama-3.1-8B-Instruct', speculative_config=None, tokenizer='meta-llama/Llama-3.1-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=meta-llama/Llama-3.1-8B-Instruct, use_v2_block_manager=True, num_scheduler_steps=1, chunked_prefill_enabled=True multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=False, mm_processor_kwargs=None)
WARNING 10-07 12:58:34 multiproc_gpu_executor.py:53] Reducing Torch parallelism from 104 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 10-07 12:58:34 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
INFO 10-07 12:58:34 selector.py:121] Using ROCmFlashAttention backend.
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torchvision-0.16.1+fdea156-py3.10-linux-x86_64.egg/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torchvision-0.16.1+fdea156-py3.10-linux-x86_64.egg/torchvision/image.so: undefined symbol: _ZN3c1017RegisterOperatorsD1Ev'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
warn(
(VllmWorkerProcess pid=6924) INFO 10-07 12:58:38 selector.py:121] Using ROCmFlashAttention backend.
(VllmWorkerProcess pid=6924) INFO 10-07 12:58:38 multiproc_worker_utils.py:216] Worker ready; awaiting tasks
(VllmWorkerProcess pid=6924) INFO 10-07 12:58:38 utils.py:1005] Found nccl from library librccl.so.1
(VllmWorkerProcess pid=6924) INFO 10-07 12:58:38 pynccl.py:63] vLLM is using nccl==2.18.6
INFO 10-07 12:58:38 utils.py:1005] Found nccl from library librccl.so.1
INFO 10-07 12:58:38 pynccl.py:63] vLLM is using nccl==2.18.6
INFO 10-07 12:58:38 shm_broadcast.py:241] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1], buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at 0x72f84cd84d60>, local_subscribe_port=44365, remote_subscribe_port=None)
(VllmWorkerProcess pid=6924) ERROR 10-07 12:58:38 multiproc_worker_utils.py:231] Exception in worker VllmWorkerProcess while processing method init_device: tuple index out of range, Traceback (most recent call last):
(VllmWorkerProcess pid=6924) ERROR 10-07 12:58:38 multiproc_worker_utils.py:231] File "/workflow/vllm/vllm/executor/multiproc_worker_utils.py", line 224, in _run_worker_process
(VllmWorkerProcess pid=6924) ERROR 10-07 12:58:38 multiproc_worker_utils.py:231] output = executor(*args, **kwargs)
(VllmWorkerProcess pid=6924) ERROR 10-07 12:58:38 multiproc_worker_utils.py:231] File "/workflow/vllm/vllm/worker/worker.py", line 180, in init_device
(VllmWorkerProcess pid=6924) ERROR 10-07 12:58:38 multiproc_worker_utils.py:231] set_random_seed(self.model_config.seed)
(VllmWorkerProcess pid=6924) ERROR 10-07 12:58:38 multiproc_worker_utils.py:231] File "/workflow/vllm/vllm/model_executor/utils.py", line 10, in set_random_seed
(VllmWorkerProcess pid=6924) ERROR 10-07 12:58:38 multiproc_worker_utils.py:231] seed_everything(seed)
(VllmWorkerProcess pid=6924) ERROR 10-07 12:58:38 multiproc_worker_utils.py:231] File "/workflow/vllm/vllm/utils.py", line 393, in seed_everything
(VllmWorkerProcess pid=6924) ERROR 10-07 12:58:38 multiproc_worker_utils.py:231] torch.cuda.manual_seed_all(seed)
(VllmWorkerProcess pid=6924) ERROR 10-07 12:58:38 multiproc_worker_utils.py:231] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/random.py", line 127, in manual_seed_all
(VllmWorkerProcess pid=6924) ERROR 10-07 12:58:38 multiproc_worker_utils.py:231] _lazy_call(cb, seed_all=True)
(VllmWorkerProcess pid=6924) ERROR 10-07 12:58:38 multiproc_worker_utils.py:231] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/__init__.py", line 244, in _lazy_call
(VllmWorkerProcess pid=6924) ERROR 10-07 12:58:38 multiproc_worker_utils.py:231] callable()
(VllmWorkerProcess pid=6924) ERROR 10-07 12:58:38 multiproc_worker_utils.py:231] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/random.py", line 124, in cb
(VllmWorkerProcess pid=6924) ERROR 10-07 12:58:38 multiproc_worker_utils.py:231] default_generator = torch.cuda.default_generators[i]
(VllmWorkerProcess pid=6924) ERROR 10-07 12:58:38 multiproc_worker_utils.py:231] IndexError: tuple index out of range
(VllmWorkerProcess pid=6924) ERROR 10-07 12:58:38 multiproc_worker_utils.py:231]
[rank0]: Traceback (most recent call last):
[rank0]: File "/opt/conda/envs/py_3.10/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank0]: return _run_code(code, main_globals, None,
[rank0]: File "/opt/conda/envs/py_3.10/lib/python3.10/runpy.py", line 86, in _run_code
[rank0]: exec(code, run_globals)
[rank0]: File "/workflow/vllm/vllm/entrypoints/openai/api_server.py", line 581, in <module>
[rank0]: uvloop.run(run_server(args))
[rank0]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/uvloop/__init__.py", line 82, in run
[rank0]: return loop.run_until_complete(wrapper())
[rank0]: File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
[rank0]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/uvloop/__init__.py", line 61, in wrapper
[rank0]: return await main
[rank0]: File "/workflow/vllm/vllm/entrypoints/openai/api_server.py", line 548, in run_server
[rank0]: async with build_async_engine_client(args) as engine_client:
[rank0]: File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 199, in __aenter__
[rank0]: return await anext(self.gen)
[rank0]: File "/workflow/vllm/vllm/entrypoints/openai/api_server.py", line 106, in build_async_engine_client
[rank0]: async with build_async_engine_client_from_engine_args(
[rank0]: File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 199, in __aenter__
[rank0]: return await anext(self.gen)
[rank0]: File "/workflow/vllm/vllm/entrypoints/openai/api_server.py", line 140, in build_async_engine_client_from_engine_args
[rank0]: engine_client = await asyncio.get_running_loop().run_in_executor(
[rank0]: File "/opt/conda/envs/py_3.10/lib/python3.10/concurrent/futures/thread.py", line 58, in run
[rank0]: result = self.fn(*self.args, **self.kwargs)
[rank0]: File "/workflow/vllm/vllm/engine/async_llm_engine.py", line 674, in from_engine_args
[rank0]: engine = cls(
[rank0]: File "/workflow/vllm/vllm/engine/async_llm_engine.py", line 569, in __init__
[rank0]: self.engine = self._engine_class(*args, **kwargs)
[rank0]: File "/workflow/vllm/vllm/engine/async_llm_engine.py", line 265, in __init__
[rank0]: super().__init__(*args, **kwargs)
[rank0]: File "/workflow/vllm/vllm/engine/llm_engine.py", line 335, in __init__
[rank0]: self.model_executor = executor_class(
[rank0]: File "/workflow/vllm/vllm/executor/multiproc_gpu_executor.py", line 215, in __init__
[rank0]: super().__init__(*args, **kwargs)
[rank0]: File "/workflow/vllm/vllm/executor/distributed_gpu_executor.py", line 26, in __init__
[rank0]: super().__init__(*args, **kwargs)
[rank0]: File "/workflow/vllm/vllm/executor/executor_base.py", line 47, in __init__
[rank0]: self._init_executor()
[rank0]: File "/workflow/vllm/vllm/executor/multiproc_gpu_executor.py", line 110, in _init_executor
[rank0]: self._run_workers("init_device")
[rank0]: File "/workflow/vllm/vllm/executor/multiproc_gpu_executor.py", line 196, in _run_workers
[rank0]: ] + [output.get() for output in worker_outputs]
[rank0]: File "/workflow/vllm/vllm/executor/multiproc_gpu_executor.py", line 196, in <listcomp>
[rank0]: ] + [output.get() for output in worker_outputs]
[rank0]: File "/workflow/vllm/vllm/executor/multiproc_worker_utils.py", line 55, in get
[rank0]: raise self.result.exception
[rank0]: IndexError: tuple index out of range
root@ENC1-CLS01-SVR13:/workflow/vllm# /opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
```
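For reference, the worker fails at `torch.cuda.default_generators[i]` inside `torch.cuda.manual_seed_all`, which raises `IndexError` when the spawned worker's generator tuple has fewer entries than the number of devices it tries to seed, typically because GPU initialization in the child process did not pick up the expected devices. Below is a minimal, hypothetical standalone check (not part of vLLM; it only assumes PyTorch with ROCm support is installed) of what a `spawn`-started subprocess actually sees:

```python
# Hypothetical check (not part of vLLM): see what a `spawn`ed child process
# observes, and try the same call that fails in the worker traceback above.
import multiprocessing as mp
import os

def report() -> None:
    import torch  # imported inside the child, as a spawned worker would

    print("ROCR_VISIBLE_DEVICES:", os.environ.get("ROCR_VISIBLE_DEVICES"))
    print("torch.cuda.device_count():", torch.cuda.device_count())
    try:
        torch.cuda.init()              # force HIP/CUDA initialization
        torch.cuda.manual_seed_all(0)  # the call that raises in the log
        print("manual_seed_all OK; default_generators:",
              len(torch.cuda.default_generators))
    except Exception as exc:
        print(f"reproduced failure: {type(exc).__name__}: {exc}")

if __name__ == "__main__":
    # vLLM falls back from `fork` to `spawn` on ROCm (see the warning at the
    # top of the log), so use the same start method here.
    ctx = mp.get_context("spawn")
    proc = ctx.Process(target=report)
    proc.start()
    proc.join()
```

If the child reproduces the `IndexError` while the same calls succeed in the parent process, that would point at device visibility not propagating to the spawned worker rather than at the tensor-parallel configuration itself.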
Your current environment
How would you like to use vllm
I am currently working on creating benchmarking metrics using 8x MI300X GPUs and have encountered a couple of issues while using your framework. I wanted to bring these to your attention and seek your guidance.
Issue 1: Low Number of Successful Requests During Benchmarking
Configuration used to run the server:
Benchmarking configuration used:
Benchmarking output:
Out of 4,096 requests, only 979 were successful. I wonder whether this low success rate is due to the model's inability to respond to all requests or to an issue with the vLLM inference server. Could you please advise on potential causes or configurations that might improve the success rate?
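One way to narrow this down is to tally failures on the client side by exception type, so that local errors (e.g. `OSError` with errno 24) are distinguishable from HTTP errors returned by the server. Here is a minimal, hypothetical sketch using only the standard library; the endpoint path, request payload, and concurrency level are assumptions for illustration, not taken from the benchmark configuration above:

```python
# Hypothetical client-side tally: bucket request failures by exception type.
import collections
import concurrent.futures
import json
import urllib.request

URL = "http://localhost:8000/v1/completions"  # port 8000 per the server args above
PAYLOAD = json.dumps({
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "prompt": "Hello",
    "max_tokens": 8,
}).encode()

def one_request(_: int) -> str:
    req = urllib.request.Request(
        URL, data=PAYLOAD, headers={"Content-Type": "application/json"})
    try:
        with urllib.request.urlopen(req, timeout=300) as resp:
            return f"HTTP {resp.status}"
    except Exception as exc:
        # An OSError here with errno 24 would indicate the open-files limit.
        return f"{type(exc).__name__} (errno={getattr(exc, 'errno', None)})"

with concurrent.futures.ThreadPoolExecutor(max_workers=256) as pool:
    print(collections.Counter(pool.map(one_request, range(1024))))
```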
Issue 2: Tensor Parallel Sizes Other Than 1 and 8 Not Working
When using `tensor-parallel-size=1` or `tensor-parallel-size=8`, the server operates as expected.

With tensor-parallel-size=8:
GPU Utilization (`rocm-smi --showuse`):

With tensor-parallel-size=1:
GPU Utilization (`rocm-smi --showuse`):
However, when attempting to use tensor-parallel-size=2, 4, or 6, the server fails to operate properly.
Is there a specific reason why only tensor parallel sizes of 1 and 8 are functioning? Are additional configurations required for other tensor parallel sizes?