jedi0605 opened 1 day ago
Can you show the full stack trace? I think having CUDA 12.4 installed shouldn't be an issue as CUDA is backwards-compatible, and vLLM is compiled with CUDA 12.1 on PyPI. cc @youkaichao
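For reference, here is a minimal sketch (assuming a stock PyTorch install, nothing vLLM-specific) to check which CUDA version the wheel was built against versus what your environment provides:

# Sketch: compare the CUDA version torch/vLLM was built with against the runtime environment.
import torch

print("torch built with CUDA:", torch.version.cuda)   # the PyPI vLLM wheels are currently built against 12.1
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
    print("compute capability:", torch.cuda.get_device_capability(0))

A newer driver (such as one shipped alongside CUDA 12.4) can run binaries built against 12.1, so a version mismatch here alone shouldn't be fatal.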
Here is my full stack trace
(myenv) root@aitest2-6d68f7d84b-z6lm8:~/aitest# python -m vllm.entrypoints.openai.api_server --model ~/aitest/models/Qwen2-7B-Instruct --dtype auto --api-key 123456
[infxGPU Msg(159181:139717300419456:libvgpu.c:872)]: Initializing...
[infxGPU Msg(159181:139717300419456:hook.c:400)]: loaded nvml libraries
[infxGPU Msg(159181:139717300419456:hook.c:408)]: initial_virtual_map
/root/miniconda3/envs/myenv/lib/python3.10/site-packages/_distutils_hack/__init__.py:54: UserWarning: Reliance on distutils from stdlib is deprecated. Users must rely on setuptools to provide the distutils module. Avoid importing distutils or import setuptools first, and avoid setting SETUPTOOLS_USE_DISTUTILS=stdlib. Register concerns at https://github.com/pypa/setuptools/issues/new?template=distutils-deprecation.yml
warnings.warn(
INFO 11-01 01:43:02 api_server.py:528] vLLM API server version 0.6.3.post1
INFO 11-01 01:43:02 api_server.py:529] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key='123456', lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/root/aitest/models/Qwen2-7B-Instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, scheduling_policy='fcfs', disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False)
INFO 11-01 01:43:02 api_server.py:166] Multiprocessing frontend to use ipc:///tmp/5139324d-4644-4dff-8f2a-6d5afd2538ae for IPC Path.
INFO 11-01 01:43:02 api_server.py:179] Started engine process with PID 159258
[infxGPU Msg(159258:139690778456576:libvgpu.c:872)]: Initializing...
[infxGPU Msg(159258:139690778456576:hook.c:400)]: loaded nvml libraries
[infxGPU Msg(159258:139690778456576:hook.c:408)]: initial_virtual_map
/root/miniconda3/envs/myenv/lib/python3.10/site-packages/_distutils_hack/__init__.py:54: UserWarning: Reliance on distutils from stdlib is deprecated. Users must rely on setuptools to provide the distutils module. Avoid importing distutils or import setuptools first, and avoid setting SETUPTOOLS_USE_DISTUTILS=stdlib. Register concerns at https://github.com/pypa/setuptools/issues/new?template=distutils-deprecation.yml
warnings.warn(
WARNING 11-01 01:43:06 arg_utils.py:1019] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information.
WARNING 11-01 01:43:09 arg_utils.py:1019] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information.
INFO 11-01 01:43:09 llm_engine.py:237] Initializing an LLM engine (v0.6.3.post1) with config: model='/root/aitest/models/Qwen2-7B-Instruct', speculative_config=None, tokenizer='/root/aitest/models/Qwen2-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/root/aitest/models/Qwen2-7B-Instruct, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=True, mm_processor_kwargs=None)
Process SpawnProcess-1:
Traceback (most recent call last):
File "/root/miniconda3/envs/myenv/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/root/miniconda3/envs/myenv/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/root/miniconda3/envs/myenv/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 390, in run_mp_engine
engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
File "/root/miniconda3/envs/myenv/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 139, in from_engine_args
return cls(
File "/root/miniconda3/envs/myenv/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 78, in __init__
self.engine = LLMEngine(*args, **kwargs)
File "/root/miniconda3/envs/myenv/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 334, in __init__
self.model_executor = executor_class(
File "/root/miniconda3/envs/myenv/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 47, in __init__
self._init_executor()
File "/root/miniconda3/envs/myenv/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 38, in _init_executor
self.driver_worker = self._create_worker()
File "/root/miniconda3/envs/myenv/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 105, in _create_worker
return create_worker(**self._get_create_worker_kwargs(
File "/root/miniconda3/envs/myenv/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 24, in create_worker
wrapper.init_worker(**kwargs)
File "/root/miniconda3/envs/myenv/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 449, in init_worker
self.worker = worker_class(*args, **kwargs)
File "/root/miniconda3/envs/myenv/lib/python3.10/site-packages/vllm/worker/worker.py", line 99, in __init__
self.model_runner: GPUModelRunnerBase = ModelRunnerClass(
File "/root/miniconda3/envs/myenv/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1012, in __init__
self.attn_backend = get_attn_backend(
File "/root/miniconda3/envs/myenv/lib/python3.10/site-packages/vllm/attention/selector.py", line 108, in get_attn_backend
backend = which_attn_to_use(head_size, sliding_window, dtype,
File "/root/miniconda3/envs/myenv/lib/python3.10/site-packages/vllm/attention/selector.py", line 222, in which_attn_to_use
if not current_platform.has_device_capability(80):
File "/root/miniconda3/envs/myenv/lib/python3.10/site-packages/vllm/platforms/interface.py", line 77, in has_device_capability
current_capability = cls.get_device_capability(device_id=device_id)
File "/root/miniconda3/envs/myenv/lib/python3.10/site-packages/vllm/platforms/cuda.py", line 109, in get_device_capability
major, minor = get_physical_device_capability(physical_device_id)
File "/root/miniconda3/envs/myenv/lib/python3.10/site-packages/vllm/platforms/cuda.py", line 41, in wrapper
return fn(*args, **kwargs)
File "/root/miniconda3/envs/myenv/lib/python3.10/site-packages/vllm/platforms/cuda.py", line 52, in get_physical_device_capability
return pynvml.nvmlDeviceGetCudaComputeCapability(handle)
File "/root/miniconda3/envs/myenv/lib/python3.10/site-packages/pynvml.py", line 2956, in nvmlDeviceGetCudaComputeCapability
_nvmlCheckReturn(ret)
File "/root/miniconda3/envs/myenv/lib/python3.10/site-packages/pynvml.py", line 979, in _nvmlCheckReturn
raise NVMLError(ret)
pynvml.NVMLError_InvalidArgument: Invalid Argument
Traceback (most recent call last):
File "/root/miniconda3/envs/myenv/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/root/miniconda3/envs/myenv/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/root/miniconda3/envs/myenv/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 585, in <module>
uvloop.run(run_server(args))
File "/root/miniconda3/envs/myenv/lib/python3.10/site-packages/uvloop/__init__.py", line 82, in run
return loop.run_until_complete(wrapper())
File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
File "/root/miniconda3/envs/myenv/lib/python3.10/site-packages/uvloop/__init__.py", line 61, in wrapper
return await main
File "/root/miniconda3/envs/myenv/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 552, in run_server
async with build_async_engine_client(args) as engine_client:
File "/root/miniconda3/envs/myenv/lib/python3.10/contextlib.py", line 199, in __aenter__
return await anext(self.gen)
File "/root/miniconda3/envs/myenv/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 107, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
File "/root/miniconda3/envs/myenv/lib/python3.10/contextlib.py", line 199, in __aenter__
return await anext(self.gen)
File "/root/miniconda3/envs/myenv/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 194, in build_async_engine_client_from_engine_args
raise RuntimeError(
RuntimeError: Engine process failed to start
Can you try updating your pynvml library?
@DarkLight1337 Thanks for your reply. After I upgraded pynvml, I got the error below:
root@aitest2-754d69d5f6-shw6p:/workspace# pip install --upgrade pynvml
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Requirement already satisfied: pynvml in /usr/local/lib/python3.10/dist-packages (11.4.1)
Collecting pynvml
Downloading pynvml-11.5.3-py3-none-any.whl.metadata (8.8 kB)
Downloading pynvml-11.5.3-py3-none-any.whl (53 kB)
Installing collected packages: pynvml
Attempting uninstall: pynvml
Found existing installation: pynvml 11.4.1
Uninstalling pynvml-11.4.1:
Successfully uninstalled pynvml-11.4.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
dask-cuda 23.8.0 requires pynvml<11.5,>=11.0.0, but you have pynvml 11.5.3 which is incompatible.
Successfully installed pynvml-11.5.3
I also followed the WARNING: removed pynvml and installed nvidia-ml-py. But it still doesn't work.
root@aitest2-754d69d5f6-shw6p:/workspace# vllm serve NousResearch/Meta-Llama-3-8B-Instruct --dtype auto --api-key token-abc123
[infxGPU Msg(3049:140195243586560:libvgpu.c:872)]: Initializing...
[infxGPU Msg(3049:140195243586560:hook.c:400)]: loaded nvml libraries
[infxGPU Msg(3049:140195243586560:hook.c:408)]: initial_virtual_map
WARNING 11-01 15:15:42 cuda.py:22] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
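For reference, a quick sketch (assuming Python 3.8+) to check which NVML bindings are installed and which module is actually being imported:

# Sketch: list the installed NVML bindings and the file the pynvml module resolves to.
from importlib import metadata
import pynvml

for pkg in ("pynvml", "nvidia-ml-py"):
    try:
        print(pkg, metadata.version(pkg))
    except metadata.PackageNotFoundError:
        print(pkg, "not installed")

# Both distributions ship a module named pynvml; this shows which file is actually imported.
print("pynvml module loaded from:", pynvml.__file__)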
Here is the newest full stack trace:
root@aitest2-754d69d5f6-shw6p:/workspace# vllm serve NousResearch/Meta-Llama-3-8B-Instruct --dtype auto --api-key token-abc123
[infxGPU Msg(3345:140509310895104:libvgpu.c:872)]: Initializing...
[infxGPU Msg(3345:140509310895104:hook.c:400)]: loaded nvml libraries
[infxGPU Msg(3345:140509310895104:hook.c:408)]: initial_virtual_map
Warning: Your installation of OpenCV appears to be broken: module 'cv2.dnn' has no attribute 'DictValue'.Please follow the instructions at https://github.com/opencv/opencv-python/issues/884 to correct your environment. The import of cv2 has been skipped.
INFO 11-01 15:17:14 api_server.py:528] vLLM API server version 0.6.3.post1
INFO 11-01 15:17:14 api_server.py:529] args: Namespace(subparser='serve', model_tag='NousResearch/Meta-Llama-3-8B-Instruct', config='', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key='token-abc123', lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='NousResearch/Meta-Llama-3-8B-Instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, scheduling_policy='fcfs', disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, dispatch_function=<function serve at 0x7fc9ae0f5ab0>)
INFO 11-01 15:17:14 api_server.py:166] Multiprocessing frontend to use ipc:///tmp/680acc58-8fea-4bd0-9e05-e44943778259 for IPC Path.
INFO 11-01 15:17:14 api_server.py:179] Started engine process with PID 3403
[infxGPU Msg(3403:140699640501248:libvgpu.c:872)]: Initializing...
[infxGPU Msg(3403:140699640501248:hook.c:400)]: loaded nvml libraries
[infxGPU Msg(3403:140699640501248:hook.c:408)]: initial_virtual_map
Warning: Your installation of OpenCV appears to be broken: module 'cv2.dnn' has no attribute 'DictValue'.Please follow the instructions at https://github.com/opencv/opencv-python/issues/884 to correct your environment. The import of cv2 has been skipped.
WARNING 11-01 15:17:21 arg_utils.py:1019] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information.
WARNING 11-01 15:17:22 arg_utils.py:1019] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information.
INFO 11-01 15:17:22 llm_engine.py:237] Initializing an LLM engine (v0.6.3.post1) with config: model='NousResearch/Meta-Llama-3-8B-Instruct', speculative_config=None, tokenizer='NousResearch/Meta-Llama-3-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=NousResearch/Meta-Llama-3-8B-Instruct, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=True, mm_processor_kwargs=None)
Process SpawnProcess-1:
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/engine.py", line 390, in run_mp_engine
engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/engine.py", line 139, in from_engine_args
return cls(
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/engine.py", line 78, in __init__
self.engine = LLMEngine(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 334, in __init__
self.model_executor = executor_class(
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 47, in __init__
self._init_executor()
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 38, in _init_executor
self.driver_worker = self._create_worker()
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 105, in _create_worker
return create_worker(**self._get_create_worker_kwargs(
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 24, in create_worker
wrapper.init_worker(**kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 449, in init_worker
self.worker = worker_class(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 99, in __init__
self.model_runner: GPUModelRunnerBase = ModelRunnerClass(
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1012, in __init__
self.attn_backend = get_attn_backend(
File "/usr/local/lib/python3.10/dist-packages/vllm/attention/selector.py", line 108, in get_attn_backend
backend = which_attn_to_use(head_size, sliding_window, dtype,
File "/usr/local/lib/python3.10/dist-packages/vllm/attention/selector.py", line 222, in which_attn_to_use
if not current_platform.has_device_capability(80):
File "/usr/local/lib/python3.10/dist-packages/vllm/platforms/interface.py", line 77, in has_device_capability
current_capability = cls.get_device_capability(device_id=device_id)
File "/usr/local/lib/python3.10/dist-packages/vllm/platforms/cuda.py", line 109, in get_device_capability
major, minor = get_physical_device_capability(physical_device_id)
File "/usr/local/lib/python3.10/dist-packages/vllm/platforms/cuda.py", line 41, in wrapper
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/platforms/cuda.py", line 52, in get_physical_device_capability
return pynvml.nvmlDeviceGetCudaComputeCapability(handle)
File "/usr/local/lib/python3.10/dist-packages/pynvml.py", line 2956, in nvmlDeviceGetCudaComputeCapability
_nvmlCheckReturn(ret)
File "/usr/local/lib/python3.10/dist-packages/pynvml.py", line 979, in _nvmlCheckReturn
raise NVMLError(ret)
pynvml.NVMLError_InvalidArgument: Invalid Argument
Traceback (most recent call last):
File "/usr/local/bin/vllm", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/vllm/scripts.py", line 195, in main
args.dispatch_function(args)
File "/usr/local/lib/python3.10/dist-packages/vllm/scripts.py", line 41, in serve
uvloop.run(run_server(args))
File "/usr/local/lib/python3.10/dist-packages/uvloop/__init__.py", line 82, in run
return loop.run_until_complete(wrapper())
File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
File "/usr/local/lib/python3.10/dist-packages/uvloop/__init__.py", line 61, in wrapper
return await main
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 552, in run_server
async with build_async_engine_client(args) as engine_client:
File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
return await anext(self.gen)
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 107, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
return await anext(self.gen)
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 194, in build_async_engine_client_from_engine_args
raise RuntimeError(
RuntimeError: Engine process failed to start
[infxGPU Msg(3345:140509310895104:hook.c:400)]: loaded nvml libraries
What are these lines in the log? You might need to contact your admin.
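Those [infxGPU ...] libvgpu.c / hook.c messages look like a GPU virtualization shim intercepting the NVML calls, which would also explain why nvmlDeviceGetCudaComputeCapability returns Invalid Argument. To take vLLM out of the picture, you could call NVML directly with a minimal sketch like this (assuming nvidia-ml-py is installed):

# Sketch: query NVML directly, outside vLLM, to see whether the intercepting library breaks the call.
import pynvml

pynvml.nvmlInit()
count = pynvml.nvmlDeviceGetCount()
print("visible devices:", count)
for i in range(count):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    print(i, pynvml.nvmlDeviceGetName(handle))
    # This is the exact call that raises NVMLError_InvalidArgument in the trace above.
    print("compute capability:", pynvml.nvmlDeviceGetCudaComputeCapability(handle))
pynvml.nvmlShutdown()

If this fails the same way outside vLLM, the problem lies in the intercepting library or the container environment rather than in vLLM, and your cluster admin is the right contact.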
Your current environment
How you are installing vllm
Here is my nvidia-smi
I'm running "python -m vllm.entrypoints.openai.api_server" and getting a failure message.
Does vLLM support CUDA 12.4, or do I need to downgrade to 12.1?