vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Installation]: pynvml.NVMLError_InvalidArgument: Invalid Argument #9865

Open jedi0605 opened 1 day ago

jedi0605 commented 1 day ago

Your current environment

[infxGPU Msg(52547:140064190512000:libvgpu.c:872)]: Initializing...
Collecting environment information...
[infxGPU Msg(52547:140064190512000:hook.c:400)]: loaded nvml libraries
[infxGPU Msg(52547:140064190512000:hook.c:408)]: initial_virtual_map
/root/miniconda3/envs/myenv/lib/python3.10/site-packages/_distutils_hack/__init__.py:54: UserWarning: Reliance on distutils from stdlib is deprecated. Users must rely on setuptools to provide the distutils module. Avoid importing distutils or import setuptools first, and avoid setting SETUPTOOLS_USE_DISTUTILS=stdlib. Register concerns at https://github.com/pypa/setuptools/issues/new?template=distutils-deprecation.yml
  warnings.warn(
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.27.6
Libc version: glibc-2.35

Python version: 3.10.15 (main, Oct  3 2024, 07:27:34) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-1067-kvm-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.2.140
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA H100 NVL
Nvidia driver version: 550.90.07
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.5
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        40 bits physical, 48 bits virtual
Byte Order:                           Little Endian
CPU(s):                               64
On-line CPU(s) list:                  0-63
Vendor ID:                            AuthenticAMD
Model name:                           AMD EPYC-Milan-v2 Processor
CPU family:                           25
Model:                                1
Thread(s) per core:                   1
Core(s) per socket:                   64
Socket(s):                            1
Stepping:                             1
BogoMIPS:                             5999.99
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr wbnoinvd arat umip pku vaes vpclmulqdq rdpid fsrm
Hypervisor vendor:                    KVM
Virtualization type:                  full
L1d cache:                            2 MiB (64 instances)
L1i cache:                            2 MiB (64 instances)
L2 cache:                             32 MiB (64 instances)
L3 cache:                             32 MiB (1 instance)
NUMA node(s):                         1
NUMA node0 CPU(s):                    0-63
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Not affected
Vulnerability Spec rstack overflow:   Mitigation; safe RET, no microcode
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.6.77
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pyzmq==26.2.0
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.46.1
[pip3] triton==3.0.0
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] nvidia-cublas-cu12        12.1.3.1                 pypi_0    pypi
[conda] nvidia-cuda-cupti-cu12    12.1.105                 pypi_0    pypi
[conda] nvidia-cuda-nvrtc-cu12    12.1.105                 pypi_0    pypi
[conda] nvidia-cuda-runtime-cu12  12.1.105                 pypi_0    pypi
[conda] nvidia-cudnn-cu12         9.1.0.70                 pypi_0    pypi
[conda] nvidia-cufft-cu12         11.0.2.54                pypi_0    pypi
[conda] nvidia-curand-cu12        10.3.2.106               pypi_0    pypi
[conda] nvidia-cusolver-cu12      11.4.5.107               pypi_0    pypi
[conda] nvidia-cusparse-cu12      12.1.0.106               pypi_0    pypi
[conda] nvidia-ml-py              12.560.30                pypi_0    pypi
[conda] nvidia-nccl-cu12          2.20.5                   pypi_0    pypi
[conda] nvidia-nvjitlink-cu12     12.6.77                  pypi_0    pypi
[conda] nvidia-nvtx-cu12          12.1.105                 pypi_0    pypi
[conda] pyzmq                     26.2.0                   pypi_0    pypi
[conda] torch                     2.4.0                    pypi_0    pypi
[conda] torchvision               0.19.0                   pypi_0    pypi
[conda] transformers              4.46.1                   pypi_0    pypi
[conda] triton                    3.0.0                    pypi_0    pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.6.3.post1
vLLM Build Flags:
CUDA Archs: 5.2 6.0 6.1 7.0 7.2 7.5 8.0 8.6 8.7 9.0+PTX; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      0-63    0               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

How you are installing vllm

pip install -vvv vllm

Here is my nvidia-smi output:

(base) root@aitest2-6d68f7d84b-z6lm8:/workspace# nvidia-smi
[infxGPU Msg(51312:139710890829632:hook.c:400)]: loaded nvml libraries
[infxGPU Msg(51312:139710890829632:hook.c:408)]: initial_virtual_map
[infxGPU Msg(51312:139710890829632:libvgpu.c:872)]: Initializing...
[infxGPU Msg(51312:139710890829632:device.c:248)]: driver version=12040
Thu Oct 31 02:58:03 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
[infxGPU Msg(51312:139710890829632:multiprocess_memory_limit.c:102)]: device core util limit set to 0, which means no limit: CUDA_DEVICE_SM_LIMIT=0
|   0  NVIDIA H100 NVL                Off |   00000000:00:02.0 Off |                    0 |
| N/A   41C    P0             65W /  400W |       0MiB /  20000MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
[infxGPU Msg(51312:139710890829632:multiprocess_memory_limit.c:457)]: Calling exit handler 51312

I'm running `python -m vllm.entrypoints.openai.api_server` and it fails with this message:

 File "/root/miniconda3/envs/myenv/lib/python3.10/site-packages/pynvml.py", line 979, in _nvmlCheckReturn
    raise NVMLError(ret)
pynvml.NVMLError_InvalidArgument: Invalid Argument

Does vLLM support CUDA 12.4, or do I need to downgrade to 12.1?


DarkLight1337 commented 1 day ago

Can you show the full stack trace? I think having CUDA 12.4 installed shouldn't be an issue as CUDA is backwards-compatible, and vLLM is compiled with CUDA 12.1 on PyPI. cc @youkaichao
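To narrow it down, you could also try reproducing the failing call outside of vLLM. A minimal sketch (assuming `pynvml` is provided by `nvidia-ml-py`):

```python
# Reproduces the exact NVML call that fails in vllm/platforms/cuda.py.
# If this also raises NVMLError_InvalidArgument, the problem is in the
# NVML layer (or whatever is intercepting it), not in vLLM.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
print(pynvml.nvmlDeviceGetCudaComputeCapability(handle))  # H100 should give (9, 0)
pynvml.nvmlShutdown()
```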

jedi0605 commented 21 hours ago

Here is my full stack trace:

(myenv) root@aitest2-6d68f7d84b-z6lm8:~/aitest# python -m vllm.entrypoints.openai.api_server --model ~/aitest/models/Qwen2-7B-Instruct --dtype auto --api-key 123456
[infxGPU Msg(159181:139717300419456:libvgpu.c:872)]: Initializing...
[infxGPU Msg(159181:139717300419456:hook.c:400)]: loaded nvml libraries
[infxGPU Msg(159181:139717300419456:hook.c:408)]: initial_virtual_map
/root/miniconda3/envs/myenv/lib/python3.10/site-packages/_distutils_hack/__init__.py:54: UserWarning: Reliance on distutils from stdlib is deprecated. Users must rely on setuptools to provide the distutils module. Avoid importing distutils or import setuptools first, and avoid setting SETUPTOOLS_USE_DISTUTILS=stdlib. Register concerns at https://github.com/pypa/setuptools/issues/new?template=distutils-deprecation.yml
  warnings.warn(
INFO 11-01 01:43:02 api_server.py:528] vLLM API server version 0.6.3.post1
INFO 11-01 01:43:02 api_server.py:529] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key='123456', lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/root/aitest/models/Qwen2-7B-Instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, scheduling_policy='fcfs', disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False)
INFO 11-01 01:43:02 api_server.py:166] Multiprocessing frontend to use ipc:///tmp/5139324d-4644-4dff-8f2a-6d5afd2538ae for IPC Path.
INFO 11-01 01:43:02 api_server.py:179] Started engine process with PID 159258
[infxGPU Msg(159258:139690778456576:libvgpu.c:872)]: Initializing...
[infxGPU Msg(159258:139690778456576:hook.c:400)]: loaded nvml libraries
[infxGPU Msg(159258:139690778456576:hook.c:408)]: initial_virtual_map
/root/miniconda3/envs/myenv/lib/python3.10/site-packages/_distutils_hack/__init__.py:54: UserWarning: Reliance on distutils from stdlib is deprecated. Users must rely on setuptools to provide the distutils module. Avoid importing distutils or import setuptools first, and avoid setting SETUPTOOLS_USE_DISTUTILS=stdlib. Register concerns at https://github.com/pypa/setuptools/issues/new?template=distutils-deprecation.yml
  warnings.warn(
WARNING 11-01 01:43:06 arg_utils.py:1019] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information.
WARNING 11-01 01:43:09 arg_utils.py:1019] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information.
INFO 11-01 01:43:09 llm_engine.py:237] Initializing an LLM engine (v0.6.3.post1) with config: model='/root/aitest/models/Qwen2-7B-Instruct', speculative_config=None, tokenizer='/root/aitest/models/Qwen2-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/root/aitest/models/Qwen2-7B-Instruct, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=True, mm_processor_kwargs=None)
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/root/miniconda3/envs/myenv/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/root/miniconda3/envs/myenv/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/root/miniconda3/envs/myenv/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 390, in run_mp_engine
    engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
  File "/root/miniconda3/envs/myenv/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 139, in from_engine_args
    return cls(
  File "/root/miniconda3/envs/myenv/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 78, in __init__
    self.engine = LLMEngine(*args, **kwargs)
  File "/root/miniconda3/envs/myenv/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 334, in __init__
    self.model_executor = executor_class(
  File "/root/miniconda3/envs/myenv/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 47, in __init__
    self._init_executor()
  File "/root/miniconda3/envs/myenv/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 38, in _init_executor
    self.driver_worker = self._create_worker()
  File "/root/miniconda3/envs/myenv/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 105, in _create_worker
    return create_worker(**self._get_create_worker_kwargs(
  File "/root/miniconda3/envs/myenv/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 24, in create_worker
    wrapper.init_worker(**kwargs)
  File "/root/miniconda3/envs/myenv/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 449, in init_worker
    self.worker = worker_class(*args, **kwargs)
  File "/root/miniconda3/envs/myenv/lib/python3.10/site-packages/vllm/worker/worker.py", line 99, in __init__
    self.model_runner: GPUModelRunnerBase = ModelRunnerClass(
  File "/root/miniconda3/envs/myenv/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1012, in __init__
    self.attn_backend = get_attn_backend(
  File "/root/miniconda3/envs/myenv/lib/python3.10/site-packages/vllm/attention/selector.py", line 108, in get_attn_backend
    backend = which_attn_to_use(head_size, sliding_window, dtype,
  File "/root/miniconda3/envs/myenv/lib/python3.10/site-packages/vllm/attention/selector.py", line 222, in which_attn_to_use
    if not current_platform.has_device_capability(80):
  File "/root/miniconda3/envs/myenv/lib/python3.10/site-packages/vllm/platforms/interface.py", line 77, in has_device_capability
    current_capability = cls.get_device_capability(device_id=device_id)
  File "/root/miniconda3/envs/myenv/lib/python3.10/site-packages/vllm/platforms/cuda.py", line 109, in get_device_capability
    major, minor = get_physical_device_capability(physical_device_id)
  File "/root/miniconda3/envs/myenv/lib/python3.10/site-packages/vllm/platforms/cuda.py", line 41, in wrapper
    return fn(*args, **kwargs)
  File "/root/miniconda3/envs/myenv/lib/python3.10/site-packages/vllm/platforms/cuda.py", line 52, in get_physical_device_capability
    return pynvml.nvmlDeviceGetCudaComputeCapability(handle)
  File "/root/miniconda3/envs/myenv/lib/python3.10/site-packages/pynvml.py", line 2956, in nvmlDeviceGetCudaComputeCapability
    _nvmlCheckReturn(ret)
  File "/root/miniconda3/envs/myenv/lib/python3.10/site-packages/pynvml.py", line 979, in _nvmlCheckReturn
    raise NVMLError(ret)
pynvml.NVMLError_InvalidArgument: Invalid Argument
Traceback (most recent call last):
  File "/root/miniconda3/envs/myenv/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/miniconda3/envs/myenv/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/root/miniconda3/envs/myenv/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 585, in <module>
    uvloop.run(run_server(args))
  File "/root/miniconda3/envs/myenv/lib/python3.10/site-packages/uvloop/__init__.py", line 82, in run
    return loop.run_until_complete(wrapper())
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/root/miniconda3/envs/myenv/lib/python3.10/site-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
  File "/root/miniconda3/envs/myenv/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 552, in run_server
    async with build_async_engine_client(args) as engine_client:
  File "/root/miniconda3/envs/myenv/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/root/miniconda3/envs/myenv/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 107, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/root/miniconda3/envs/myenv/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/root/miniconda3/envs/myenv/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 194, in build_async_engine_client_from_engine_args
    raise RuntimeError(
RuntimeError: Engine process failed to start
DarkLight1337 commented 20 hours ago

Can you try updating your pynvml library?

jedi0605 commented 8 hours ago

@DarkLight1337 Thanks for your reply. After I upgraded pynvml, I got the error below:

root@aitest2-754d69d5f6-shw6p:/workspace# pip install --upgrade pynvml
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Requirement already satisfied: pynvml in /usr/local/lib/python3.10/dist-packages (11.4.1)
Collecting pynvml
  Downloading pynvml-11.5.3-py3-none-any.whl.metadata (8.8 kB)
Downloading pynvml-11.5.3-py3-none-any.whl (53 kB)
Installing collected packages: pynvml
  Attempting uninstall: pynvml
    Found existing installation: pynvml 11.4.1
    Uninstalling pynvml-11.4.1:
      Successfully uninstalled pynvml-11.4.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
dask-cuda 23.8.0 requires pynvml<11.5,>=11.0.0, but you have pynvml 11.5.3 which is incompatible.
Successfully installed pynvml-11.5.3

I then followed the WARNING: I uninstalled pynvml and installed nvidia-ml-py. But it still doesn't work:

root@aitest2-754d69d5f6-shw6p:/workspace# vllm serve NousResearch/Meta-Llama-3-8B-Instruct --dtype auto --api-key token-abc123
[infxGPU Msg(3049:140195243586560:libvgpu.c:872)]: Initializing...
[infxGPU Msg(3049:140195243586560:hook.c:400)]: loaded nvml libraries
[infxGPU Msg(3049:140195243586560:hook.c:408)]: initial_virtual_map
WARNING 11-01 15:15:42 cuda.py:22] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
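To confirm which of the two distributions is actually installed, one can check the pip metadata (a minimal sketch using the standard library; a leftover `pynvml` distribution will shadow `nvidia-ml-py`):

```python
# Both distributions install a module named `pynvml`; if the deprecated
# `pynvml` package is still present, it takes precedence and can break NVML calls.
import importlib.metadata as md

for dist in ("pynvml", "nvidia-ml-py"):
    try:
        print(dist, md.version(dist))
    except md.PackageNotFoundError:
        print(dist, "not installed")
```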

Here is the newest full stack trace:

root@aitest2-754d69d5f6-shw6p:/workspace# vllm serve NousResearch/Meta-Llama-3-8B-Instruct --dtype auto --api-key token-abc123
[infxGPU Msg(3345:140509310895104:libvgpu.c:872)]: Initializing...
[infxGPU Msg(3345:140509310895104:hook.c:400)]: loaded nvml libraries
[infxGPU Msg(3345:140509310895104:hook.c:408)]: initial_virtual_map
Warning: Your installation of OpenCV appears to be broken: module 'cv2.dnn' has no attribute 'DictValue'.Please follow the instructions at https://github.com/opencv/opencv-python/issues/884 to correct your environment. The import of cv2 has been skipped.
INFO 11-01 15:17:14 api_server.py:528] vLLM API server version 0.6.3.post1
INFO 11-01 15:17:14 api_server.py:529] args: Namespace(subparser='serve', model_tag='NousResearch/Meta-Llama-3-8B-Instruct', config='', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key='token-abc123', lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='NousResearch/Meta-Llama-3-8B-Instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, scheduling_policy='fcfs', disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, dispatch_function=<function serve at 0x7fc9ae0f5ab0>)
INFO 11-01 15:17:14 api_server.py:166] Multiprocessing frontend to use ipc:///tmp/680acc58-8fea-4bd0-9e05-e44943778259 for IPC Path.
INFO 11-01 15:17:14 api_server.py:179] Started engine process with PID 3403
[infxGPU Msg(3403:140699640501248:libvgpu.c:872)]: Initializing...
[infxGPU Msg(3403:140699640501248:hook.c:400)]: loaded nvml libraries
[infxGPU Msg(3403:140699640501248:hook.c:408)]: initial_virtual_map
Warning: Your installation of OpenCV appears to be broken: module 'cv2.dnn' has no attribute 'DictValue'.Please follow the instructions at https://github.com/opencv/opencv-python/issues/884 to correct your environment. The import of cv2 has been skipped.
WARNING 11-01 15:17:21 arg_utils.py:1019] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information.
WARNING 11-01 15:17:22 arg_utils.py:1019] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information.
INFO 11-01 15:17:22 llm_engine.py:237] Initializing an LLM engine (v0.6.3.post1) with config: model='NousResearch/Meta-Llama-3-8B-Instruct', speculative_config=None, tokenizer='NousResearch/Meta-Llama-3-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=NousResearch/Meta-Llama-3-8B-Instruct, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=True, mm_processor_kwargs=None)
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/engine.py", line 390, in run_mp_engine
    engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/engine.py", line 139, in from_engine_args
    return cls(
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/engine.py", line 78, in __init__
    self.engine = LLMEngine(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 334, in __init__
    self.model_executor = executor_class(
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 47, in __init__
    self._init_executor()
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 38, in _init_executor
    self.driver_worker = self._create_worker()
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 105, in _create_worker
    return create_worker(**self._get_create_worker_kwargs(
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 24, in create_worker
    wrapper.init_worker(**kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 449, in init_worker
    self.worker = worker_class(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 99, in __init__
    self.model_runner: GPUModelRunnerBase = ModelRunnerClass(
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1012, in __init__
    self.attn_backend = get_attn_backend(
  File "/usr/local/lib/python3.10/dist-packages/vllm/attention/selector.py", line 108, in get_attn_backend
    backend = which_attn_to_use(head_size, sliding_window, dtype,
  File "/usr/local/lib/python3.10/dist-packages/vllm/attention/selector.py", line 222, in which_attn_to_use
    if not current_platform.has_device_capability(80):
  File "/usr/local/lib/python3.10/dist-packages/vllm/platforms/interface.py", line 77, in has_device_capability
    current_capability = cls.get_device_capability(device_id=device_id)
  File "/usr/local/lib/python3.10/dist-packages/vllm/platforms/cuda.py", line 109, in get_device_capability
    major, minor = get_physical_device_capability(physical_device_id)
  File "/usr/local/lib/python3.10/dist-packages/vllm/platforms/cuda.py", line 41, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/platforms/cuda.py", line 52, in get_physical_device_capability
    return pynvml.nvmlDeviceGetCudaComputeCapability(handle)
  File "/usr/local/lib/python3.10/dist-packages/pynvml.py", line 2956, in nvmlDeviceGetCudaComputeCapability
    _nvmlCheckReturn(ret)
  File "/usr/local/lib/python3.10/dist-packages/pynvml.py", line 979, in _nvmlCheckReturn
    raise NVMLError(ret)
pynvml.NVMLError_InvalidArgument: Invalid Argument
Traceback (most recent call last):
  File "/usr/local/bin/vllm", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/vllm/scripts.py", line 195, in main
    args.dispatch_function(args)
  File "/usr/local/lib/python3.10/dist-packages/vllm/scripts.py", line 41, in serve
    uvloop.run(run_server(args))
  File "/usr/local/lib/python3.10/dist-packages/uvloop/__init__.py", line 82, in run
    return loop.run_until_complete(wrapper())
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/usr/local/lib/python3.10/dist-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 552, in run_server
    async with build_async_engine_client(args) as engine_client:
  File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 107, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 194, in build_async_engine_client_from_engine_args
    raise RuntimeError(
RuntimeError: Engine process failed to start
youkaichao commented 4 hours ago

[infxGPU Msg(3345:140509310895104:hook.c:400)]: loaded nvml libraries

What are these lines in the log? You might need to contact your admin.
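The messages come from libvgpu.c/hook.c rather than from NVIDIA's own libraries, so they look like a vGPU interception layer sitting between the process and NVML. One way to see which NVML library is actually mapped into the process (a Linux-only sketch reading /proc/self/maps):

```python
# List NVML-related shared objects mapped into this process. If a
# vGPU shim (e.g. a libvgpu .so, name assumed here) shows up instead
# of, or alongside, libnvidia-ml.so, NVML calls are being intercepted.
import pynvml

pynvml.nvmlInit()  # force NVML to load
paths = set()
with open("/proc/self/maps") as maps:
    for line in maps:
        path = line.split()[-1]
        if "nvidia-ml" in path or "vgpu" in path.lower():
            paths.add(path)
print("\n".join(sorted(paths)) or "no NVML library mapped")
```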