vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: After updating from 0.6.2 to 0.6.3, INT8 (W8A8) models cannot be loaded at all: "No compiled cutlass_scaled_mm for a compute capability less than CUDA device capability: 75" #9419

Open HelloCard opened 3 days ago

HelloCard commented 3 days ago

Your current environment

The output of `python collect_env.py`:

```text
(base) root@DESKTOP-PEPA2G9:~# python collect_env.py
Collecting environment information...
/root/miniconda3/lib/python3.12/site-packages/vllm/connections.py:8: RuntimeWarning: Failed to read commit hash:
No module named 'vllm._version'
  from vllm.version import __version__ as VLLM_VERSION
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.2 LTS (x86_64)
GCC version: Could not collect
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.35

Python version: 3.12.4 | packaged by Anaconda, Inc. | (main, Jun 18 2024, 15:12:24) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 2080 Ti
GPU 1: NVIDIA GeForce RTX 2080 Ti

Nvidia driver version: 531.79
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 39 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Vendor ID: GenuineIntel
Model name: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
CPU family: 6
Model: 94
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
Stepping: 3
BogoMIPS: 8015.99
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid pni pclmulqdq ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti ssbd ibrs ibpb stibp fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 xsaves flush_l1d arch_capabilities
Hypervisor vendor: Microsoft
Virtualization type: full
L1d cache: 128 KiB (4 instances)
L1i cache: 128 KiB (4 instances)
L2 cache: 1 MiB (4 instances)
L3 cache: 8 MiB (1 instance)
Vulnerability Gather data sampling: Unknown: Dependent on hypervisor status
Vulnerability Itlb multihit: KVM: Mitigation: VMX unsupported
Vulnerability L1tf: Mitigation; PTE Inversion
Vulnerability Mds: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Meltdown: Mitigation; PTI
Vulnerability Mmio stale data: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Retbleed: Mitigation; IBRS
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; IBRS, IBPB conditional, STIBP conditional, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Unknown: Dependent on hypervisor status
Vulnerability Tsx async abort: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.6.77
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pyzmq==26.2.0
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.45.2
[pip3] triton==3.0.0
[conda] numpy 1.26.4 pypi_0 pypi
[conda] nvidia-cublas-cu12 12.1.3.1 pypi_0 pypi
[conda] nvidia-cuda-cupti-cu12 12.1.105 pypi_0 pypi
[conda] nvidia-cuda-nvrtc-cu12 12.1.105 pypi_0 pypi
[conda] nvidia-cuda-runtime-cu12 12.1.105 pypi_0 pypi
[conda] nvidia-cudnn-cu12 9.1.0.70 pypi_0 pypi
[conda] nvidia-cufft-cu12 11.0.2.54 pypi_0 pypi
[conda] nvidia-curand-cu12 10.3.2.106 pypi_0 pypi
[conda] nvidia-cusolver-cu12 11.4.5.107 pypi_0 pypi
[conda] nvidia-cusparse-cu12 12.1.0.106 pypi_0 pypi
[conda] nvidia-ml-py 12.560.30 pypi_0 pypi
[conda] nvidia-nccl-cu12 2.20.5 pypi_0 pypi
[conda] nvidia-nvjitlink-cu12 12.6.77 pypi_0 pypi
[conda] nvidia-nvtx-cu12 12.1.105 pypi_0 pypi
[conda] pyzmq 26.2.0 pypi_0 pypi
[conda] torch 2.4.0 pypi_0 pypi
[conda] torchvision 0.19.0 pypi_0 pypi
[conda] transformers 4.45.2 pypi_0 pypi
[conda] triton 3.0.0 pypi_0 pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: N/A (dev)
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
        GPU0    GPU1    CPU Affinity    NUMA Affinity
GPU0     X      NV2
GPU1    NV2      X
Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
```

Model Input Dumps

err_execute_model_input_20241016-194444.zip

🐛 Describe the bug

As the title says, after upgrading from 0.6.2 to 0.6.3 I can no longer load W8A8-format models in my WSL2 environment:

```text
(base) root@DESKTOP-PEPA2G9:/mnt/c/Windows/system32# python3 -m vllm.entrypoints.openai.api_server --model /mnt/e/Code/models/Orca-2-13b-W8A8 --max-model-len 4096 --tensor-parallel-size 2 --gpu-memory-utilization 0.73 --dtype=half
/root/miniconda3/lib/python3.12/site-packages/vllm/connections.py:8: RuntimeWarning: Failed to read commit hash:
No module named 'vllm._version'
  from vllm.version import __version__ as VLLM_VERSION
INFO 10-16 19:41:34 api_server.py:528] vLLM API server version dev
INFO 10-16 19:41:34 api_server.py:529] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/mnt/e/Code/models/Orca-2-13b-W8A8', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', config_format='auto', dtype='half', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=4096, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.73, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, scheduling_policy='fcfs', disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False)
INFO 10-16 19:41:34 api_server.py:166] Multiprocessing frontend to use ipc:///tmp/064ede48-6162-414f-aa8a-4e311be77be2 for IPC Path.
INFO 10-16 19:41:34 api_server.py:179] Started engine process with PID 3661
WARNING 10-16 19:41:34 config.py:1674] Casting torch.bfloat16 to torch.float16.
/root/miniconda3/lib/python3.12/site-packages/vllm/connections.py:8: RuntimeWarning: Failed to read commit hash:
No module named 'vllm._version'
  from vllm.version import __version__ as VLLM_VERSION
WARNING 10-16 19:41:37 config.py:1674] Casting torch.bfloat16 to torch.float16.
INFO 10-16 19:41:38 config.py:887] Defaulting to use mp for distributed inference
INFO 10-16 19:41:41 config.py:887] Defaulting to use mp for distributed inference
INFO 10-16 19:41:41 llm_engine.py:237] Initializing an LLM engine (vdev) with config: model='/mnt/e/Code/models/Orca-2-13b-W8A8', speculative_config=None, tokenizer='/mnt/e/Code/models/Orca-2-13b-W8A8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/mnt/e/Code/models/Orca-2-13b-W8A8, use_v2_block_manager=True, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=True, mm_processor_kwargs=None)
WARNING 10-16 19:41:41 multiproc_gpu_executor.py:53] Reducing Torch parallelism from 4 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 10-16 19:41:41 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
WARNING 10-16 19:41:41 utils.py:772] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
(VllmWorkerProcess pid=3737) WARNING 10-16 19:41:41 utils.py:772] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
INFO 10-16 19:41:41 selector.py:224] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 10-16 19:41:41 selector.py:115] Using XFormers backend.
(VllmWorkerProcess pid=3737) INFO 10-16 19:41:41 selector.py:224] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
(VllmWorkerProcess pid=3737) INFO 10-16 19:41:41 selector.py:115] Using XFormers backend.
/root/miniconda3/lib/python3.12/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_fwd")
(VllmWorkerProcess pid=3737) /root/miniconda3/lib/python3.12/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
(VllmWorkerProcess pid=3737)   @torch.library.impl_abstract("xformers_flash::flash_fwd")
/root/miniconda3/lib/python3.12/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_bwd")
(VllmWorkerProcess pid=3737) /root/miniconda3/lib/python3.12/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
(VllmWorkerProcess pid=3737)   @torch.library.impl_abstract("xformers_flash::flash_bwd")
(VllmWorkerProcess pid=3737) INFO 10-16 19:41:43 multiproc_worker_utils.py:216] Worker ready; awaiting tasks
INFO 10-16 19:41:45 utils.py:1008] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=3737) INFO 10-16 19:41:45 utils.py:1008] Found nccl from library libnccl.so.2
INFO 10-16 19:41:45 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=3737) INFO 10-16 19:41:45 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 10-16 19:41:46 custom_all_reduce_utils.py:204] generating GPU P2P access cache in /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
INFO 10-16 19:41:53 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
(VllmWorkerProcess pid=3737) INFO 10-16 19:41:53 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
INFO 10-16 19:41:53 shm_broadcast.py:241] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1], buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at 0x7f88618d4110>, local_subscribe_port=46945, remote_subscribe_port=None)
INFO 10-16 19:41:53 model_runner.py:1060] Starting to load model /mnt/e/Code/models/Orca-2-13b-W8A8...
(VllmWorkerProcess pid=3737) INFO 10-16 19:41:53 model_runner.py:1060] Starting to load model /mnt/e/Code/models/Orca-2-13b-W8A8...
(VllmWorkerProcess pid=3737) INFO 10-16 19:41:53 selector.py:224] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 10-16 19:41:53 selector.py:224] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
(VllmWorkerProcess pid=3737) INFO 10-16 19:41:53 selector.py:115] Using XFormers backend.
INFO 10-16 19:41:53 selector.py:115] Using XFormers backend.
Loading safetensors checkpoint shards:   0% Completed | 0/3 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  33% Completed | 1/3 [01:02<02:04, 62.27s/it]
Loading safetensors checkpoint shards:  67% Completed | 2/3 [02:05<01:02, 62.77s/it]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [02:49<00:00, 54.15s/it]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [02:49<00:00, 56.43s/it]

INFO 10-16 19:44:43 model_runner.py:1071] Loading model weights took 6.2693 GB
(VllmWorkerProcess pid=3737) INFO 10-16 19:44:44 model_runner.py:1071] Loading model weights took 6.2693 GB
ERROR 10-16 19:44:44 _custom_ops.py:53] Error in calling custom op cutlass_scaled_mm: No compiled cutlass_scaled_mm for a compute capability less than CUDA device capability: 75
ERROR 10-16 19:44:44 _custom_ops.py:53] Not implemented or built, mostly likely because the current current device does not support this kernel (less likely TORCH_CUDA_ARCH_LIST was set incorrectly while building)
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 _custom_ops.py:53] Error in calling custom op cutlass_scaled_mm: No compiled cutlass_scaled_mm for a compute capability less than CUDA device capability: 75
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 _custom_ops.py:53] Not implemented or built, mostly likely because the current current device does not support this kernel (less likely TORCH_CUDA_ARCH_LIST was set incorrectly while building)
INFO 10-16 19:44:44 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20241016-194444.pkl...
(VllmWorkerProcess pid=3737) INFO 10-16 19:44:44 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20241016-194444.pkl...
(VllmWorkerProcess pid=3737) INFO 10-16 19:44:44 model_runner_base.py:149] Completed writing input of failed execution to /tmp/err_execute_model_input_20241016-194444.pkl.
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231] Exception in worker VllmWorkerProcess while processing method determine_num_available_blocks: Error in model execution (input dumped to /tmp/err_execute_model_input_20241016-194444.pkl): Error in calling custom op cutlass_scaled_mm: No compiled cutlass_scaled_mm for a compute capability less than CUDA device capability: 75
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231] Not implemented or built, mostly likely because the current current device does not support this kernel (less likely TORCH_CUDA_ARCH_LIST was set incorrectly while building), Traceback (most recent call last):
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]   File "/root/miniconda3/lib/python3.12/site-packages/vllm/_custom_ops.py", line 45, in wrapper
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]     return fn(*args, **kwargs)
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]            ^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]   File "/root/miniconda3/lib/python3.12/site-packages/vllm/_custom_ops.py", line 512, in cutlass_scaled_mm
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]     torch.ops._C.cutlass_scaled_mm(out, a, b, scale_a, scale_b, bias)
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]   File "/root/miniconda3/lib/python3.12/site-packages/torch/_ops.py", line 1061, in __call__
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]     return self_._op(*args, **(kwargs or {}))
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231] NotImplementedError: No compiled cutlass_scaled_mm for a compute capability less than CUDA device capability: 75
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231] The above exception was the direct cause of the following exception:
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231] Traceback (most recent call last):
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]   File "/root/miniconda3/lib/python3.12/site-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]     return func(*args, **kwargs)
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]   File "/root/miniconda3/lib/python3.12/site-packages/vllm/worker/model_runner.py", line 1665, in execute_model
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]     hidden_or_intermediate_states = model_executable(
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]                                     ^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]   File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]   File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]   File "/root/miniconda3/lib/python3.12/site-packages/vllm/model_executor/models/llama.py", line 556, in forward
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]     model_output = self.model(input_ids, positions, kv_caches,
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]   File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]   File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]   File "/root/miniconda3/lib/python3.12/site-packages/vllm/model_executor/models/llama.py", line 345, in forward
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]     hidden_states, residual = layer(positions, hidden_states,
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]   File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]   File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]   File "/root/miniconda3/lib/python3.12/site-packages/vllm/model_executor/models/llama.py", line 257, in forward
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]     hidden_states = self.self_attn(positions=positions,
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]   File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]   File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
INFO 10-16 19:44:44 model_runner_base.py:149] Completed writing input of failed execution to /tmp/err_execute_model_input_20241016-194444.pkl.
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]   File "/root/miniconda3/lib/python3.12/site-packages/vllm/model_executor/models/llama.py", line 184, in forward
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]     qkv, _ = self.qkv_proj(hidden_states)
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]   File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]   File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]   File "/root/miniconda3/lib/python3.12/site-packages/vllm/model_executor/layers/linear.py", line 371, in forward
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]     output_parallel = self.quant_method.apply(self, input_, bias)
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]   File "/root/miniconda3/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py", line 368, in apply
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]     return scheme.apply_weights(layer, x, bias=bias)
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]   File "/root/miniconda3/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w8a8_int8.py", line 143, in apply_weights
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]     return apply_int8_linear(input=x,
(VllmWorkerProcess pid=3737) INFO 10-16 19:44:44 multiproc_worker_utils.py:242] Worker exiting
Process SpawnProcess-1:
INFO 10-16 19:44:44 multiproc_worker_utils.py:121] Killing local vLLM worker processes
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.12/site-packages/vllm/_custom_ops.py", line 45, in wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/vllm/_custom_ops.py", line 512, in cutlass_scaled_mm
    torch.ops._C.cutlass_scaled_mm(out, a, b, scale_a, scale_b, bias)
  File "/root/miniconda3/lib/python3.12/site-packages/torch/_ops.py", line 1061, in __call__
    return self_._op(*args, **(kwargs or {}))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
NotImplementedError: No compiled cutlass_scaled_mm for a compute capability less than CUDA device capability: 75

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.12/site-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/vllm/worker/model_runner.py", line 1665, in execute_model
    hidden_or_intermediate_states = model_executable(
                                    ^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/vllm/model_executor/models/llama.py", line 556, in forward
    model_output = self.model(input_ids, positions, kv_caches,
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/vllm/model_executor/models/llama.py", line 345, in forward
    hidden_states, residual = layer(positions, hidden_states,
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/vllm/model_executor/models/llama.py", line 257, in forward
    hidden_states = self.self_attn(positions=positions,
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/vllm/model_executor/models/llama.py", line 184, in forward
    qkv, _ = self.qkv_proj(hidden_states)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/vllm/model_executor/layers/linear.py", line 371, in forward
    output_parallel = self.quant_method.apply(self, input_, bias)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py", line 368, in apply
    return scheme.apply_weights(layer, x, bias=bias)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w8a8_int8.py", line 143, in apply_weights
    return apply_int8_linear(input=x,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/utils/w8a8_utils.py", line 217, in apply_int8_linear
    return ops.cutlass_scaled_mm(x_q,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/vllm/_custom_ops.py", line 54, in wrapper
    raise NotImplementedError(msg % (fn.__name__, e)) from e
NotImplementedError: Error in calling custom op cutlass_scaled_mm: No compiled cutlass_scaled_mm for a compute capability less than CUDA device capability: 75
Not implemented or built, mostly likely because the current current device does not support this kernel (less likely TORCH_CUDA_ARCH_LIST was set incorrectly while building)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/root/miniconda3/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/root/miniconda3/lib/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 392, in run_mp_engine    engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 141, in from_engine_args
    return cls(
           ^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 78, in __init__
    self.engine = LLMEngine(*args,
                  ^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 349, in __init__
    self._initialize_kv_caches()
  File "/root/miniconda3/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 484, in _initialize_kv_caches
    self.model_executor.determine_num_available_blocks())
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/vllm/executor/distributed_gpu_executor.py", line 39, in determine_num_available_blocks
    num_blocks = self._run_workers("determine_num_available_blocks", )
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/vllm/executor/multiproc_gpu_executor.py", line 192, in _run_workers
    driver_worker_output = driver_worker_method(*args, **kwargs)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/vllm/worker/worker.py", line 223, in determine_num_available_blocks
    self.model_runner.profile_run()
  File "/root/miniconda3/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/vllm/worker/model_runner.py", line 1309, in profile_run
    self.execute_model(model_input, kv_caches, intermediate_tensors)
  File "/root/miniconda3/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/vllm/worker/model_runner_base.py", line 152, in _wrapper
    raise type(err)(
NotImplementedError: Error in model execution (input dumped to /tmp/err_execute_model_input_20241016-194444.pkl): Error in calling custom op cutlass_scaled_mm: No compiled cutlass_scaled_mm for a compute capability less than CUDA device capability: 75
Not implemented or built, mostly likely because the current current device does not support this kernel (less likely TORCH_CUDA_ARCH_LIST was set incorrectly while building)
[rank0]:[W1016 19:44:44.083763338 CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/root/miniconda3/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 585, in <module>
    uvloop.run(run_server(args))
  File "/root/miniconda3/lib/python3.12/site-packages/uvloop/__init__.py", line 109, in run
    return __asyncio.run(
           ^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/asyncio/runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/root/miniconda3/lib/python3.12/site-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
           ^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 552, in run_server
    async with build_async_engine_client(args) as engine_client:
  File "/root/miniconda3/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 107, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/root/miniconda3/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 194, in build_async_engine_client_from_engine_args
    raise RuntimeError(
RuntimeError: Engine process failed to start
/root/miniconda3/lib/python3.12/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
```
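For context, the RTX 2080 Ti is a Turing GPU with compute capability 7.5 (the "75" in the error message), so the failure suggests the 0.6.3 wheel no longer ships cutlass_scaled_mm kernels built for SM 7.5. A minimal, PyTorch-only sketch (not part of the original report) to confirm what capability the installed build sees:

```python
# Sketch: print the compute capability that vLLM's capability check sees for
# each visible GPU. Both RTX 2080 Ti cards should report sm_75 (Turing).
import torch

for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    print(f"GPU {i}: {torch.cuda.get_device_name(i)} -> sm_{major}{minor}")

# CUDA version the installed PyTorch wheel was built against (12.1 here).
print("torch CUDA version:", torch.version.cuda)
```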


HelloCard commented 3 days ago

After the upgrade I can still load unquantized bf16 models, but I can no longer load W8A8 models.

```text
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.50                 Driver Version: 531.79       CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 2080 Ti      On | 00000000:01:00.0  On |                  N/A |
|  0%   25C    P8               23W / 300W|    360MiB / 22528MiB |      1%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 2080 Ti      On | 00000000:02:00.0 Off |                  N/A |
| 85%   25C    P8               23W / 300W|    360MiB / 22528MiB |      1%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A        29      G   /Xwayland                                 N/A      |
|    1   N/A  N/A        29      G   /Xwayland                                 N/A      |
+---------------------------------------------------------------------------------------+
```
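Since the exception is raised during the memory-profiling forward pass at engine start-up (before any request is served), the same failure should also reproduce with the offline Python API; a minimal sketch, assuming the same model path and settings as the failing api_server command above:

```python
# Sketch of a minimal offline repro; on 0.6.3 the engine construction itself
# already fails with the cutlass_scaled_mm NotImplementedError during
# profile_run, so generate() is never reached.
from vllm import LLM

llm = LLM(
    model="/mnt/e/Code/models/Orca-2-13b-W8A8",
    tensor_parallel_size=2,
    dtype="half",
    max_model_len=4096,
    gpu_memory_utilization=0.73,
)
print(llm.generate("Hello"))
```
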
LucasWilkinson commented 2 days ago

@HelloCard Are you building vLLM locally? Looks like your collect_env.py is reporting no version

Edit: nvm this appears to be a known issue: https://github.com/vllm-project/vllm/issues/9421

HelloCard commented 1 day ago

> @HelloCard Are you building vLLM locally? Looks like your collect_env.py is reporting no version
>
> Edit: nvm this appears to be a known issue: #9421

Yes, the missing version is a separate issue that others have already reported. I upgraded vLLM with pip install, so this is the released 0.6.3.
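Until the fix lands in a release, pinning back to the previous wheel with `pip install vllm==0.6.2`, which loaded this W8A8 model fine, looks like a possible stopgap.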

LucasWilkinson commented 1 day ago

Fix: https://github.com/vllm-project/vllm/pull/9487