vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: After updating from 0.6.2 to 0.6.3, INT8 (W8A8) models cannot be loaded at all: "No compiled cutlass_scaled_mm for a compute capability less than CUDA device capability: 75" #9419

Open HelloCard opened 3 days ago

HelloCard commented 3 days ago

Your current environment

The output of `python collect_env.py`:

```text
(base) root@DESKTOP-PEPA2G9:~# python collect_env.py
Collecting environment information...
/root/miniconda3/lib/python3.12/site-packages/vllm/connections.py:8: RuntimeWarning: Failed to read commit hash:
No module named 'vllm._version'
  from vllm.version import __version__ as VLLM_VERSION
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.2 LTS (x86_64)
GCC version: Could not collect
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.35

Python version: 3.12.4 | packaged by Anaconda, Inc. | (main, Jun 18 2024, 15:12:24) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 2080 Ti
GPU 1: NVIDIA GeForce RTX 2080 Ti

Nvidia driver version: 531.79
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 39 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Vendor ID: GenuineIntel
Model name: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
CPU family: 6
Model: 94
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
Stepping: 3
BogoMIPS: 8015.99
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid pni pclmulqdq ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti ssbd ibrs ibpb stibp fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 xsaves flush_l1d arch_capabilities
Hypervisor vendor: Microsoft
Virtualization type: full
L1d cache: 128 KiB (4 instances)
L1i cache: 128 KiB (4 instances)
L2 cache: 1 MiB (4 instances)
L3 cache: 8 MiB (1 instance)
Vulnerability Gather data sampling: Unknown: Dependent on hypervisor status
Vulnerability Itlb multihit: KVM: Mitigation: VMX unsupported
Vulnerability L1tf: Mitigation; PTE Inversion
Vulnerability Mds: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Meltdown: Mitigation; PTI
Vulnerability Mmio stale data: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Retbleed: Mitigation; IBRS
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; IBRS, IBPB conditional, STIBP conditional, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Unknown: Dependent on hypervisor status
Vulnerability Tsx async abort: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.6.77
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pyzmq==26.2.0
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.45.2
[pip3] triton==3.0.0
[conda] numpy 1.26.4 pypi_0 pypi
[conda] nvidia-cublas-cu12 12.1.3.1 pypi_0 pypi
[conda] nvidia-cuda-cupti-cu12 12.1.105 pypi_0 pypi
[conda] nvidia-cuda-nvrtc-cu12 12.1.105 pypi_0 pypi
[conda] nvidia-cuda-runtime-cu12 12.1.105 pypi_0 pypi
[conda] nvidia-cudnn-cu12 9.1.0.70 pypi_0 pypi
[conda] nvidia-cufft-cu12 11.0.2.54 pypi_0 pypi
[conda] nvidia-curand-cu12 10.3.2.106 pypi_0 pypi
[conda] nvidia-cusolver-cu12 11.4.5.107 pypi_0 pypi
[conda] nvidia-cusparse-cu12 12.1.0.106 pypi_0 pypi
[conda] nvidia-ml-py 12.560.30 pypi_0 pypi
[conda] nvidia-nccl-cu12 2.20.5 pypi_0 pypi
[conda] nvidia-nvjitlink-cu12 12.6.77 pypi_0 pypi
[conda] nvidia-nvtx-cu12 12.1.105 pypi_0 pypi
[conda] pyzmq 26.2.0 pypi_0 pypi
[conda] torch 2.4.0 pypi_0 pypi
[conda] torchvision 0.19.0 pypi_0 pypi
[conda] transformers 4.45.2 pypi_0 pypi
[conda] triton 3.0.0 pypi_0 pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: N/A (dev)
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
        GPU0    GPU1    CPU Affinity    NUMA Affinity
GPU0     X      NV2
GPU1    NV2      X
Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
```

Model Input Dumps

err_execute_model_input_20241016-194444.zip

🐛 Describe the bug

As the title says, after upgrading from 0.6.2 to 0.6.3 I can no longer load W8A8-format models in my WSL2 environment:

```text
(base) root@DESKTOP-PEPA2G9:/mnt/c/Windows/system32# python3 -m vllm.entrypoints.openai.api_server --model /mnt/e/Code/models/Orca-2-13b-W8A8 --max-model-len 4096 --tensor-parallel-size 2 --gpu-memory-utilization 0.73 --dtype=half
/root/miniconda3/lib/python3.12/site-packages/vllm/connections.py:8: RuntimeWarning: Failed to read commit hash:
No module named 'vllm._version'
  from vllm.version import __version__ as VLLM_VERSION
INFO 10-16 19:41:34 api_server.py:528] vLLM API server version dev
INFO 10-16 19:41:34 api_server.py:529] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/mnt/e/Code/models/Orca-2-13b-W8A8', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', config_format='auto', dtype='half', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=4096, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.73, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, scheduling_policy='fcfs', disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False)
INFO 10-16 19:41:34 api_server.py:166] Multiprocessing frontend to use ipc:///tmp/064ede48-6162-414f-aa8a-4e311be77be2 for IPC Path.
INFO 10-16 19:41:34 api_server.py:179] Started engine process with PID 3661
WARNING 10-16 19:41:34 config.py:1674] Casting torch.bfloat16 to torch.float16.
/root/miniconda3/lib/python3.12/site-packages/vllm/connections.py:8: RuntimeWarning: Failed to read commit hash:
No module named 'vllm._version'
  from vllm.version import __version__ as VLLM_VERSION
WARNING 10-16 19:41:37 config.py:1674] Casting torch.bfloat16 to torch.float16.
INFO 10-16 19:41:38 config.py:887] Defaulting to use mp for distributed inference
INFO 10-16 19:41:41 config.py:887] Defaulting to use mp for distributed inference
INFO 10-16 19:41:41 llm_engine.py:237] Initializing an LLM engine (vdev) with config: model='/mnt/e/Code/models/Orca-2-13b-W8A8', speculative_config=None, tokenizer='/mnt/e/Code/models/Orca-2-13b-W8A8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/mnt/e/Code/models/Orca-2-13b-W8A8, use_v2_block_manager=True, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=True, mm_processor_kwargs=None)
WARNING 10-16 19:41:41 multiproc_gpu_executor.py:53] Reducing Torch parallelism from 4 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 10-16 19:41:41 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
WARNING 10-16 19:41:41 utils.py:772] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
(VllmWorkerProcess pid=3737) WARNING 10-16 19:41:41 utils.py:772] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
INFO 10-16 19:41:41 selector.py:224] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 10-16 19:41:41 selector.py:115] Using XFormers backend.
(VllmWorkerProcess pid=3737) INFO 10-16 19:41:41 selector.py:224] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
(VllmWorkerProcess pid=3737) INFO 10-16 19:41:41 selector.py:115] Using XFormers backend.
/root/miniconda3/lib/python3.12/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_fwd")
(VllmWorkerProcess pid=3737) /root/miniconda3/lib/python3.12/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
(VllmWorkerProcess pid=3737)   @torch.library.impl_abstract("xformers_flash::flash_fwd")
/root/miniconda3/lib/python3.12/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_bwd")
(VllmWorkerProcess pid=3737) /root/miniconda3/lib/python3.12/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
(VllmWorkerProcess pid=3737)   @torch.library.impl_abstract("xformers_flash::flash_bwd")
(VllmWorkerProcess pid=3737) INFO 10-16 19:41:43 multiproc_worker_utils.py:216] Worker ready; awaiting tasks
INFO 10-16 19:41:45 utils.py:1008] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=3737) INFO 10-16 19:41:45 utils.py:1008] Found nccl from library libnccl.so.2
INFO 10-16 19:41:45 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=3737) INFO 10-16 19:41:45 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 10-16 19:41:46 custom_all_reduce_utils.py:204] generating GPU P2P access cache in /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
INFO 10-16 19:41:53 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
(VllmWorkerProcess pid=3737) INFO 10-16 19:41:53 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
INFO 10-16 19:41:53 shm_broadcast.py:241] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1], buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at 0x7f88618d4110>, local_subscribe_port=46945, remote_subscribe_port=None)
INFO 10-16 19:41:53 model_runner.py:1060] Starting to load model /mnt/e/Code/models/Orca-2-13b-W8A8...
(VllmWorkerProcess pid=3737) INFO 10-16 19:41:53 model_runner.py:1060] Starting to load model /mnt/e/Code/models/Orca-2-13b-W8A8...
(VllmWorkerProcess pid=3737) INFO 10-16 19:41:53 selector.py:224] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 10-16 19:41:53 selector.py:224] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
(VllmWorkerProcess pid=3737) INFO 10-16 19:41:53 selector.py:115] Using XFormers backend.
INFO 10-16 19:41:53 selector.py:115] Using XFormers backend.
Loading safetensors checkpoint shards:   0% Completed | 0/3 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  33% Completed | 1/3 [01:02<02:04, 62.27s/it]
Loading safetensors checkpoint shards:  67% Completed | 2/3 [02:05<01:02, 62.77s/it]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [02:49<00:00, 54.15s/it]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [02:49<00:00, 56.43s/it]

INFO 10-16 19:44:43 model_runner.py:1071] Loading model weights took 6.2693 GB
(VllmWorkerProcess pid=3737) INFO 10-16 19:44:44 model_runner.py:1071] Loading model weights took 6.2693 GB
ERROR 10-16 19:44:44 _custom_ops.py:53] Error in calling custom op cutlass_scaled_mm: No compiled cutlass_scaled_mm for a compute capability less than CUDA device capability: 75
ERROR 10-16 19:44:44 _custom_ops.py:53] Not implemented or built, mostly likely because the current current device does not support this kernel (less likely TORCH_CUDA_ARCH_LIST was set incorrectly while building)
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 _custom_ops.py:53] Error in calling custom op cutlass_scaled_mm: No compiled cutlass_scaled_mm for a compute capability less than CUDA device capability: 75
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 _custom_ops.py:53] Not implemented or built, mostly likely because the current current device does not support this kernel (less likely TORCH_CUDA_ARCH_LIST was set incorrectly while building)
INFO 10-16 19:44:44 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20241016-194444.pkl...
(VllmWorkerProcess pid=3737) INFO 10-16 19:44:44 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20241016-194444.pkl...
(VllmWorkerProcess pid=3737) INFO 10-16 19:44:44 model_runner_base.py:149] Completed writing input of failed execution to /tmp/err_execute_model_input_20241016-194444.pkl.
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231] Exception in worker VllmWorkerProcess while processing method determine_num_available_blocks: Error in model execution (input dumped to /tmp/err_execute_model_input_20241016-194444.pkl): Error in calling custom op cutlass_scaled_mm: No compiled cutlass_scaled_mm for a compute capability less than CUDA device capability: 75
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231] Not implemented or built, mostly likely because the current current device does not support this kernel (less likely TORCH_CUDA_ARCH_LIST was set incorrectly while building), Traceback (most recent call last):
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]   File "/root/miniconda3/lib/python3.12/site-packages/vllm/_custom_ops.py", line 45, in wrapper
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]     return fn(*args, **kwargs)
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]            ^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]   File "/root/miniconda3/lib/python3.12/site-packages/vllm/_custom_ops.py", line 512, in cutlass_scaled_mm
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]     torch.ops._C.cutlass_scaled_mm(out, a, b, scale_a, scale_b, bias)
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]   File "/root/miniconda3/lib/python3.12/site-packages/torch/_ops.py", line 1061, in __call__
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]     return self_._op(*args, **(kwargs or {}))
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231] NotImplementedError: No compiled cutlass_scaled_mm for a compute capability less than CUDA device capability: 75
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231] The above exception was the direct cause of the following exception:
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231] Traceback (most recent call last):
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]   File "/root/miniconda3/lib/python3.12/site-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]     return func(*args, **kwargs)
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]   File "/root/miniconda3/lib/python3.12/site-packages/vllm/worker/model_runner.py", line 1665, in execute_model
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]     hidden_or_intermediate_states = model_executable(
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]                                     ^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]   File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]   File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]   File "/root/miniconda3/lib/python3.12/site-packages/vllm/model_executor/models/llama.py", line 556, in forward
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]     model_output = self.model(input_ids, positions, kv_caches,
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]   File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]   File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]   File "/root/miniconda3/lib/python3.12/site-packages/vllm/model_executor/models/llama.py", line 345, in forward
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]     hidden_states, residual = layer(positions, hidden_states,
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]   File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]   File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]   File "/root/miniconda3/lib/python3.12/site-packages/vllm/model_executor/models/llama.py", line 257, in forward
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]     hidden_states = self.self_attn(positions=positions,
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]   File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]   File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
INFO 10-16 19:44:44 model_runner_base.py:149] Completed writing input of failed execution to /tmp/err_execute_model_input_20241016-194444.pkl.
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]   File "/root/miniconda3/lib/python3.12/site-packages/vllm/model_executor/models/llama.py", line 184, in forward
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]     qkv, _ = self.qkv_proj(hidden_states)
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]   File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]   File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]   File "/root/miniconda3/lib/python3.12/site-packages/vllm/model_executor/layers/linear.py", line 371, in forward
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]     output_parallel = self.quant_method.apply(self, input_, bias)
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]   File "/root/miniconda3/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py", line 368, in apply
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]     return scheme.apply_weights(layer, x, bias=bias)
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]   File "/root/miniconda3/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w8a8_int8.py", line 143, in apply_weights
(VllmWorkerProcess pid=3737) ERROR 10-16 19:44:44 multiproc_worker_utils.py:231]     return apply_int8_linear(input=x,
(VllmWorkerProcess pid=3737) INFO 10-16 19:44:44 multiproc_worker_utils.py:242] Worker exiting
Process SpawnProcess-1:
INFO 10-16 19:44:44 multiproc_worker_utils.py:121] Killing local vLLM worker processes
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.12/site-packages/vllm/_custom_ops.py", line 45, in wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/vllm/_custom_ops.py", line 512, in cutlass_scaled_mm
    torch.ops._C.cutlass_scaled_mm(out, a, b, scale_a, scale_b, bias)
  File "/root/miniconda3/lib/python3.12/site-packages/torch/_ops.py", line 1061, in __call__
    return self_._op(*args, **(kwargs or {}))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
NotImplementedError: No compiled cutlass_scaled_mm for a compute capability less than CUDA device capability: 75

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.12/site-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/vllm/worker/model_runner.py", line 1665, in execute_model
    hidden_or_intermediate_states = model_executable(
                                    ^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/vllm/model_executor/models/llama.py", line 556, in forward
    model_output = self.model(input_ids, positions, kv_caches,
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/vllm/model_executor/models/llama.py", line 345, in forward
    hidden_states, residual = layer(positions, hidden_states,
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/vllm/model_executor/models/llama.py", line 257, in forward
    hidden_states = self.self_attn(positions=positions,
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/vllm/model_executor/models/llama.py", line 184, in forward
    qkv, _ = self.qkv_proj(hidden_states)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/vllm/model_executor/layers/linear.py", line 371, in forward
    output_parallel = self.quant_method.apply(self, input_, bias)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py", line 368, in apply
    return scheme.apply_weights(layer, x, bias=bias)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w8a8_int8.py", line 143, in apply_weights
    return apply_int8_linear(input=x,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/utils/w8a8_utils.py", line 217, in apply_int8_linear
    return ops.cutlass_scaled_mm(x_q,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/vllm/_custom_ops.py", line 54, in wrapper
    raise NotImplementedError(msg % (fn.__name__, e)) from e
NotImplementedError: Error in calling custom op cutlass_scaled_mm: No compiled cutlass_scaled_mm for a compute capability less than CUDA device capability: 75
Not implemented or built, mostly likely because the current current device does not support this kernel (less likely TORCH_CUDA_ARCH_LIST was set incorrectly while building)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/root/miniconda3/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/root/miniconda3/lib/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 392, in run_mp_engine    engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 141, in from_engine_args
    return cls(
           ^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 78, in __init__
    self.engine = LLMEngine(*args,
                  ^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 349, in __init__
    self._initialize_kv_caches()
  File "/root/miniconda3/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 484, in _initialize_kv_caches
    self.model_executor.determine_num_available_blocks())
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/vllm/executor/distributed_gpu_executor.py", line 39, in determine_num_available_blocks
    num_blocks = self._run_workers("determine_num_available_blocks", )
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/vllm/executor/multiproc_gpu_executor.py", line 192, in _run_workers
    driver_worker_output = driver_worker_method(*args, **kwargs)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/vllm/worker/worker.py", line 223, in determine_num_available_blocks
    self.model_runner.profile_run()
  File "/root/miniconda3/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/vllm/worker/model_runner.py", line 1309, in profile_run
    self.execute_model(model_input, kv_caches, intermediate_tensors)
  File "/root/miniconda3/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/vllm/worker/model_runner_base.py", line 152, in _wrapper
    raise type(err)(
NotImplementedError: Error in model execution (input dumped to /tmp/err_execute_model_input_20241016-194444.pkl): Error in calling custom op cutlass_scaled_mm: No compiled cutlass_scaled_mm for a compute capability less than CUDA device capability: 75
Not implemented or built, mostly likely because the current current device does not support this kernel (less likely TORCH_CUDA_ARCH_LIST was set incorrectly while building)
[rank0]:[W1016 19:44:44.083763338 CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/root/miniconda3/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 585, in <module>
    uvloop.run(run_server(args))
  File "/root/miniconda3/lib/python3.12/site-packages/uvloop/__init__.py", line 109, in run
    return __asyncio.run(
           ^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/asyncio/runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/root/miniconda3/lib/python3.12/site-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
           ^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 552, in run_server
    async with build_async_engine_client(args) as engine_client:
  File "/root/miniconda3/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 107, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/root/miniconda3/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 194, in build_async_engine_client_from_engine_args
    raise RuntimeError(
RuntimeError: Engine process failed to start
/root/miniconda3/lib/python3.12/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
```
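For context, the RTX 2080 Ti is a Turing GPU with compute capability 7.5 (the "75" in the error message), so the failure suggests the 0.6.3 wheel no longer ships cutlass_scaled_mm kernels built for SM 7.5. A minimal, PyTorch-only sketch (not part of the original report) to confirm what capability the installed build sees:

```python
# Sketch: print the compute capability that vLLM's capability check sees for
# each visible GPU. Both RTX 2080 Ti cards should report sm_75 (Turing).
import torch

for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    print(f"GPU {i}: {torch.cuda.get_device_name(i)} -> sm_{major}{minor}")

# CUDA version the installed PyTorch wheel was built against (12.1 here).
print("torch CUDA version:", torch.version.cuda)
```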


HelloCard commented 3 days ago

After the upgrade I can still load unquantized bf16 models, but I can no longer load W8A8 models.

```text
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.50                 Driver Version: 531.79       CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 2080 Ti      On | 00000000:01:00.0  On |                  N/A |
|  0%   25C    P8               23W / 300W|    360MiB / 22528MiB |      1%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 2080 Ti      On | 00000000:02:00.0 Off |                  N/A |
| 85%   25C    P8               23W / 300W|    360MiB / 22528MiB |      1%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A        29      G   /Xwayland                                 N/A      |
|    1   N/A  N/A        29      G   /Xwayland                                 N/A      |
+---------------------------------------------------------------------------------------+
```
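Since the exception is raised during the memory-profiling forward pass at engine start-up (before any request is served), the same failure should also reproduce with the offline Python API; a minimal sketch, assuming the same model path and settings as the failing api_server command above:

```python
# Sketch of a minimal offline repro; on 0.6.3 the engine construction itself
# already fails with the cutlass_scaled_mm NotImplementedError during
# profile_run, so generate() is never reached.
from vllm import LLM

llm = LLM(
    model="/mnt/e/Code/models/Orca-2-13b-W8A8",
    tensor_parallel_size=2,
    dtype="half",
    max_model_len=4096,
    gpu_memory_utilization=0.73,
)
print(llm.generate("Hello"))
```
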
LucasWilkinson commented 2 days ago

@HelloCard Are you building vLLM locally? Looks like your collect_env.py is reporting no version

Edit: nvm this appears to be a known issue: https://github.com/vllm-project/vllm/issues/9421

HelloCard commented 1 day ago

> @HelloCard Are you building vLLM locally? Looks like your collect_env.py is reporting no version
>
> Edit: nvm this appears to be a known issue: #9421

Yes, the missing version is a separate issue that others have already reported. I upgraded vLLM with pip install, so this is the released 0.6.3.
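Until the fix lands in a release, pinning back to the previous wheel with `pip install vllm==0.6.2`, which loaded this W8A8 model fine, looks like a possible stopgap.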

LucasWilkinson commented 1 day ago

Fix: https://github.com/vllm-project/vllm/pull/9487