vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: The new version (v0.5.4) cannot load the gptq model, but the old version (vllm=0.5.3.post1) can do it. #7240

Closed ningwebbeginner closed 3 months ago

ningwebbeginner commented 3 months ago

Your current environment

--2024-08-07 03:22:15--  https://raw.githubusercontent.com/vllm-project/vllm/main/collect_env.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25132 (25K) [text/plain]
Saving to: ‘collect_env.py’

collect_env.py      100%[===================>]  24.54K  --.-KB/s    in 0.002s  

2024-08-07 03:22:15 (13.9 MB/s) - ‘collect_env.py’ saved [25132/25132]

Collecting environment information...
PyTorch version: 2.3.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: 14.0.0-1ubuntu1.1
CMake version: version 3.30.1
Libc version: glibc-2.35

Python version: 3.10.12 (main, Jul 29 2024, 16:56:48) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-6.1.85+-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.2.140
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: Tesla T4
Nvidia driver version: 535.104.05
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.6
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        46 bits physical, 48 bits virtual
Byte Order:                           Little Endian
CPU(s):                               2
On-line CPU(s) list:                  0,1
Vendor ID:                            GenuineIntel
Model name:                           Intel(R) Xeon(R) CPU @ 2.00GHz
CPU family:                           6
Model:                                85
Thread(s) per core:                   2
Core(s) per socket:                   1
Socket(s):                            1
Stepping:                             3
BogoMIPS:                             4000.29
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat md_clear arch_capabilities
Hypervisor vendor:                    KVM
Virtualization type:                  full
L1d cache:                            32 KiB (1 instance)
L1i cache:                            32 KiB (1 instance)
L2 cache:                             1 MiB (1 instance)
L3 cache:                             38.5 MiB (1 instance)
NUMA node(s):                         1
NUMA node0 CPU(s):                    0,1
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Mitigation; PTE Inversion
Vulnerability Mds:                    Vulnerable; SMT Host state unknown
Vulnerability Meltdown:               Vulnerable
Vulnerability Mmio stale data:        Vulnerable
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Vulnerable
Vulnerability Spec rstack overflow:   Not affected
Vulnerability Spec store bypass:      Vulnerable
Vulnerability Spectre v1:             Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
Vulnerability Spectre v2:             Vulnerable; IBPB: disabled; STIBP: disabled; PBRSB-eIBRS: Not affected; BHI: Vulnerable (Syscall hardening enabled)
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Vulnerable

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] optree==0.12.1
[pip3] pyzmq==24.0.1
[pip3] torch==2.3.1
[pip3] torchaudio==2.3.1+cu121
[pip3] torchsummary==1.5.1
[pip3] torchtext==0.18.0
[pip3] torchvision==0.18.1
[pip3] transformers==4.43.3
[pip3] transformers-stream-generator==0.0.4
[pip3] triton==2.3.1
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.3.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X  0-1     N/A     N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

🐛 Describe the bug

import torch
from vllm import LLM, SamplingParams

# Replace with the path to your GPTQ model, e.g. https://huggingface.co/Qwen/Qwen2-7B-Instruct-GPTQ-Int4
model_path = '/content/Qwen2-7B-Instruct-GPTQ-Int4'

# Initialize the LLM
llm = LLM(model=model_path, max_model_len=4096)

ValueError: Marlin does not support weight_bits = uint4b8. Only types = [] are supported (for group_size = 128, min_capability = 75, zp = False).
INFO 08-07 03:14:03 gptq_marlin.py:98] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
INFO 08-07 03:14:03 llm_engine.py:174] Initializing an LLM engine (v0.5.4) with config: model='/content/Qwen2-7B-Instruct-GPTQ-Int4', speculative_config=None, tokenizer='/content/Qwen2-7B-Instruct-GPTQ-Int4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq_marlin, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/content/Qwen2-7B-Instruct-GPTQ-Int4, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 08-07 03:14:04 selector.py:151] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 08-07 03:14:04 selector.py:54] Using XFormers backend.
/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_fwd")
/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_bwd")
INFO 08-07 03:14:06 model_runner.py:720] Starting to load model /content/Qwen2-7B-Instruct-GPTQ-Int4...
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-5-3af6a28987e2> in <cell line: 8>()
      6 
      7 # Initialize the LLM
----> 8 llm = LLM(model=model_path, max_model_len=4096)
      9 
     10 # Set sampling parameters

14 frames
/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/llm.py in __init__(self, model, tokenizer, tokenizer_mode, skip_tokenizer_init, trust_remote_code, tensor_parallel_size, dtype, quantization, revision, tokenizer_revision, seed, gpu_memory_utilization, swap_space, cpu_offload_gb, enforce_eager, max_context_len_to_capture, max_seq_len_to_capture, disable_custom_all_reduce, **kwargs)
    156             **kwargs,
    157         )
--> 158         self.llm_engine = LLMEngine.from_engine_args(
    159             engine_args, usage_context=UsageContext.LLM_CLASS)
    160         self.request_counter = Counter()

/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py in from_engine_args(cls, engine_args, usage_context, stat_loggers)
    443         executor_class = cls._get_executor_cls(engine_config)
    444         # Create the LLM engine.
--> 445         engine = cls(
    446             **engine_config.to_dict(),
    447             executor_class=executor_class,

/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py in __init__(self, model_config, cache_config, parallel_config, scheduler_config, device_config, load_config, lora_config, multimodal_config, speculative_config, decoding_config, observability_config, prompt_adapter_config, executor_class, log_stats, usage_context, stat_loggers)
    247             self.model_config)
    248 
--> 249         self.model_executor = executor_class(
    250             model_config=model_config,
    251             cache_config=cache_config,

/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py in __init__(self, model_config, cache_config, parallel_config, scheduler_config, device_config, load_config, lora_config, multimodal_config, speculative_config, prompt_adapter_config)
     45         self.prompt_adapter_config = prompt_adapter_config
     46 
---> 47         self._init_executor()
     48 
     49     @abstractmethod

/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py in _init_executor(self)
     34         self.driver_worker = self._create_worker()
     35         self.driver_worker.init_device()
---> 36         self.driver_worker.load_model()
     37 
     38     def _get_worker_kwargs(

/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py in load_model(self)
    137 
    138     def load_model(self):
--> 139         self.model_runner.load_model()
    140 
    141     def save_sharded_state(

/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py in load_model(self)
    720         logger.info("Starting to load model %s...", self.model_config.model)
    721         with CudaMemoryProfiler() as m:
--> 722             self.model = get_model(model_config=self.model_config,
    723                                    device_config=self.device_config,
    724                                    load_config=self.load_config,

/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/__init__.py in get_model(model_config, load_config, device_config, parallel_config, scheduler_config, lora_config, multimodal_config, cache_config)
     19               cache_config: CacheConfig) -> nn.Module:
     20     loader = get_model_loader(load_config)
---> 21     return loader.load_model(model_config=model_config,
     22                              device_config=device_config,
     23                              lora_config=lora_config,

/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py in load_model(self, model_config, device_config, lora_config, multimodal_config, parallel_config, scheduler_config, cache_config)
    322         with set_default_torch_dtype(model_config.dtype):
    323             with target_device:
--> 324                 model = _initialize_model(model_config, self.load_config,
    325                                           lora_config, multimodal_config,
    326                                           cache_config, scheduler_config)

/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py in _initialize_model(model_config, load_config, lora_config, multimodal_config, cache_config, scheduler_config)
    150     """Initialize a model with the given configurations."""
    151     model_class = get_model_architecture(model_config)[0]
--> 152     quant_config = _get_quantization_config(model_config, load_config)
    153 
    154     return model_class(config=model_config.hf_config,

/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py in _get_quantization_config(model_config, load_config)
     91     """Get the quantization config."""
     92     if model_config.quantization is not None:
---> 93         quant_config = get_quant_config(model_config, load_config)
     94         capability = current_platform.get_device_capability()
     95         capability = capability[0] * 10 + capability[1]

/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/weight_utils.py in get_quant_config(model_config, load_config)
    130                                   None)
    131     if hf_quant_config is not None:
--> 132         return quant_cls.from_config(hf_quant_config)
    133     # In case of bitsandbytes/QLoRA, get quant config from the adapter model.
    134     if model_config.quantization == "bitsandbytes":

/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/gptq_marlin.py in from_config(cls, config)
     82         lm_head_quantized = cls.get_from_keys_or(config, ["lm_head"],
     83                                                  default=False)
---> 84         return cls(weight_bits, group_size, desc_act, is_sym,
     85                    lm_head_quantized)
     86 

/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/gptq_marlin.py in __init__(self, weight_bits, group_size, desc_act, is_sym, lm_head_quantized)
     49 
     50         # Verify supported on platform.
---> 51         verify_marlin_supported(quant_type=self.quant_type,
     52                                 group_size=self.group_size)
     53 

/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/utils/marlin_utils.py in verify_marlin_supported(quant_type, group_size, has_zp)
     86     if not cond:
     87         assert err_msg is not None
---> 88         raise ValueError(err_msg)
     89 
     90 

ValueError: Marlin does not support weight_bits = uint4b8. Only types = [] are supported (for group_size = 128, min_capability = 75, zp = False).
mars-ch commented 3 months ago

+1

linpan commented 3 months ago

Marlin

Sk4467 commented 3 months ago

+1 Facing the same issue.

robertgshaw2-neuralmagic commented 3 months ago

@LucasWilkinson will take a look

robertgshaw2-neuralmagic commented 3 months ago

Explicitly setting quantization="gptq" should unblock you for now on a T4

We will look into the issue
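For reference, a minimal sketch of that workaround applied to the repro code above (model path reused from the bug report; this is the suggested unblock, not the eventual fix):

from vllm import LLM

# Same example model as in the bug report
model_path = '/content/Qwen2-7B-Instruct-GPTQ-Int4'

# Passing quantization="gptq" keeps vLLM on the plain GPTQ kernel instead of
# auto-converting to gptq_marlin, which is what fails the Marlin capability
# check on a T4 (compute capability 7.5).
llm = LLM(model=model_path, max_model_len=4096, quantization="gptq")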

12sang3 commented 3 months ago

+1, does anyone have a good workaround?

ningwebbeginner commented 3 months ago

+1, does anyone have a good workaround?

You can run pip install vllm==0.5.3.post1 to roll back to the old version, or, as replied above, set quantization="gptq" if you are using a T4.
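In command form, the rollback mentioned above is simply (a sketch; pin whichever prior version worked for you):

pip install vllm==0.5.3.post1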

robertgshaw2-neuralmagic commented 3 months ago

Closing because this is fixed by #7264

HelloCard commented 3 months ago

vllm [v0.5.4], shuyuej/Mistral-Nemo-Instruct-2407-GPTQ-INT8

(base) root@DESKTOP-O6DNFE1:/mnt/c/Windows/system32# CUDA_VISIBLE_DEVICES=1 python3 -m vllm.entrypoints.openai.api_server --model /mnt/e/Code/models/Mistral-Nemo-Instruct-2407-GPTQ-INT8 --max-num-seqs=1 --max-model-len 8192 --gpu-memory-utilization 0.85
INFO 08-08 23:00:06 api_server.py:339] vLLM API server version 0.5.4
INFO 08-08 23:00:06 api_server.py:340] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, model='/mnt/e/Code/models/Mistral-Nemo-Instruct-2407-GPTQ-INT8', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=8192, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.85, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=1, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
INFO 08-08 23:00:06 gptq_marlin.py:98] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
INFO 08-08 23:00:06 gptq_marlin.py:98] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
INFO 08-08 23:00:06 llm_engine.py:174] Initializing an LLM engine (v0.5.4) with config: model='/mnt/e/Code/models/Mistral-Nemo-Instruct-2407-GPTQ-INT8', speculative_config=None, tokenizer='/mnt/e/Code/models/Mistral-Nemo-Instruct-2407-GPTQ-INT8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq_marlin, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/mnt/e/Code/models/Mistral-Nemo-Instruct-2407-GPTQ-INT8, use_v2_block_manager=False, enable_prefix_caching=False)
WARNING 08-08 23:00:06 utils.py:578] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
INFO 08-08 23:00:07 selector.py:151] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 08-08 23:00:07 selector.py:54] Using XFormers backend.
/root/miniconda3/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_fwd")
/root/miniconda3/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_bwd")
INFO 08-08 23:00:08 model_runner.py:720] Starting to load model /mnt/e/Code/models/Mistral-Nemo-Instruct-2407-GPTQ-INT8...
Process Process-1:
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/root/miniconda3/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/root/miniconda3/lib/python3.11/site-packages/vllm/entrypoints/openai/rpc/server.py", line 217, in run_rpc_server
    server = AsyncEngineRPCServer(async_engine_args, usage_context, port)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/vllm/entrypoints/openai/rpc/server.py", line 25, in __init__
    self.engine = AsyncLLMEngine.from_engine_args(async_engine_args,
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 471, in from_engine_args
    engine = cls(
             ^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 381, in __init__
    self.engine = self._init_engine(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 552, in _init_engine
    return engine_class(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 249, in __init__
    self.model_executor = executor_class(
                          ^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/vllm/executor/executor_base.py", line 47, in __init__
    self._init_executor()
  File "/root/miniconda3/lib/python3.11/site-packages/vllm/executor/gpu_executor.py", line 36, in _init_executor
    self.driver_worker.load_model()
  File "/root/miniconda3/lib/python3.11/site-packages/vllm/worker/worker.py", line 139, in load_model
    self.model_runner.load_model()
  File "/root/miniconda3/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 722, in load_model
    self.model = get_model(model_config=self.model_config,
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
    return loader.load_model(model_config=model_config,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/vllm/model_executor/model_loader/loader.py", line 324, in load_model
    model = _initialize_model(model_config, self.load_config,
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/vllm/model_executor/model_loader/loader.py", line 152, in _initialize_model
    quant_config = _get_quantization_config(model_config, load_config)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/vllm/model_executor/model_loader/loader.py", line 93, in _get_quantization_config
    quant_config = get_quant_config(model_config, load_config)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/vllm/model_executor/model_loader/weight_utils.py", line 132, in get_quant_config
    return quant_cls.from_config(hf_quant_config)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/vllm/model_executor/layers/quantization/gptq_marlin.py", line 84, in from_config
    return cls(weight_bits, group_size, desc_act, is_sym,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/vllm/model_executor/layers/quantization/gptq_marlin.py", line 51, in __init__
    verify_marlin_supported(quant_type=self.quant_type,
  File "/root/miniconda3/lib/python3.11/site-packages/vllm/model_executor/layers/quantization/utils/marlin_utils.py", line 88, in verify_marlin_supported
    raise ValueError(err_msg)
ValueError: Marlin does not support weight_bits = uint8b128. Only types = [] are supported (for group_size = 128, min_capability = 75, zp = False).

add "--quantization gptq" and then OK.

12sang3 commented 3 months ago

Hello, what does this mean?

dev1ous commented 2 months ago

Hello, still the same error on a T4 with 'neuralmagic/Mistral-Nemo-Instruct-2407-quantized.w4a16'