+1
Marlin
+1 Facing the same issue.
@LucasWilkinson will take a look
Explicitly setting quantization="gptq" should unblock you for now on a T4. We will look into the issue.
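For anyone hitting this through the offline API, here is a minimal sketch of that workaround, assuming a placeholder model name, prompt, and sampling settings: passing quantization="gptq" explicitly skips the automatic conversion to gptq_marlin.

# Minimal sketch of the workaround: pass quantization="gptq" explicitly so
# vLLM uses the plain GPTQ kernel instead of auto-converting to gptq_marlin.
# Model name and prompt are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="shuyuej/Mistral-Nemo-Instruct-2407-GPTQ-INT8",
    quantization="gptq",              # the explicit override suggested above
    max_model_len=8192,
    gpu_memory_utilization=0.85,
)

outputs = llm.generate(
    ["Hello!"],
    SamplingParams(temperature=0.7, max_tokens=64),
)
print(outputs[0].outputs[0].text)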
+1, does anyone have a good workaround for this?
You can roll back to the older version with pip install vllm==0.5.3.post1, or, as someone replied above, set quantization="gptq" if you are on a T4.
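For deciding between the two workarounds, a small illustrative check may help. This is an assumption-laden sketch, not part of vLLM: the only fact taken from the thread is that the explicit gptq override was suggested for a T4, which reports CUDA compute capability 7.5.

# Illustrative helper, not part of vLLM: inspect the installed version and the
# GPU's compute capability before choosing a quantization override.
import torch
import vllm

major, minor = torch.cuda.get_device_capability()  # a T4 reports (7, 5)
print(f"vLLM {vllm.__version__}, compute capability {major}.{minor}")

# Assumption for this sketch: mirror the T4 advice from the thread and force
# the plain GPTQ kernel on pre-8.0 GPUs; elsewhere let vLLM auto-select.
quantization = "gptq" if (major, minor) < (8, 0) else None
print("Suggested quantization override:", quantization)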
Closing because this is fixed by #7264
vLLM v0.5.4, model shuyuej/Mistral-Nemo-Instruct-2407-GPTQ-INT8
(base) root@DESKTOP-O6DNFE1:/mnt/c/Windows/system32# CUDA_VISIBLE_DEVICES=1 python3 -m vllm.entrypoints.openai.api_server --model /mnt/e/Code/models/Mistral-Nemo-Instruct-2407-GPTQ-INT8 --max-num-seqs=1 --max-model-len 8192 --gpu-memory-utilization 0.85
INFO 08-08 23:00:06 api_server.py:339] vLLM API server version 0.5.4
INFO 08-08 23:00:06 api_server.py:340] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, model='/mnt/e/Code/models/Mistral-Nemo-Instruct-2407-GPTQ-INT8', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=8192, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.85, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=1, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
INFO 08-08 23:00:06 gptq_marlin.py:98] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
INFO 08-08 23:00:06 gptq_marlin.py:98] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
INFO 08-08 23:00:06 llm_engine.py:174] Initializing an LLM engine (v0.5.4) with config: model='/mnt/e/Code/models/Mistral-Nemo-Instruct-2407-GPTQ-INT8', speculative_config=None, tokenizer='/mnt/e/Code/models/Mistral-Nemo-Instruct-2407-GPTQ-INT8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq_marlin, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/mnt/e/Code/models/Mistral-Nemo-Instruct-2407-GPTQ-INT8, use_v2_block_manager=False, enable_prefix_caching=False)
WARNING 08-08 23:00:06 utils.py:578] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
INFO 08-08 23:00:07 selector.py:151] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 08-08 23:00:07 selector.py:54] Using XFormers backend.
/root/miniconda3/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
@torch.library.impl_abstract("xformers_flash::flash_fwd")
/root/miniconda3/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
@torch.library.impl_abstract("xformers_flash::flash_bwd")
INFO 08-08 23:00:08 model_runner.py:720] Starting to load model /mnt/e/Code/models/Mistral-Nemo-Instruct-2407-GPTQ-INT8...
Process Process-1:
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/root/miniconda3/lib/python3.11/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/root/miniconda3/lib/python3.11/site-packages/vllm/entrypoints/openai/rpc/server.py", line 217, in run_rpc_server
server = AsyncEngineRPCServer(async_engine_args, usage_context, port)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/vllm/entrypoints/openai/rpc/server.py", line 25, in __init__
self.engine = AsyncLLMEngine.from_engine_args(async_engine_args,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 471, in from_engine_args
engine = cls(
^^^^
File "/root/miniconda3/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 381, in __init__
self.engine = self._init_engine(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 552, in _init_engine
return engine_class(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 249, in __init__
self.model_executor = executor_class(
^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/vllm/executor/executor_base.py", line 47, in __init__
self._init_executor()
File "/root/miniconda3/lib/python3.11/site-packages/vllm/executor/gpu_executor.py", line 36, in _init_executor
self.driver_worker.load_model()
File "/root/miniconda3/lib/python3.11/site-packages/vllm/worker/worker.py", line 139, in load_model
self.model_runner.load_model()
File "/root/miniconda3/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 722, in load_model
self.model = get_model(model_config=self.model_config,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
return loader.load_model(model_config=model_config,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/vllm/model_executor/model_loader/loader.py", line 324, in load_model
model = _initialize_model(model_config, self.load_config,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/vllm/model_executor/model_loader/loader.py", line 152, in _initialize_model
quant_config = _get_quantization_config(model_config, load_config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/vllm/model_executor/model_loader/loader.py", line 93, in _get_quantization_config
quant_config = get_quant_config(model_config, load_config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/vllm/model_executor/model_loader/weight_utils.py", line 132, in get_quant_config
return quant_cls.from_config(hf_quant_config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/vllm/model_executor/layers/quantization/gptq_marlin.py", line 84, in from_config
return cls(weight_bits, group_size, desc_act, is_sym,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/vllm/model_executor/layers/quantization/gptq_marlin.py", line 51, in __init__
verify_marlin_supported(quant_type=self.quant_type,
File "/root/miniconda3/lib/python3.11/site-packages/vllm/model_executor/layers/quantization/utils/marlin_utils.py", line 88, in verify_marlin_supported
raise ValueError(err_msg)
ValueError: Marlin does not support weight_bits = uint8b128. Only types = [] are supported (for group_size = 128, min_capability = 75, zp = False).
add "--quantization gptq" and then OK.
Hello, what does this mean?
Hello, I still get the same error on a T4 with 'neuralmagic/Mistral-Nemo-Instruct-2407-quantized.w4a16'.