## Your current environment

vLLM v0.5.3.post1, CUDA, `tensor_parallel_size=1`; the full engine config is printed at the top of each log below.

## 🐛 Describe the bug

DeepSeek-V2-Lite-gptq-4bit and DeepSeek-Coder-V2-Lite-Instruct-AWQ both raise a model shape `ValueError` while vLLM loads the weights (`--quantization gptq` and `--quantization awq_marlin`, respectively; full tracebacks below).

## Repro
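Our service constructs the engine through an internal `ChatLocalVLLM` wrapper (visible at the top of the tracebacks), but the failure happens inside `AsyncLLMEngine.from_engine_args`, so a plain engine init should reproduce it. A minimal sketch, with settings mirroring the engine config in the logs:

```python
# Minimal repro sketch: the model path is a local directory holding either
# DeepSeek-V2-Lite-gptq-4bit or DeepSeek-Coder-V2-Lite-Instruct-AWQ.
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

engine_args = AsyncEngineArgs(
    model="/usr/local/models/llm",
    quantization="gptq",        # "awq_marlin" for the AWQ checkpoint
    tensor_parallel_size=1,
    trust_remote_code=True,
    dtype="float16",
    max_model_len=32768,
    enforce_eager=True,
)
engine = AsyncLLMEngine.from_engine_args(engine_args)  # raises the ValueError below
```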
### Error (gptq)

```text
INFO 08-02 08:40:49 llm_engine.py:176] Initializing an LLM engine (v0.5.3.post1) with config: model='/usr/local/models/llm', speculative_config=None, tokenizer='/usr/local/models/llm', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/usr/local/models/llm, use_v2_block_manager=False, enable_prefix_caching=False
INFO 08-02 08:40:49 model_runner.py:680] Starting to load model /usr/local/models/llm...
Cache shape torch.Size([163840, 64])
[2024-08-02 08:40:50,110] [WARN] /usr/local/api/chat_router.py(115):__init__: The input size is not aligned with the quantized weight shape. This can be caused by too large tensor parallel size.
[2024-08-02 08:40:50,111] [WARN] /usr/local/api/chat_router.py(116):__init__: Traceback (most recent call last):
File "/usr/local/api/chat_router.py", line 107, in __init__
self.chat_client = ChatLocalVLLM.from_pretraind(model_path=llm_dir, NL="\n"
File "/usr/local/api/chat_models/chat_local_vllm.py", line 73, in from_pretraind
engine = cls._prepare_vllm(model_path, tensor_parallel_size
File "/usr/local/api/chat_models/chat_local_vllm.py", line 123, in _prepare_vllm
engine = AsyncLLMEngine.from_engine_args(engine_args)
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 466, in from_engine_args
engine = cls(
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 380, in __init__
self.engine = self._init_engine(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 547, in _init_engine
return engine_class(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 251, in __init__
self.model_executor = executor_class(
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 47, in __init__
self._init_executor()
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 36, in _init_executor
self.driver_worker.load_model()
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 139, in load_model
self.model_runner.load_model()
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 682, in load_model
self.model = get_model(model_config=self.model_config,
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
return loader.load_model(model_config=model_config,
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 280, in load_model
model = _initialize_model(model_config, self.load_config,
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 111, in _initialize_model
return model_class(config=model_config.hf_config,
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 439, in __init__
self.model = DeepseekV2Model(config, cache_config, quant_config)
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 401, in __init__
self.layers = nn.ModuleList([
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 402, in <listcomp>
DeepseekV2DecoderLayer(config,
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 343, in __init__
self.mlp = DeepseekV2MLP(
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 68, in __init__
self.down_proj = RowParallelLinear(intermediate_size,
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/linear.py", line 728, in __init__
self.quant_method.create_weights(
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/gptq.py", line 112, in create_weights
raise ValueError(
ValueError: The input size is not aligned with the quantized weight shape. This can be caused by too large tensor parallel size.
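The GPTQ failure comes from the group-size alignment check in `gptq.py`'s `create_weights`: a layer's input size per partition must divide evenly into quantization groups. Here the offending layer is `down_proj`, whose input size is the model's `intermediate_size` of 10944 (the AWQ traceback below prints the same number). Assuming the usual group size of 128 from the checkpoint's `quantize_config`, the check fails even without tensor parallelism, so the "too large tensor parallel size" hint is misleading:

```python
# 10944 is misaligned already at tp=1; group_size=128 is an assumption
# (the typical GPTQ quantize_config for these checkpoints).
intermediate_size = 10944
group_size = 128

for tp in (1, 2, 4):
    input_size_per_partition = intermediate_size // tp
    aligned = input_size_per_partition % group_size == 0
    print(f"tp={tp}: {input_size_per_partition} % {group_size} -> aligned={aligned}")
# tp=1: 10944 % 128 -> aligned=False  (10944 = 85 * 128 + 64)
```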
### Error (awq)

```text
INFO 08-02 08:16:01 llm_engine.py:176] Initializing an LLM engine (v0.5.3.post1) with config: model='/usr/local/models/llm', speculative_config=None, tokenizer='/usr/local/models/llm', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=awq_marlin, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/usr/local/models/llm, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 08-02 08:16:01 model_runner.py:680] Starting to load model /usr/local/models/llm...
Cache shape torch.Size([163840, 64])
[2024-08-02 08:16:02,424] [WARN] /usr/local/api/chat_router.py(114):__init__: ERROR ChatModel not working
[2024-08-02 08:16:02,424] [WARN] /usr/local/api/chat_router.py(115):__init__: Weight input_size_per_partition = 10944 is not divisible by min_thread_k = 128. Consider reducing tensor_parallel_size or running with --quantization gptq.
[2024-08-02 08:16:02,426] [WARN] /usr/local/api/chat_router.py(116):__init__: Traceback (most recent call last):
File "/usr/local/api/chat_router.py", line 107, in __init__
self.chat_client = ChatLocalVLLM.from_pretraind(model_path=llm_dir, NL="\n"
File "/usr/local/api/chat_models/chat_local_vllm.py", line 73, in from_pretraind
engine = cls._prepare_vllm(model_path, tensor_parallel_size
File "/usr/local/api/chat_models/chat_local_vllm.py", line 123, in _prepare_vllm
engine = AsyncLLMEngine.from_engine_args(engine_args)
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 466, in from_engine_args
engine = cls(
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 380, in __init__
self.engine = self._init_engine(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 547, in _init_engine
return engine_class(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 251, in __init__
self.model_executor = executor_class(
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 47, in __init__
self._init_executor()
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 36, in _init_executor
self.driver_worker.load_model()
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 139, in load_model
self.model_runner.load_model()
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 682, in load_model
self.model = get_model(model_config=self.model_config,
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
return loader.load_model(model_config=model_config,
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 280, in load_model
model = _initialize_model(model_config, self.load_config,
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 111, in _initialize_model
return model_class(config=model_config.hf_config,
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 439, in __init__
self.model = DeepseekV2Model(config, cache_config, quant_config)
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 401, in __init__
self.layers = nn.ModuleList([
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 402, in <listcomp>
DeepseekV2DecoderLayer(config,
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 343, in __init__
self.mlp = DeepseekV2MLP(
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 68, in __init__
self.down_proj = RowParallelLinear(intermediate_size,
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/linear.py", line 728, in __init__
self.quant_method.create_weights(
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/awq_marlin.py", line 148, in create_weights
verify_marlin_supports_shape(
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/utils/marlin_utils.py", line 106, in verify_marlin_supports_shape
raise ValueError(f"Weight input_size_per_partition = "
ValueError: Weight input_size_per_partition = 10944 is not divisible by min_thread_k = 128. Consider reducing tensor_parallel_size or running with --quantization gptq.
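The AWQ path fails for the same underlying reason: `verify_marlin_supports_shape` requires the per-partition input size to be a multiple of `min_thread_k = 128`, and 10944 leaves a remainder of 64. Since `tensor_parallel_size` is already 1 and `--quantization gptq` hits the first traceback, both suggestions in the message are dead ends for this checkpoint. A simplified sketch of the check (constant taken from the error message):

```python
# Simplified sketch of the Marlin shape check that raises above.
MIN_THREAD_K = 128  # value from the error message

def check_marlin_input_size(input_size_per_partition: int) -> None:
    if input_size_per_partition % MIN_THREAD_K != 0:
        raise ValueError(
            f"Weight input_size_per_partition = {input_size_per_partition} "
            f"is not divisible by min_thread_k = {MIN_THREAD_K}."
        )

check_marlin_input_size(10944)  # raises: 10944 = 85 * 128 + 64
```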