Closed. QB-Chen closed this issue 3 months ago.
@QB-Chen This is possibly related to an issue in the transformers GGUF integration rather than the vLLM implementation. I made a patch in transformers that fixes it and has since been merged. Can you check whether installing transformers from the latest source code fixes this?
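If it helps, installing from source is typically something like `pip install git+https://github.com/huggingface/transformers.git` (the standard pip-from-git route; adjust for your environment).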
A new problem
Process SpawnProcess-1:
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/root/sspaas-fs/vllm/vllm/entrypoints/openai/rpc/server.py", line 222, in run_rpc_server
server = AsyncEngineRPCServer(async_engine_args, usage_context, rpc_path)
File "/root/sspaas-fs/vllm/vllm/entrypoints/openai/rpc/server.py", line 26, in __init__
self.engine = AsyncLLMEngine.from_engine_args(async_engine_args,
File "/root/sspaas-fs/vllm/vllm/engine/async_llm_engine.py", line 735, in from_engine_args
engine = cls(
File "/root/sspaas-fs/vllm/vllm/engine/async_llm_engine.py", line 631, in __init__
self.engine = self._init_engine(*args, **kwargs)
File "/root/sspaas-fs/vllm/vllm/engine/async_llm_engine.py", line 830, in _init_engine
return engine_class(*args, **kwargs)
File "/root/sspaas-fs/vllm/vllm/engine/async_llm_engine.py", line 267, in __init__
super().__init__(*args, **kwargs)
File "/root/sspaas-fs/vllm/vllm/engine/llm_engine.py", line 282, in __init__
self._initialize_kv_caches()
File "/root/sspaas-fs/vllm/vllm/engine/llm_engine.py", line 388, in _initialize_kv_caches
self.model_executor.determine_num_available_blocks())
File "/root/sspaas-fs/vllm/vllm/executor/gpu_executor.py", line 105, in determine_num_available_blocks
return self.driver_worker.determine_num_available_blocks()
File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/root/sspaas-fs/vllm/vllm/worker/worker.py", line 191, in determine_num_available_blocks
self.model_runner.profile_run()
File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/root/sspaas-fs/vllm/vllm/worker/model_runner.py", line 1107, in profile_run
self.execute_model(model_input, kv_caches, intermediate_tensors)
File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/root/sspaas-fs/vllm/vllm/worker/model_runner.py", line 1536, in execute_model
hidden_or_intermediate_states = model_executable(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/root/sspaas-fs/vllm/vllm/model_executor/models/qwen2.py", line 361, in forward
hidden_states = self.model(input_ids, positions, kv_caches,
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/root/sspaas-fs/vllm/vllm/model_executor/models/qwen2.py", line 277, in forward
hidden_states, residual = layer(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/root/sspaas-fs/vllm/vllm/model_executor/models/utils.py", line 169, in forward
output = functional_call(module,
File "/opt/conda/lib/python3.10/site-packages/torch/_functorch/functional_call.py", line 144, in functional_call
return nn.utils.stateless._functional_call(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/utils/stateless.py", line 270, in _functional_call
return module(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/root/sspaas-fs/vllm/vllm/model_executor/models/qwen2.py", line 210, in forward
hidden_states = self.self_attn(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/root/sspaas-fs/vllm/vllm/model_executor/models/qwen2.py", line 154, in forward
qkv, _ = self.qkv_proj(hidden_states)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/root/sspaas-fs/vllm/vllm/model_executor/layers/linear.py", line 359, in forward
output_parallel = self.quant_method.apply(self, input_, bias)
File "/root/sspaas-fs/vllm/vllm/model_executor/layers/quantization/gguf.py", line 134, in apply
qweight_type = layer.qweight_type.weight_type
AttributeError: 'Tensor' object has no attribute 'weight_type'
Traceback (most recent call last):
File "/root/sspaas-fs/vllm/vllm/entrypoints/openai/api_server.py", line 150, in build_async_engine_client
await async_engine_client.setup()
File "/root/sspaas-fs/vllm/vllm/entrypoints/openai/rpc/client.py", line 35, in setup
await self.wait_for_server()
File "/root/sspaas-fs/vllm/vllm/entrypoints/openai/rpc/client.py", line 136, in wait_for_server
await self._send_one_way_rpc_request(
File "/root/sspaas-fs/vllm/vllm/entrypoints/openai/rpc/client.py", line 112, in _send_one_way_rpc_request
raise TimeoutError(f"server didn't reply within {timeout} ms")
TimeoutError: server didn't reply within 1000 ms
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/root/sspaas-fs/vllm/vllm/entrypoints/openai/api_server.py", line 432, in <module>
asyncio.run(run_server(args))
File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
return future.result()
File "/root/sspaas-fs/vllm/vllm/entrypoints/openai/api_server.py", line 403, in run_server
async with build_async_engine_client(args) as async_engine_client:
File "/opt/conda/lib/python3.10/contextlib.py", line 199, in __aenter__
return await anext(self.gen)
File "/root/sspaas-fs/vllm/vllm/entrypoints/openai/api_server.py", line 154, in build_async_engine_client
raise RuntimeError(
RuntimeError: The server process died before responding to the readiness probe
After cloning and installing the latest transformers (4.45.0.dev0) locally, I re-ran Qwen2 GGUF inference with vLLM and encountered this new issue:
AttributeError: 'Tensor' object has no attribute 'weight_type'
Do other models like Qwen2-7B-GGUF/Qwen2-14B-GGUF also hit this error, or is it only the 72B that has this new issue? (The 72B is too large for me to reproduce, but the 7B should share the same root cause, so we could reproduce it on the 7B if this is related to that root issue.)
BTW, I can run 7B inference with transformers (4.45.0.dev0) without any issue, and the new error above is very strange and shouldn't be hit in most cases:
qwen2-7b-instruct-q2_k.gguf: 100%|██████████████████████████████████████████████████████████████████████████████| 3.02G/3.02G [00:22<00:00, 134MB/s]
INFO 08-20 13:19:46 config.py:1552] Downcasting torch.float32 to torch.float16.
WARNING 08-20 13:19:46 config.py:312] gguf quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 08-20 13:19:46 llm_engine.py:182] Initializing an LLM engine (v0.5.4) with config: model='/root/.cache/huggingface/hub/models--Qwen--Qwen2-7B-Instruct-GGUF/snapshots/7c1879f2983b48bb6a5609f7546299b833d25d13/qwen2-7b-instruct-q2_k.gguf', speculative_config=None, tokenizer='/root/.cache/huggingface/hub/models--Qwen--Qwen2-7B-Instruct-GGUF/snapshots/7c1879f2983b48bb6a5609f7546299b833d25d13/qwen2-7b-instruct-q2_k.gguf', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.GGUF, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=gguf, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/root/.cache/huggingface/hub/models--Qwen--Qwen2-7B-Instruct-GGUF/snapshots/7c1879f2983b48bb6a5609f7546299b833d25d13/qwen2-7b-instruct-q2_k.gguf, use_v2_block_manager=False, enable_prefix_caching=False)
/opt/conda/envs/vllm/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be deprecated in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
warnings.warn(
INFO 08-20 13:20:51 selector.py:217] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 08-20 13:20:51 selector.py:116] Using XFormers backend.
/opt/conda/envs/vllm/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
@torch.library.impl_abstract("xformers_flash::flash_fwd")
/opt/conda/envs/vllm/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
@torch.library.impl_abstract("xformers_flash::flash_bwd")
INFO 08-20 13:20:53 model_runner.py:889] Starting to load model /root/.cache/huggingface/hub/models--Qwen--Qwen2-7B-Instruct-GGUF/snapshots/7c1879f2983b48bb6a5609f7546299b833d25d13/qwen2-7b-instruct-q2_k.gguf...
INFO 08-20 13:21:11 selector.py:217] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 08-20 13:21:11 selector.py:116] Using XFormers backend.
INFO 08-20 13:21:30 model_runner.py:900] Loading model weights took 2.9129 GB
INFO 08-20 13:27:13 gpu_executor.py:113] # GPU blocks: 2847, # CPU blocks: 4681
Processed prompts: 100%|████████████████████████████████████████| 8/8 [00:15<00:00, 1.97s/it, est. speed input: 24.03 toks/s, output: 60.25 toks/s]
Prompt: '<|system|>\nYou are a friendly assistant chatbot.</s>\n<|user|>\nvLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.\n</s>\n<|assistant|>\n', Generated text: "Yes, that's correct! VLLM stands for Vectorized Large Language Model and it's designed specifically for inference and serving tasks involving Large Language Models (LLMs). It aims to provide high throughput and memory efficiency by leveraging vectorized operations and optimized memory management techniques.\n\nInference engines like VLLM are crucial for applications that require processing large amounts of text data quickly and efficiently, such as in natural language processing tasks like text generation, question answering, or sentiment analysis. By optimizing these tasks, VLLM can help improve the performance of AI systems deployed in various industries including tech companies, research institutions, and more.\n\nThe vectorized"
Prompt: '<|system|>\nYou are a friendly assistant chatbot.</s>\n<|user|>\nBriefly describe the major milestones in the development of artificial intelligence from 1950 to 2020.\n</s>\n<|assistant|>\n', Generated text: "Artificial Intelligence (AI) development has seen significant milestones since its inception in the mid-twentieth century. Here are some major milestones:\n\n### Early Milestones (1950s)\n\n- **Origins**: AI was conceptualized around the mid-1950s with the creation of the first AI program by Alan Turing himself.\n\n### 1960s Milestones\n\n- **Early AI Programs**: Programs like the Logic Theorist were developed which could prove mathematical theorems using symbolic logic.\n\n### Late 1960s Milestones\n\n- **Early AI Failures**: AI's first major setback"
My original command for running the program was as follows:
python -m vllm.entrypoints.openai.api_server --model qwen2-72b-instruct-q2_k.gguf --served-model-name qwen2-72b-instruct-q2_k --trust-remote-code --max_model_len 2048 --cpu_offload_gb 80 --quantization gguf
I found that the error occurred when I used --cpu_offload_gb 80. When I removed --cpu_offload_gb 80, it ran normally:
python -m vllm.entrypoints.openai.api_server --model qwen2-72b-instruct-q2_k.gguf --served-model-name qwen2-72b-instruct-q2_k --trust-remote-code --max_model_len 2048 --quantization gguf
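That would be consistent with the GGUF metadata being dropped during CPU offload: the traceback shows layer.qweight_type arriving as a plain Tensor without the extra weight_type attribute that gguf.py expects. A minimal sketch of that mechanism in plain PyTorch (an illustration under that assumption, not vLLM's actual offload path):

```python
import torch

# vLLM's GGUF linear method stores the quantization type as an extra
# Python attribute on a parameter; the names here are illustrative.
param = torch.nn.Parameter(torch.empty(4, 4), requires_grad=False)
param.weight_type = 2  # pretend GGUF quant-type tag

# Moving the parameter to another device returns a new tensor object,
# and ad-hoc Python attributes are not carried over to it.
moved = param.to("meta")
print(hasattr(moved, "weight_type"))  # False
```

If --cpu_offload_gb re-materializes weights on a different device like this, the weight_type tag read in gguf.py at apply time would be gone, which matches the AttributeError; dropping the offload flag avoids the move.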
Cool! Thank you~
I've found that the official qwen2-72b-instruct-q2_k.gguf model from ModelScope still fails with the assertion assert loaded_weight.shape[output_dim] == self.org_vocab_size, but when I run my own quantized Q2_K model, it works fine.
It turned out the error was caused by my transformers checkout not being updated to the latest version, which made the official ModelScope model unusable. After updating, it works just fine. 😂
It runs successfully now, but how do I create a chat template?
BadRequestError: Error code: 400 - {'object': 'error', 'message': 'As of transformers v4.44, default chat template is no longer allowed, so you must provide a chat template if the tokenizer does not define one.', 'type': 'BadRequestError', 'param': None, 'code': 400}
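GGUF checkpoints often don't carry a chat template in their tokenizer metadata, so the OpenAI-compatible server needs one supplied explicitly. Qwen2 chat models use the ChatML format, so a minimal Jinja template along these lines should work (sketched from the generic ChatML layout, not copied from the official Qwen2 tokenizer):

```jinja
{# Minimal ChatML-style chat template (illustrative) #}
{% for message in messages %}{{ '<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>\n' }}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}
```

Save it to a file (e.g. qwen2_chatml.jinja; the name is arbitrary) and pass it to the server with --chat-template qwen2_chatml.jinja.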
Your current environment
How would you like to use vllm
When I ran inference with qwen2-72b-instruct-q2_k.gguf, I got an error that I don't know how to deal with: