vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: Qwen2 GGUF model can't run successfully #7689

Closed QB-Chen closed 3 months ago

QB-Chen commented 3 months ago

Your current environment

Collecting environment information...
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.22.4
Libc version: glibc-2.31

Python version: 3.10.12 (main, Jul  5 2023, 18:54:27) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.4.0-100-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 11.8.89
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA A100 80GB PCIe
Nvidia driver version: 535.104.05
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 57 bits virtual
CPU(s):                          112
On-line CPU(s) list:             0-111
Thread(s) per core:              2
Core(s) per socket:              28
Socket(s):                       2
NUMA node(s):                    2
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           106
Model name:                      Intel(R) Xeon(R) Gold 6348 CPU @ 2.60GHz
Stepping:                        6
CPU MHz:                         804.528
CPU max MHz:                     3500.0000
CPU min MHz:                     800.0000
BogoMIPS:                        5200.00
Virtualization:                  VT-x
L1d cache:                       2.6 MiB
L1i cache:                       1.8 MiB
L2 cache:                        70 MiB
L3 cache:                        84 MiB
NUMA node0 CPU(s):               0-27,56-83
NUMA node1 CPU(s):               28-55,84-111
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Enhanced IBRS, IBPB conditional, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 invpcid_single ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq rdpid md_clear pconfig flush_l1d arch_capabilities

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.535.77
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.6.20
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pynvml==11.5.3
[pip3] pyzmq==24.0.1
[pip3] torch==2.4.0
[pip3] torch-tb-profiler==0.4.1
[pip3] torchaudio==2.0.2+cu118
[pip3] torchvision==0.19.0
[pip3] transformers==4.44.0
[pip3] triton==3.0.0
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] nvidia-cublas-cu12        12.1.3.1                 pypi_0    pypi
[conda] nvidia-cuda-cupti-cu12    12.1.105                 pypi_0    pypi
[conda] nvidia-cuda-nvrtc-cu12    12.1.105                 pypi_0    pypi
[conda] nvidia-cuda-runtime-cu12  12.1.105                 pypi_0    pypi
[conda] nvidia-cudnn-cu12         9.1.0.70                 pypi_0    pypi
[conda] nvidia-cufft-cu12         11.0.2.54                pypi_0    pypi
[conda] nvidia-curand-cu12        10.3.2.106               pypi_0    pypi
[conda] nvidia-cusolver-cu12      11.4.5.107               pypi_0    pypi
[conda] nvidia-cusparse-cu12      12.1.0.106               pypi_0    pypi
[conda] nvidia-ml-py              12.535.77                pypi_0    pypi
[conda] nvidia-nccl-cu12          2.20.5                   pypi_0    pypi
[conda] nvidia-nvjitlink-cu12     12.6.20                  pypi_0    pypi
[conda] nvidia-nvtx-cu12          12.1.105                 pypi_0    pypi
[conda] pynvml                    11.5.3                   pypi_0    pypi
[conda] pyzmq                     24.0.1                   pypi_0    pypi
[conda] torch                     2.4.0                    pypi_0    pypi
[conda] torch-tb-profiler         0.4.1                    pypi_0    pypi
[conda] torchaudio                2.0.2+cu118              pypi_0    pypi
[conda] torchvision               0.19.0                   pypi_0    pypi
[conda] transformers              4.44.0                   pypi_0    pypi
[conda] triton                    3.0.0                    pypi_0    pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: N/A
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    NIC0    NIC1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NODE    PXB     0-27,56-83      0               N/A
NIC0    NODE     X      NODE
NIC1    PXB     NODE     X 

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1

How would you like to use vllm

When I ran inference with qwen2-72b-instruct-q2_k.gguf, I got an error that I don't know how to deal with:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/root/sspaas-fs/vllm/vllm/entrypoints/openai/rpc/server.py", line 222, in run_rpc_server
    server = AsyncEngineRPCServer(async_engine_args, usage_context, rpc_path)
  File "/root/sspaas-fs/vllm/vllm/entrypoints/openai/rpc/server.py", line 26, in __init__
    self.engine = AsyncLLMEngine.from_engine_args(async_engine_args,
  File "/root/sspaas-fs/vllm/vllm/engine/async_llm_engine.py", line 735, in from_engine_args
    engine = cls(
  File "/root/sspaas-fs/vllm/vllm/engine/async_llm_engine.py", line 631, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/root/sspaas-fs/vllm/vllm/engine/async_llm_engine.py", line 830, in _init_engine
    return engine_class(*args, **kwargs)
  File "/root/sspaas-fs/vllm/vllm/engine/async_llm_engine.py", line 267, in __init__
    super().__init__(*args, **kwargs)
  File "/root/sspaas-fs/vllm/vllm/engine/llm_engine.py", line 268, in __init__
    self.model_executor = executor_class(
  File "/root/sspaas-fs/vllm/vllm/executor/executor_base.py", line 46, in __init__
    self._init_executor()
  File "/root/sspaas-fs/vllm/vllm/executor/gpu_executor.py", line 36, in _init_executor
    self.driver_worker.load_model()
  File "/root/sspaas-fs/vllm/vllm/worker/worker.py", line 151, in load_model
    self.model_runner.load_model()
  File "/root/sspaas-fs/vllm/vllm/worker/model_runner.py", line 891, in load_model
    self.model = get_model(model_config=self.model_config,
  File "/root/sspaas-fs/vllm/vllm/model_executor/model_loader/__init__.py", line 19, in get_model
    return loader.load_model(model_config=model_config,
  File "/root/sspaas-fs/vllm/vllm/model_executor/model_loader/loader.py", line 1034, in load_model
    model.load_weights(
  File "/root/sspaas-fs/vllm/vllm/model_executor/models/qwen2.py", line 437, in load_weights
    weight_loader(param, loaded_weight)
  File "/root/sspaas-fs/vllm/vllm/model_executor/layers/vocab_parallel_embedding.py", line 376, in weight_loader
    assert loaded_weight.shape[output_dim] == self.org_vocab_size
AssertionError
Traceback (most recent call last):
  File "/root/sspaas-fs/vllm/vllm/entrypoints/openai/api_server.py", line 150, in build_async_engine_client
    await async_engine_client.setup()
  File "/root/sspaas-fs/vllm/vllm/entrypoints/openai/rpc/client.py", line 35, in setup
    await self.wait_for_server()
  File "/root/sspaas-fs/vllm/vllm/entrypoints/openai/rpc/client.py", line 136, in wait_for_server
    await self._send_one_way_rpc_request(
  File "/root/sspaas-fs/vllm/vllm/entrypoints/openai/rpc/client.py", line 112, in _send_one_way_rpc_request
    raise TimeoutError(f"server didn't reply within {timeout} ms")
TimeoutError: server didn't reply within 1000 ms

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/root/sspaas-fs/vllm/vllm/entrypoints/openai/api_server.py", line 432, in <module>
    asyncio.run(run_server(args))
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/root/sspaas-fs/vllm/vllm/entrypoints/openai/api_server.py", line 403, in run_server
    async with build_async_engine_client(args) as async_engine_client:
  File "/opt/conda/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/root/sspaas-fs/vllm/vllm/entrypoints/openai/api_server.py", line 154, in build_async_engine_client
    raise RuntimeError(
RuntimeError: The server process died before responding to the readiness probe
Isotr0py commented 3 months ago

@QB-Chen This is possibly related to an issue in the transformers GGUF integration rather than the vLLM implementation.

I have a merged patch in transformers that fixes it. Can you check whether installing transformers from the latest source fixes this?
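
For reference, installing transformers from the current main branch typically looks like this (a minimal sketch; adjust to your environment, e.g. run it inside the same virtualenv that vLLM uses):

pip install -U git+https://github.com/huggingface/transformers.git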

QB-Chen commented 3 months ago

@QB-Chen This is possibly related to an issue in the transformers GGUF integration rather than the vLLM implementation.

I have a merged patch in transformers that fixes it. Can you check whether installing transformers from the latest source fixes this?

A new problem:

Process SpawnProcess-1:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/root/sspaas-fs/vllm/vllm/entrypoints/openai/rpc/server.py", line 222, in run_rpc_server
    server = AsyncEngineRPCServer(async_engine_args, usage_context, rpc_path)
  File "/root/sspaas-fs/vllm/vllm/entrypoints/openai/rpc/server.py", line 26, in __init__
    self.engine = AsyncLLMEngine.from_engine_args(async_engine_args,
  File "/root/sspaas-fs/vllm/vllm/engine/async_llm_engine.py", line 735, in from_engine_args
    engine = cls(
  File "/root/sspaas-fs/vllm/vllm/engine/async_llm_engine.py", line 631, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/root/sspaas-fs/vllm/vllm/engine/async_llm_engine.py", line 830, in _init_engine
    return engine_class(*args, **kwargs)
  File "/root/sspaas-fs/vllm/vllm/engine/async_llm_engine.py", line 267, in __init__
    super().__init__(*args, **kwargs)
  File "/root/sspaas-fs/vllm/vllm/engine/llm_engine.py", line 282, in __init__
    self._initialize_kv_caches()
  File "/root/sspaas-fs/vllm/vllm/engine/llm_engine.py", line 388, in _initialize_kv_caches
    self.model_executor.determine_num_available_blocks())
  File "/root/sspaas-fs/vllm/vllm/executor/gpu_executor.py", line 105, in determine_num_available_blocks
    return self.driver_worker.determine_num_available_blocks()
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/root/sspaas-fs/vllm/vllm/worker/worker.py", line 191, in determine_num_available_blocks
    self.model_runner.profile_run()
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/root/sspaas-fs/vllm/vllm/worker/model_runner.py", line 1107, in profile_run
    self.execute_model(model_input, kv_caches, intermediate_tensors)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/root/sspaas-fs/vllm/vllm/worker/model_runner.py", line 1536, in execute_model
    hidden_or_intermediate_states = model_executable(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/sspaas-fs/vllm/vllm/model_executor/models/qwen2.py", line 361, in forward
    hidden_states = self.model(input_ids, positions, kv_caches,
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/sspaas-fs/vllm/vllm/model_executor/models/qwen2.py", line 277, in forward
    hidden_states, residual = layer(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/sspaas-fs/vllm/vllm/model_executor/models/utils.py", line 169, in forward
    output = functional_call(module,
  File "/opt/conda/lib/python3.10/site-packages/torch/_functorch/functional_call.py", line 144, in functional_call
    return nn.utils.stateless._functional_call(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/utils/stateless.py", line 270, in _functional_call
    return module(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/sspaas-fs/vllm/vllm/model_executor/models/qwen2.py", line 210, in forward
    hidden_states = self.self_attn(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/sspaas-fs/vllm/vllm/model_executor/models/qwen2.py", line 154, in forward
    qkv, _ = self.qkv_proj(hidden_states)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/sspaas-fs/vllm/vllm/model_executor/layers/linear.py", line 359, in forward
    output_parallel = self.quant_method.apply(self, input_, bias)
  File "/root/sspaas-fs/vllm/vllm/model_executor/layers/quantization/gguf.py", line 134, in apply
    qweight_type = layer.qweight_type.weight_type
AttributeError: 'Tensor' object has no attribute 'weight_type'
Traceback (most recent call last):
  File "/root/sspaas-fs/vllm/vllm/entrypoints/openai/api_server.py", line 150, in build_async_engine_client
    await async_engine_client.setup()
  File "/root/sspaas-fs/vllm/vllm/entrypoints/openai/rpc/client.py", line 35, in setup
    await self.wait_for_server()
  File "/root/sspaas-fs/vllm/vllm/entrypoints/openai/rpc/client.py", line 136, in wait_for_server
    await self._send_one_way_rpc_request(
  File "/root/sspaas-fs/vllm/vllm/entrypoints/openai/rpc/client.py", line 112, in _send_one_way_rpc_request
    raise TimeoutError(f"server didn't reply within {timeout} ms")
TimeoutError: server didn't reply within 1000 ms

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/root/sspaas-fs/vllm/vllm/entrypoints/openai/api_server.py", line 432, in <module>
    asyncio.run(run_server(args))
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/root/sspaas-fs/vllm/vllm/entrypoints/openai/api_server.py", line 403, in run_server
    async with build_async_engine_client(args) as async_engine_client:
  File "/opt/conda/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/root/sspaas-fs/vllm/vllm/entrypoints/openai/api_server.py", line 154, in build_async_engine_client
    raise RuntimeError(
RuntimeError: The server process died before responding to the readiness probe
QB-Chen commented 3 months ago

After cloning and installing the latest transformers (4.45.0.dev0) from source, I re-ran vLLM inference with the Qwen2 GGUF model and hit this new issue:

AttributeError: 'Tensor' object has no attribute 'weight_type'
Isotr0py commented 3 months ago

Did other models like Qwen2-7B-GGUF/Qwen2-14B-GGUF also hit this error, or does only the 72B have this new issue? (72B is too large for me to reproduce. The 7B should share the same root cause as the 72B, so we could reproduce it there if this is related to the root issue.)

BTW, I can run 7B inference with transformers (4.45.0.dev0) without any issue, and the new error above is very strange and shouldn't be hit in most cases:

qwen2-7b-instruct-q2_k.gguf: 100%|██████████████████████████████████████████████████████████████████████████████| 3.02G/3.02G [00:22<00:00, 134MB/s]
INFO 08-20 13:19:46 config.py:1552] Downcasting torch.float32 to torch.float16.
WARNING 08-20 13:19:46 config.py:312] gguf quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 08-20 13:19:46 llm_engine.py:182] Initializing an LLM engine (v0.5.4) with config: model='/root/.cache/huggingface/hub/models--Qwen--Qwen2-7B-Instruct-GGUF/snapshots/7c1879f2983b48bb6a5609f7546299b833d25d13/qwen2-7b-instruct-q2_k.gguf', speculative_config=None, tokenizer='/root/.cache/huggingface/hub/models--Qwen--Qwen2-7B-Instruct-GGUF/snapshots/7c1879f2983b48bb6a5609f7546299b833d25d13/qwen2-7b-instruct-q2_k.gguf', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.GGUF, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=gguf, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/root/.cache/huggingface/hub/models--Qwen--Qwen2-7B-Instruct-GGUF/snapshots/7c1879f2983b48bb6a5609f7546299b833d25d13/qwen2-7b-instruct-q2_k.gguf, use_v2_block_manager=False, enable_prefix_caching=False)
/opt/conda/envs/vllm/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be deprecated in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
  warnings.warn(
INFO 08-20 13:20:51 selector.py:217] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 08-20 13:20:51 selector.py:116] Using XFormers backend.
/opt/conda/envs/vllm/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_fwd")
/opt/conda/envs/vllm/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_bwd")
INFO 08-20 13:20:53 model_runner.py:889] Starting to load model /root/.cache/huggingface/hub/models--Qwen--Qwen2-7B-Instruct-GGUF/snapshots/7c1879f2983b48bb6a5609f7546299b833d25d13/qwen2-7b-instruct-q2_k.gguf...
INFO 08-20 13:21:11 selector.py:217] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 08-20 13:21:11 selector.py:116] Using XFormers backend.
INFO 08-20 13:21:30 model_runner.py:900] Loading model weights took 2.9129 GB
INFO 08-20 13:27:13 gpu_executor.py:113] # GPU blocks: 2847, # CPU blocks: 4681
Processed prompts: 100%|████████████████████████████████████████| 8/8 [00:15<00:00,  1.97s/it, est. speed input: 24.03 toks/s, output: 60.25 toks/s]
Prompt: '<|system|>\nYou are a friendly assistant chatbot.</s>\n<|user|>\nvLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.\n</s>\n<|assistant|>\n', Generated text: "Yes, that's correct! VLLM stands for Vectorized Large Language Model and it's designed specifically for inference and serving tasks involving Large Language Models (LLMs). It aims to provide high throughput and memory efficiency by leveraging vectorized operations and optimized memory management techniques.\n\nInference engines like VLLM are crucial for applications that require processing large amounts of text data quickly and efficiently, such as in natural language processing tasks like text generation, question answering, or sentiment analysis. By optimizing these tasks, VLLM can help improve the performance of AI systems deployed in various industries including tech companies, research institutions, and more.\n\nThe vectorized"
Prompt: '<|system|>\nYou are a friendly assistant chatbot.</s>\n<|user|>\nBriefly describe the major milestones in the development of artificial intelligence from 1950 to 2020.\n</s>\n<|assistant|>\n', Generated text: "Artificial Intelligence (AI) development has seen significant milestones since its inception in the mid-twentieth century. Here are some major milestones:\n\n### Early Milestones (1950s)\n\n- **Origins**: AI was conceptualized around the mid-1950s with the creation of the first AI program by Alan Turing himself.\n\n### 1960s Milestones\n\n- **Early AI Programs**: Programs like the Logic Theorist were developed which could prove mathematical theorems using symbolic logic.\n\n### Late 1960s Milestones\n\n- **Early AI Failures**: AI's first major setback"
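
For anyone who wants to reproduce the 7B run above, a minimal offline script looks roughly like the following (a sketch only; the GGUF path, prompt, and sampling settings are placeholders rather than the exact script used here):

from vllm import LLM, SamplingParams

# Placeholder path to a locally downloaded Qwen2 GGUF file.
gguf_path = "qwen2-7b-instruct-q2_k.gguf"

# GGUF quantization is detected from the file; the tokenizer is loaded from the
# same GGUF file, matching the engine config shown in the log above.
llm = LLM(model=gguf_path, tokenizer=gguf_path, enforce_eager=True, max_model_len=2048)

sampling_params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(
    ["vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs."],
    sampling_params,
)
for output in outputs:
    print(output.outputs[0].text)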
QB-Chen commented 3 months ago

Did other models like Qwen2-7B-GGUF/Qwen2-14B-GGUF also hit this error, or does only the 72B have this new issue? (72B is too large for me to reproduce. The 7B should share the same root cause as the 72B, so we could reproduce it there if this is related to the root issue.)

BTW, I can run 7B inference with transformers (4.45.0.dev0) without any issue, and the new error above is very strange and shouldn't be hit in most cases.

My original command for running the program was as follows:

python -m vllm.entrypoints.openai.api_server --model qwen2-72b-instruct-q2_k.gguf --served-model-name qwen2-72b-instruct-q2_k --trust-remote-code --max_model_len 2048 --cpu_offload_gb 80 --quantization gguf

I found that the error occurred when I used --cpu_offload_gb 80. When I removed that flag, it ran normally:

python -m vllm.entrypoints.openai.api_server --model qwen2-72b-instruct-q2_k.gguf --served-model-name qwen2-72b-instruct-q2_k --trust-remote-code --max_model_len 2048 --quantization gguf

Cool! Thank you~

QB-Chen commented 3 months ago

@QB-Chen This is possibly related to an issue in the transformers GGUF integration rather than the vLLM implementation.

I have a merged patch in transformers that fixes it. Can you check whether installing transformers from the latest source fixes this?

I've found that the official qwen2-72b-instruct-q2_k.gguf model from ModelScope still fails on the assertion assert loaded_weight.shape[output_dim] == self.org_vocab_size, but when I run it with my own Q2_K quantization, it works fine.

QB-Chen commented 3 months ago

I found that the error was caused by my transformers install not being updated to the latest version, which is why the official ModelScope model couldn't be used. After updating, it worked just fine. 😂

lonngxiang commented 3 months ago

It runs successfully, but how do I create a chat template?

BadRequestError: Error code: 400 - {'object': 'error', 'message': 'As of transformers v4.44, default chat template is no longer allowed, so you must provide a chat template if the tokenizer does not define one.', 'type': 'BadRequestError', 'param': None, 'code': 400}
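
One possible workaround (a sketch, not tested against this exact setup): since transformers v4.44 removed the default chat-template fallback, you can pass an explicit Jinja chat template to the server via vLLM's --chat-template flag. The file name below is a placeholder; the template itself would typically be taken from the chat_template field of the original Qwen2-72B-Instruct tokenizer_config.json (ChatML style with <|im_start|>/<|im_end|> markers):

python -m vllm.entrypoints.openai.api_server --model qwen2-72b-instruct-q2_k.gguf --served-model-name qwen2-72b-instruct-q2_k --trust-remote-code --max_model_len 2048 --quantization gguf --chat-template ./qwen2_chatml_template.jinja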