vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: GGUF Llama-3.1-Nemotron-70B-Instruct-HF ValueError: cannot reshape array of size into shape #9558

Closed: paolovic closed this issue 2 weeks ago

paolovic commented 3 weeks ago

Your current environment

The output of `python collect_env.py` ```text Collecting environment information... PyTorch version: 2.4.0+cu121 Is debug build: False CUDA used to build PyTorch: 12.1 ROCM used to build PyTorch: N/A OS: Red Hat Enterprise Linux release 8.10 (Ootpa) (x86_64) GCC version: (GCC) 8.5.0 20210514 (Red Hat 8.5.0-22) Clang version: Could not collect CMake version: Could not collect Libc version: glibc-2.28 Python version: 3.11.9 (main, Jun 19 2024, 10:02:06) [GCC 8.5.0 20210514 (Red Hat 8.5.0-22)] (64-bit runtime) Python platform: Linux-4.18.0-553.16.1.el8_10.x86_64-x86_64-with-glibc2.28 Is CUDA available: True CUDA runtime version: 12.2.140 CUDA_MODULE_LOADING set to: LAZY GPU models and configuration: GPU 0: NVIDIA L40S-48C GPU 1: NVIDIA L40S-48C Nvidia driver version: 535.129.03 cuDNN version: Probably one of the following: /usr/lib64/libcudnn.so.8.9.7 /usr/lib64/libcudnn.so.9.3.0 /usr/lib64/libcudnn_adv.so.9.3.0 /usr/lib64/libcudnn_adv_infer.so.8.9.7 /usr/lib64/libcudnn_adv_train.so.8.9.7 /usr/lib64/libcudnn_cnn.so.9.3.0 /usr/lib64/libcudnn_cnn_infer.so.8.9.7 /usr/lib64/libcudnn_cnn_train.so.8.9.7 /usr/lib64/libcudnn_engines_precompiled.so.9.3.0 /usr/lib64/libcudnn_engines_runtime_compiled.so.9.3.0 /usr/lib64/libcudnn_graph.so.9.3.0 /usr/lib64/libcudnn_heuristic.so.9.3.0 /usr/lib64/libcudnn_ops.so.9.3.0 /usr/lib64/libcudnn_ops_infer.so.8.9.7 /usr/lib64/libcudnn_ops_train.so.8.9.7 HIP runtime version: N/A MIOpen runtime version: N/A Is XNNPACK available: True CPU: Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 16 On-line CPU(s) list: 0-15 Thread(s) per core: 1 Core(s) per socket: 16 Socket(s): 1 NUMA node(s): 1 Vendor ID: GenuineIntel CPU family: 6 Model: 143 Model name: Intel(R) Xeon(R) Platinum 8462Y+ Stepping: 8 CPU MHz: 2799.999 BogoMIPS: 5599.99 Hypervisor vendor: VMware Virtualization type: full L1d cache: 48K L1i cache: 32K L2 cache: 2048K L3 cache: 61440K NUMA node0 CPU(s): 0-15 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512_bf16 wbnoinvd arat avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid cldemote movdiri movdir64b fsrm md_clear flush_l1d arch_capabilities Versions of relevant libraries: [pip3] numpy==1.26.4 [pip3] nvidia-cublas-cu12==12.1.3.1 [pip3] nvidia-cuda-cupti-cu12==12.1.105 [pip3] nvidia-cuda-nvrtc-cu12==12.1.105 [pip3] nvidia-cuda-runtime-cu12==12.1.105 [pip3] nvidia-cudnn-cu12==9.1.0.70 [pip3] nvidia-cufft-cu12==11.0.2.54 [pip3] nvidia-curand-cu12==10.3.2.106 [pip3] nvidia-cusolver-cu12==11.4.5.107 [pip3] nvidia-cusparse-cu12==12.1.0.106 [pip3] nvidia-ml-py==12.560.30 [pip3] nvidia-nccl-cu12==2.20.5 [pip3] nvidia-nvjitlink-cu12==12.6.68 [pip3] nvidia-nvtx-cu12==12.1.105 [pip3] pyzmq==26.2.0 [pip3] sentence-transformers==3.0.1 [pip3] torch==2.4.0 [pip3] torchvision==0.19.0 [pip3] transformers==4.45.2 [pip3] triton==3.0.0 [conda] Could not collect ROCM Version: Could not collect Neuron SDK Version: N/A vLLM Version: 0.6.3.post1 
vLLM Build Flags: CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled GPU Topology: GPU0 GPU1 CPU Affinity NUMA Affinity GPU NUMA ID GPU0 X PIX 0-15 0 N/A GPU1 PIX X 0-15 0 N/A Legend: X = Self SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI) NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU) PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge) PIX = Connection traversing at most a single PCIe bridge NV# = Connection traversing a bonded set of # NVLinks ```

Model Input Dumps

No response

🐛 Describe the bug

Hi,

when calling the model Llama-3.1-Nemotron-70B-Instruct-HF-GGUF with Q6_K quantization and the following args:

  args:
    default_max_tokens: 4096
    model: /models/Llama-3.1-Nemotron-70B-Instruct-HF-Q6_K/Llama-3.1-Nemotron-70B-Instruct-HF-Q6_K-00001-of-00002.gguf
    dtype: bfloat16
    quantization: gguf
    max_model_len: 8192
    tensor_parallel_size: 2
    enforce_eager: True
    gpu_memory_utilization: 0.8

I get the following error:

          The deployment failed to start 3 times in a row. This may be due to a problem with its constructor or initial health check failing. See controller logs for details. Retrying after 1 seconds. Error:
          ray::1 70B completions:vLLMGenericAPI.initialize_and_get_metadata() (pid=3098946, ip=159.103.253.75, actor_id=23d2fed59332cbed1efae01001000000, repr=<ray.serve._private.replica.ServeReplica:LLaMA 3.1 70B completions:vLLMGenericAPI object at 0x7f2fbc609fd0>)
            File "/usr/lib64/python3.11/concurrent/futures/_base.py", line 449, in result
              return self.__get_result()
                     ^^^^^^^^^^^^^^^^^^^
            File "/usr/lib64/python3.11/concurrent/futures/_base.py", line 401, in __get_result
              raise self._exception
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
            File "/tmp/runtime_resources/pip/123456789/virtualenv/lib64/python3.11/site-packages/ray/serve/_private/replica.py", line 631, in initialize_and_get_metadata
              raise RuntimeError(traceback.format_exc()) from None
          RuntimeError: Traceback (most recent call last):
            File "/tmp/runtime_resources/pip/123456789/virtualenv/lib64/python3.11/site-packages/ray/serve/_private/replica.py", line 609, in initialize_and_get_metadata
              await self._user_callable_wrapper.initialize_callable()
            File "/tmp/runtime_resources/pip/123456789/virtualenv/lib64/python3.11/site-packages/ray/serve/_private/replica.py", line 901, in initialize_callable
              await self._call_func_or_gen(
            File "/tmp/runtime_resources/pip/123456789/virtualenv/lib64/python3.11/site-packages/ray/serve/_private/replica.py", line 867, in _call_func_or_gen
              result = callable(*args, **kwargs)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^
            File "/tmp/runtime_resources/pip/123456789/virtualenv/lib64/python3.11/site-packages/ray/serve/api.py", line 219, in __init__
              cls.__init__(self, *args, **kwargs)
            File "/u01/app/mlo/projects/llm-apis/ray_vllm_inference/vllm_serve.py", line 118, in __init__
              self.engine = AsyncLLMEngine.from_engine_args(args)
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
            File "/tmp/runtime_resources/pip/123456789/virtualenv/lib64/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 664, in from_engine_args
              engine_config = engine_args.create_engine_config()
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
            File "/tmp/runtime_resources/pip/123456789/virtualenv/lib64/python3.11/site-packages/vllm/engine/arg_utils.py", line 903, in create_engine_config
              model_config = self.create_model_config()
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^
            File "/tmp/runtime_resources/pip/123456789/virtualenv/lib64/python3.11/site-packages/vllm/engine/arg_utils.py", line 839, in create_model_config
              return ModelConfig(
                     ^^^^^^^^^^^^
            File "/tmp/runtime_resources/pip/123456789/virtualenv/lib64/python3.11/site-packages/vllm/config.py", line 162, in __init__
              self.hf_config = get_config(self.model, trust_remote_code, revision,
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
            File "/tmp/runtime_resources/pip/123456789/virtualenv/lib64/python3.11/site-packages/vllm/transformers_utils/config.py", line 171, in get_config
              config_dict, _ = PretrainedConfig.get_config_dict(
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
            File "/tmp/runtime_resources/pip/123456789/virtualenv/lib64/python3.11/site-packages/transformers/configuration_utils.py", line 570, in get_config_dict
              config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
            File "/tmp/runtime_resources/pip/123456789/virtualenv/lib64/python3.11/site-packages/transformers/configuration_utils.py", line 661, in _get_config_dict
              config_dict = load_gguf_checkpoint(resolved_config_file, return_tensors=False)["config"]
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
            File "/tmp/runtime_resources/pip/123456789/virtualenv/lib64/python3.11/site-packages/transformers/modeling_gguf_pytorch_utils.py", line 83, in load_gguf_checkpoint
              reader = GGUFReader(gguf_checkpoint_path)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
            File "/tmp/runtime_resources/pip/123456789/virtualenv/lib64/python3.11/site-packages/gguf/gguf_reader.py", line 130, in __init__
              self._build_tensors(offs, tensors_fields)
            File "/tmp/runtime_resources/pip/123456789/virtualenv/lib64/python3.11/site-packages/gguf/gguf_reader.py", line 314, in _build_tensors
              data = self._get(data_offs, item_type, item_count).reshape(np_dims),
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
          ValueError: cannot reshape array of size 186342275 into shape (8192,23520)

What am I doing wrong?


Isotr0py commented 3 weeks ago

This bug is caused by the transformers GGUF integration; I think you had better open an issue in their repo as well.

Update: the root issue is that the GGUFReader in gguf fails to read the checkpoint; it looks like the checkpoint is corrupted.

BTW, vLLM doesn't support loading from sharded GGUF files yet, so you might need to merge them with gguf-split first.
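For reference, the merge can be done with the gguf-split tool that ships with llama.cpp: point it at the first shard and the remaining shards are picked up from the split metadata. The binary name and output filename below are illustrative (older builds name the binary gguf-split rather than llama-gguf-split):

    # merge the shards into a single GGUF file (filenames are placeholders)
    ./llama-gguf-split --merge \
        Llama-3.1-Nemotron-70B-Instruct-HF-Q6_K-00001-of-00002.gguf \
        Llama-3.1-Nemotron-70B-Instruct-HF-Q6_K-merged.gguf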

paolovic commented 2 weeks ago

Hi @Isotr0py, great, that helped and did the trick. I additionally had to set the dtype to half, and then it worked.
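For reference, the resulting working configuration looks roughly like this (the merged filename is a placeholder for whatever gguf-split produced):

  args:
    default_max_tokens: 4096
    model: /models/Llama-3.1-Nemotron-70B-Instruct-HF-Q6_K/Llama-3.1-Nemotron-70B-Instruct-HF-Q6_K-merged.gguf
    dtype: half
    quantization: gguf
    max_model_len: 8192
    tensor_parallel_size: 2
    enforce_eager: True
    gpu_memory_utilization: 0.8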

By the way: do you know if it's possible to verify how many layers are loaded onto the GPU? In llama.cpp we can set n_gpu_layers=-1 to ensure all layers are loaded to the GPU; how can we do the same in vLLM?

Thanks again and best regards!

Isotr0py commented 2 weeks ago

In vLLM, if you use the GPU backend (the normal installation), all layers are loaded onto the GPU; nothing is offloaded to the CPU.
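As a rough sanity check, a minimal sketch with the offline LLM API is shown below (the merged GGUF path and the HF tokenizer repo are assumptions, not taken from this thread); while it loads, nvidia-smi should show GPU memory growing by roughly the size of the quantized checkpoint, since every layer lives on the GPU:

    # Minimal sketch; paths and tokenizer repo are placeholders.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="/models/Llama-3.1-Nemotron-70B-Instruct-HF-Q6_K-merged.gguf",  # merged GGUF file
        tokenizer="nvidia/Llama-3.1-Nemotron-70B-Instruct-HF",  # GGUF is usually paired with the original HF tokenizer
        quantization="gguf",
        dtype="half",
        max_model_len=8192,
        tensor_parallel_size=2,
        enforce_eager=True,
        gpu_memory_utilization=0.8,
    )

    # There is no n_gpu_layers knob: with the CUDA build, all layers sit on the GPU.
    # gpu_memory_utilization only caps how much GPU memory vLLM may use for weights + KV cache.
    out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
    print(out[0].outputs[0].text)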

paolovic commented 2 weeks ago

Alright, thank you very much @Isotr0py!