Closed · paolovic closed this 2 weeks ago
This bug is caused by transformers' GGUF integration; I think you should open an issue in their repo as well.
Update: the root issue is that the GGUFReader in gguf failed to read the checkpoint; it seems the checkpoint is corrupted.
BTW, vLLM doesn't support loading from sharded GGUF files yet; you may need to merge them with gguf-split first.
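For anyone hitting this later: a quick way to tell a corrupted checkpoint from a merely sharded one is to try parsing it with the gguf package directly. A minimal sketch; the path is a placeholder, and the "split" key check is only a heuristic:

```python
# Minimal sketch: check whether a GGUF checkpoint can be parsed at all.
# If this raises, the file is corrupted (or an unmerged shard) and vLLM's
# GGUF loader will fail the same way. The path below is a placeholder.
from gguf import GGUFReader

path = "/path/to/model-Q6_K.gguf"

reader = GGUFReader(path)
print(f"metadata fields: {len(reader.fields)}")
print(f"tensors:         {len(reader.tensors)}")

# Heuristic: sharded checkpoints carry split metadata (e.g. split.count),
# which means the shards still need to be merged with gguf-split.
for key in reader.fields:
    if "split" in key:
        print("split metadata found:", key)
```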
Hi @Isotr0py,
great, that helped and did the trick.
I also had to specify the dtype as half, but then it worked.
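For reference, this is roughly the invocation that ended up working. A minimal sketch, assuming a merged single-file GGUF checkpoint; the file path and tokenizer repo are placeholders:

```python
# Minimal sketch of loading a GGUF checkpoint with vLLM; paths are
# placeholders, not the exact args used in this issue.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/path/to/Llama-3.1-Nemotron-70B-Instruct-HF-Q6_K.gguf",  # merged GGUF
    tokenizer="nvidia/Llama-3.1-Nemotron-70B-Instruct-HF",  # base HF tokenizer
    dtype="half",            # the fix discussed above: force fp16
    tensor_parallel_size=2,  # two L40S GPUs as in this environment
)

outputs = llm.generate(["Hello, world!"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)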
By the way, do you know if it's possible to verify how many layers are loaded onto the GPU? In llama.cpp we can set n_gpu_layers=-1 to ensure all layers are loaded onto the GPU; how can we do the same in vLLM?
Thanks again and best regards!
In vLLM, if you use the GPU backend (the normal installation), all layers are loaded onto the GPU without offloading to the CPU.
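If you want to verify this empirically, you can compare free VRAM before and after engine startup; after loading, used memory should have grown by roughly the checkpoint size. A minimal sketch using plain PyTorch (this is not a vLLM API), assuming the engine runs in the same process:

```python
# Rough sanity check: after the vLLM engine is up, nearly all model
# weights should sit in GPU memory, so free VRAM should have dropped
# by approximately the size of the GGUF file.
import torch

for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    used_gib = (total - free) / 1024**3
    print(f"GPU {i}: {used_gib:.1f} GiB in use of {total / 1024**3:.1f} GiB")
```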
Alright, thank you very much @Isotr0py!
Your current environment
The output of `python collect_env.py`
```text
Collecting environment information...
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Red Hat Enterprise Linux release 8.10 (Ootpa) (x86_64)
GCC version: (GCC) 8.5.0 20210514 (Red Hat 8.5.0-22)
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.28

Python version: 3.11.9 (main, Jun 19 2024, 10:02:06) [GCC 8.5.0 20210514 (Red Hat 8.5.0-22)] (64-bit runtime)
Python platform: Linux-4.18.0-553.16.1.el8_10.x86_64-x86_64-with-glibc2.28
Is CUDA available: True
CUDA runtime version: 12.2.140
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA L40S-48C
GPU 1: NVIDIA L40S-48C

Nvidia driver version: 535.129.03
cuDNN version: Probably one of the following:
/usr/lib64/libcudnn.so.8.9.7
/usr/lib64/libcudnn.so.9.3.0
/usr/lib64/libcudnn_adv.so.9.3.0
/usr/lib64/libcudnn_adv_infer.so.8.9.7
/usr/lib64/libcudnn_adv_train.so.8.9.7
/usr/lib64/libcudnn_cnn.so.9.3.0
/usr/lib64/libcudnn_cnn_infer.so.8.9.7
/usr/lib64/libcudnn_cnn_train.so.8.9.7
/usr/lib64/libcudnn_engines_precompiled.so.9.3.0
/usr/lib64/libcudnn_engines_runtime_compiled.so.9.3.0
/usr/lib64/libcudnn_graph.so.9.3.0
/usr/lib64/libcudnn_heuristic.so.9.3.0
/usr/lib64/libcudnn_ops.so.9.3.0
/usr/lib64/libcudnn_ops_infer.so.8.9.7
/usr/lib64/libcudnn_ops_train.so.8.9.7
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              16
On-line CPU(s) list: 0-15
Thread(s) per core:  1
Core(s) per socket:  16
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               143
Model name:          Intel(R) Xeon(R) Platinum 8462Y+
Stepping:            8
CPU MHz:             2799.999
BogoMIPS:            5599.99
Hypervisor vendor:   VMware
Virtualization type: full
L1d cache:           48K
L1i cache:           32K
L2 cache:            2048K
L3 cache:            61440K
NUMA node0 CPU(s):   0-15
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512_bf16 wbnoinvd arat avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid cldemote movdiri movdir64b fsrm md_clear flush_l1d arch_capabilities

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.6.68
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pyzmq==26.2.0
[pip3] sentence-transformers==3.0.1
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.45.2
[pip3] triton==3.0.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.6.3.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
      GPU0  GPU1  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0   X    PIX   0-15          0              N/A
GPU1  PIX    X    0-15          0              N/A

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
```

Model Input Dumps
No response
🐛 Describe the bug
Hi,
when calling the model Llama-3.1-Nemotron-70B-Instruct-HF-GGUF with Q6_K quantization and the following args, I get this error.
What am I doing wrong?