vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Can't load BNB model #6861

Open · eldarkurtic opened this issue 1 month ago

eldarkurtic commented 1 month ago

Your current environment

The output of `python collect_env.py`
Collecting environment information...
PyTorch version: 2.3.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.30.1
Libc version: glibc-2.35

Python version: 3.11.9 (main, Apr 19 2024, 16:48:06) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-100-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.3.107
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA H100 80GB HBM3
GPU 1: NVIDIA H100 80GB HBM3
GPU 2: NVIDIA H100 80GB HBM3
GPU 3: NVIDIA H100 80GB HBM3
GPU 4: NVIDIA H100 80GB HBM3
GPU 5: NVIDIA H100 80GB HBM3
GPU 6: NVIDIA H100 80GB HBM3
GPU 7: NVIDIA H100 80GB HBM3

Nvidia driver version: 545.23.08
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      52 bits physical, 57 bits virtual
Byte Order:                         Little Endian
CPU(s):                             384
On-line CPU(s) list:                0-383
Vendor ID:                          AuthenticAMD
Model name:                         AMD EPYC 9654 96-Core Processor
CPU family:                         25
Model:                              17
Thread(s) per core:                 2
Core(s) per socket:                 96
Socket(s):                          2
Stepping:                           1
Frequency boost:                    enabled
CPU max MHz:                        3707.8120
CPU min MHz:                        1500.0000
BogoMIPS:                           4800.14
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid overflow_recov succor smca fsrm flush_l1d
Virtualization:                     AMD-V
L1d cache:                          6 MiB (192 instances)
L1i cache:                          6 MiB (192 instances)
L2 cache:                           192 MiB (192 instances)
L3 cache:                           768 MiB (24 instances)
NUMA node(s):                       2
NUMA node0 CPU(s):                  0-95,192-287
NUMA node1 CPU(s):                  96-191,288-383
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Mitigation; safe RET
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] torch==2.3.1
[pip3] torchvision==0.18.1
[pip3] transformers==4.43.2
[pip3] triton==2.3.1
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] nvidia-nccl-cu12          2.20.5                   pypi_0    pypi
[conda] torch                     2.3.1                    pypi_0    pypi
[conda] torchvision               0.18.1                   pypi_0    pypi
[conda] transformers              4.43.2                   pypi_0    pypi
[conda] triton                    2.3.1                    pypi_0    pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.3.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    NIC6    NIC7    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X  NV18    NV18    NV18    NV18    NV18    NV18    NV18    SYS SYS PIX SYS SYS SYS SYS SYS 0-95,192-287    0       N/A
GPU1    NV18     X  NV18    NV18    NV18    NV18    NV18    NV18    SYS SYS SYS PIX SYS SYS SYS SYS 0-95,192-287    0       N/A
GPU2    NV18    NV18     X  NV18    NV18    NV18    NV18    NV18    SYS PIX SYS SYS SYS SYS SYS SYS 0-95,192-287    0       N/A
GPU3    NV18    NV18    NV18     X  NV18    NV18    NV18    NV18    PIX SYS SYS SYS SYS SYS SYS SYS 0-95,192-287    0       N/A
GPU4    NV18    NV18    NV18    NV18     X  NV18    NV18    NV18    SYS SYS SYS SYS SYS SYS PIX SYS 96-191,288-383  1       N/A
GPU5    NV18    NV18    NV18    NV18    NV18     X  NV18    NV18    SYS SYS SYS SYS SYS SYS SYS PIX 96-191,288-383  1       N/A
GPU6    NV18    NV18    NV18    NV18    NV18    NV18     X  NV18    SYS SYS SYS SYS SYS PIX SYS SYS 96-191,288-383  1       N/A
GPU7    NV18    NV18    NV18    NV18    NV18    NV18    NV18     X  SYS SYS SYS SYS PIX SYS SYS SYS 96-191,288-383  1       N/A
NIC0    SYS SYS SYS PIX SYS SYS SYS SYS  X  SYS SYS SYS SYS SYS SYS SYS
NIC1    SYS SYS PIX SYS SYS SYS SYS SYS SYS  X  SYS SYS SYS SYS SYS SYS
NIC2    PIX SYS SYS SYS SYS SYS SYS SYS SYS SYS  X  SYS SYS SYS SYS SYS
NIC3    SYS PIX SYS SYS SYS SYS SYS SYS SYS SYS SYS  X  SYS SYS SYS SYS
NIC4    SYS SYS SYS SYS SYS SYS SYS PIX SYS SYS SYS SYS  X  SYS SYS SYS
NIC5    SYS SYS SYS SYS SYS SYS PIX SYS SYS SYS SYS SYS SYS  X  SYS SYS
NIC6    SYS SYS SYS SYS PIX SYS SYS SYS SYS SYS SYS SYS SYS SYS  X  SYS
NIC7    SYS SYS SYS SYS SYS PIX SYS SYS SYS SYS SYS SYS SYS SYS SYS  X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5
  NIC6: mlx5_6
  NIC7: mlx5_7

🐛 Describe the bug

I am trying to evaluate a BNB-quantized model (https://huggingface.co/hugging-quants/Meta-Llama-3.1-405B-Instruct-BNB-NF4) with vLLM through lm-evaluation-harness. This is the command I am running:

lm_eval \
  --model vllm \
  --model_args pretrained="hugging-quants/Meta-Llama-3.1-405B-Instruct-BNB-NF4",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.9 \
  --tasks winogrande \
  --num_fewshot 5 \
  --batch_size 1

and I am seeing the following error (which I think comes from vLLM rather than the harness):

WARNING 07-27 13:06:47 config.py:246] bitsandbytes quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 07-27 13:06:47 llm_engine.py:176] Initializing an LLM engine (v0.5.3.post1) with config: model='/home/meta-llama/Meta-Llama-3.1-405B-Instruct-BNB-NF4', speculative_config=None, tokenizer='/home/meta-llama/Meta-Llama-3.1-405B-Instruct-BNB-NF4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=bitsandbytes, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=1234, served_model_name=/home/meta-llama/Meta-Llama-3.1-405B-Instruct-BNB-NF4, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 07-27 13:06:51 model_runner.py:680] Starting to load model /home/meta-llama/Meta-Llama-3.1-405B-Instruct-BNB-NF4...
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/eldar/miniconda3/envs/lmeval_llama31/bin/lm_eval", line 8, in <module>
[rank0]:     sys.exit(cli_evaluate())
[rank0]:              ^^^^^^^^^^^^^^
[rank0]:   File "/home/eldar/github/neuralmagic/lm-evaluation-harness/lm_eval/__main__.py", line 382, in cli_evaluate
[rank0]:     results = evaluator.simple_evaluate(
[rank0]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/eldar/github/neuralmagic/lm-evaluation-harness/lm_eval/utils.py", line 397, in _wrapper
[rank0]:     return fn(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/eldar/github/neuralmagic/lm-evaluation-harness/lm_eval/evaluator.py", line 198, in simple_evaluate
[rank0]:     lm = lm_eval.api.registry.get_model(model).create_from_arg_string(
[rank0]:          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/eldar/github/neuralmagic/lm-evaluation-harness/lm_eval/api/model.py", line 147, in create_from_arg_string
[rank0]:     return cls(**args, **args2)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/eldar/github/neuralmagic/lm-evaluation-harness/lm_eval/models/vllm_causallms.py", line 103, in __init__
[rank0]:     self.model = LLM(**self.model_args)
[rank0]:                  ^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/eldar/miniconda3/envs/lmeval_llama31/lib/python3.11/site-packages/vllm/entrypoints/llm.py", line 155, in __init__
[rank0]:     self.llm_engine = LLMEngine.from_engine_args(
[rank0]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/eldar/miniconda3/envs/lmeval_llama31/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 441, in from_engine_args
[rank0]:     engine = cls(
[rank0]:              ^^^^
[rank0]:   File "/home/eldar/miniconda3/envs/lmeval_llama31/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 251, in __init__
[rank0]:     self.model_executor = executor_class(
[rank0]:                           ^^^^^^^^^^^^^^^
[rank0]:   File "/home/eldar/miniconda3/envs/lmeval_llama31/lib/python3.11/site-packages/vllm/executor/executor_base.py", line 47, in __init__
[rank0]:     self._init_executor()
[rank0]:   File "/home/eldar/miniconda3/envs/lmeval_llama31/lib/python3.11/site-packages/vllm/executor/gpu_executor.py", line 36, in _init_executor
[rank0]:     self.driver_worker.load_model()
[rank0]:   File "/home/eldar/miniconda3/envs/lmeval_llama31/lib/python3.11/site-packages/vllm/worker/worker.py", line 139, in load_model
[rank0]:     self.model_runner.load_model()
[rank0]:   File "/home/eldar/miniconda3/envs/lmeval_llama31/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 682, in load_model
[rank0]:     self.model = get_model(model_config=self.model_config,
[rank0]:                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/eldar/miniconda3/envs/lmeval_llama31/lib/python3.11/site-packages/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
[rank0]:     return loader.load_model(model_config=model_config,
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/eldar/miniconda3/envs/lmeval_llama31/lib/python3.11/site-packages/vllm/model_executor/model_loader/loader.py", line 280, in load_model
[rank0]:     model = _initialize_model(model_config, self.load_config,
[rank0]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/eldar/miniconda3/envs/lmeval_llama31/lib/python3.11/site-packages/vllm/model_executor/model_loader/loader.py", line 109, in _initialize_model
[rank0]:     quant_config = _get_quantization_config(model_config, load_config)
[rank0]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/eldar/miniconda3/envs/lmeval_llama31/lib/python3.11/site-packages/vllm/model_executor/model_loader/loader.py", line 50, in _get_quantization_config
[rank0]:     quant_config = get_quant_config(model_config, load_config)
[rank0]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/eldar/miniconda3/envs/lmeval_llama31/lib/python3.11/site-packages/vllm/model_executor/model_loader/weight_utils.py", line 130, in get_quant_config
[rank0]:     return quant_cls.from_config(hf_quant_config)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/eldar/miniconda3/envs/lmeval_llama31/lib/python3.11/site-packages/vllm/model_executor/layers/quantization/bitsandbytes.py", line 52, in from_config
[rank0]:     adapter_name = cls.get_from_keys(config, ["adapter_name_or_path"])
[rank0]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/eldar/miniconda3/envs/lmeval_llama31/lib/python3.11/site-packages/vllm/model_executor/layers/quantization/base_config.py", line 87, in get_from_keys
[rank0]:     raise ValueError(f"Cannot find any of {keys} in the model's "
[rank0]: ValueError: Cannot find any of ['adapter_name_or_path'] in the model's quantization config.

I am not sure why vLLM looks for `adapter_name_or_path` when the model is just BNB-quantized to NF4 and no adapter is involved.
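
For reference, here is a simplified reconstruction of the failing path, pieced together from the stack frames above (a sketch only, not the exact vLLM source):

# Sketch of vLLM 0.5.3.post1's bitsandbytes config handling, reconstructed from
# the traceback (base_config.py get_from_keys and bitsandbytes.py from_config).
class BitsAndBytesConfig:

    @classmethod
    def get_from_keys(cls, config: dict, keys: list[str]):
        # Returns the value of the first key that is present; otherwise raises
        # the ValueError shown in the traceback.
        for key in keys:
            if key in config:
                return config[key]
        raise ValueError(f"Cannot find any of {keys} in the model's quantization config.")

    @classmethod
    def from_config(cls, config: dict) -> "BitsAndBytesConfig":
        # The adapter path is looked up unconditionally, so any checkpoint whose
        # quantization_config has no "adapter_name_or_path" key (e.g. a plain
        # pre-quantized BNB NF4 model) fails before any weights are loaded.
        adapter_name = cls.get_from_keys(config, ["adapter_name_or_path"])
        ...

So the loader seems to assume a QLoRA-style setup (quantization config plus adapter path) rather than a stand-alone BNB checkpoint.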
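
To take lm-evaluation-harness out of the picture, the same error should be reproducible with vLLM directly; the log above shows quantization=bitsandbytes being picked up automatically from the checkpoint's config, so no extra flags are passed here (minimal sketch, same arguments as the lm_eval run):

from vllm import LLM

# Same settings as the lm_eval run above; quantization=bitsandbytes is
# auto-detected from the checkpoint's quantization_config.
llm = LLM(
    model="hugging-quants/Meta-Llama-3.1-405B-Instruct-BNB-NF4",
    dtype="auto",
    max_model_len=4096,
    tensor_parallel_size=1,
    gpu_memory_utilization=0.9,
)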

r4dm commented 1 month ago

Same problem

jvlinsta commented 1 month ago

Most likely QLoRA is supported, whereas standard bnb quantization is not?
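
The quantization_config in that checkpoint's config.json is just the usual transformers BitsAndBytesConfig dump, roughly something like this (illustrative, not copied from the repo):

# Roughly what a plain pre-quantized BNB NF4 checkpoint carries in config.json
# (illustrative values); note there is no "adapter_name_or_path" key, which is
# exactly what vLLM's bitsandbytes loader asks for.
quantization_config = {
    "quant_method": "bitsandbytes",
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_use_double_quant": True,
    "bnb_4bit_compute_dtype": "bfloat16",
}

so presumably only the QLoRA flow, where an adapter path is supplied explicitly, ends up populating that key.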