Open · eldarkurtic opened 1 month ago
Your current environment

The output of `python collect_env.py`:

```text
Collecting environment information...
PyTorch version: 2.3.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.30.1
Libc version: glibc-2.35

Python version: 3.11.9 (main, Apr 19 2024, 16:48:06) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-100-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.3.107
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA H100 80GB HBM3
GPU 1: NVIDIA H100 80GB HBM3
GPU 2: NVIDIA H100 80GB HBM3
GPU 3: NVIDIA H100 80GB HBM3
GPU 4: NVIDIA H100 80GB HBM3
GPU 5: NVIDIA H100 80GB HBM3
GPU 6: NVIDIA H100 80GB HBM3
GPU 7: NVIDIA H100 80GB HBM3

Nvidia driver version: 545.23.08
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 52 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 384
On-line CPU(s) list: 0-383
Vendor ID: AuthenticAMD
Model name: AMD EPYC 9654 96-Core Processor
CPU family: 25
Model: 17
Thread(s) per core: 2
Core(s) per socket: 96
Socket(s): 2
Stepping: 1
Frequency boost: enabled
CPU max MHz: 3707.8120
CPU min MHz: 1500.0000
BogoMIPS: 4800.14
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid overflow_recov succor smca fsrm flush_l1d
Virtualization: AMD-V
L1d cache: 6 MiB (192 instances)
L1i cache: 6 MiB (192 instances)
L2 cache: 192 MiB (192 instances)
L3 cache: 768 MiB (24 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0-95,192-287
NUMA node1 CPU(s): 96-191,288-383
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Mitigation; safe RET
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] torch==2.3.1
[pip3] torchvision==0.18.1
[pip3] transformers==4.43.2
[pip3] triton==2.3.1
[conda] numpy 1.26.4 pypi_0 pypi
[conda] nvidia-nccl-cu12 2.20.5 pypi_0 pypi
[conda] torch 2.3.1 pypi_0 pypi
[conda] torchvision 0.18.1 pypi_0 pypi
[conda] transformers 4.43.2 pypi_0 pypi
[conda] triton 2.3.1 pypi_0 pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.3.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
      GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  NIC0  NIC1  NIC2  NIC3  NIC4  NIC5  NIC6  NIC7  CPU Affinity    NUMA Affinity  GPU NUMA ID
GPU0   X    NV18  NV18  NV18  NV18  NV18  NV18  NV18  SYS   SYS   PIX   SYS   SYS   SYS   SYS   SYS   0-95,192-287    0              N/A
GPU1  NV18   X    NV18  NV18  NV18  NV18  NV18  NV18  SYS   SYS   SYS   PIX   SYS   SYS   SYS   SYS   0-95,192-287    0              N/A
GPU2  NV18  NV18   X    NV18  NV18  NV18  NV18  NV18  SYS   PIX   SYS   SYS   SYS   SYS   SYS   SYS   0-95,192-287    0              N/A
GPU3  NV18  NV18  NV18   X    NV18  NV18  NV18  NV18  PIX   SYS   SYS   SYS   SYS   SYS   SYS   SYS   0-95,192-287    0              N/A
GPU4  NV18  NV18  NV18  NV18   X    NV18  NV18  NV18  SYS   SYS   SYS   SYS   SYS   SYS   PIX   SYS   96-191,288-383  1              N/A
GPU5  NV18  NV18  NV18  NV18  NV18   X    NV18  NV18  SYS   SYS   SYS   SYS   SYS   SYS   SYS   PIX   96-191,288-383  1              N/A
GPU6  NV18  NV18  NV18  NV18  NV18  NV18   X    NV18  SYS   SYS   SYS   SYS   SYS   PIX   SYS   SYS   96-191,288-383  1              N/A
GPU7  NV18  NV18  NV18  NV18  NV18  NV18  NV18   X    SYS   SYS   SYS   SYS   PIX   SYS   SYS   SYS   96-191,288-383  1              N/A
NIC0  SYS   SYS   SYS   PIX   SYS   SYS   SYS   SYS    X    SYS   SYS   SYS   SYS   SYS   SYS   SYS
NIC1  SYS   SYS   PIX   SYS   SYS   SYS   SYS   SYS   SYS    X    SYS   SYS   SYS   SYS   SYS   SYS
NIC2  PIX   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS    X    SYS   SYS   SYS   SYS   SYS
NIC3  SYS   PIX   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS    X    SYS   SYS   SYS   SYS
NIC4  SYS   SYS   SYS   SYS   SYS   SYS   SYS   PIX   SYS   SYS   SYS   SYS    X    SYS   SYS   SYS
NIC5  SYS   SYS   SYS   SYS   SYS   SYS   PIX   SYS   SYS   SYS   SYS   SYS   SYS    X    SYS   SYS
NIC6  SYS   SYS   SYS   SYS   PIX   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS    X    SYS
NIC7  SYS   SYS   SYS   SYS   SYS   PIX   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS    X

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:
  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5
  NIC6: mlx5_6
  NIC7: mlx5_7
```
🐛 Describe the bug

I am trying to evaluate a BNB model (https://huggingface.co/hugging-quants/Meta-Llama-3.1-405B-Instruct-BNB-NF4) through `lm-evaluation-harness` with `vllm`. This is the command I am running:
```shell
lm_eval \
  --model vllm \
  --model_args pretrained="hugging-quants/Meta-Llama-3.1-405B-Instruct-BNB-NF4",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.9 \
  --tasks winogrande \
  --num_fewshot 5 \
  --batch_size 1
```
I am seeing the following error, which I think is related to vllm:
```text
WARNING 07-27 13:06:47 config.py:246] bitsandbytes quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 07-27 13:06:47 llm_engine.py:176] Initializing an LLM engine (v0.5.3.post1) with config: model='/home/meta-llama/Meta-Llama-3.1-405B-Instruct-BNB-NF4', speculative_config=None, tokenizer='/home/meta-llama/Meta-Llama-3.1-405B-Instruct-BNB-NF4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=bitsandbytes, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=1234, served_model_name=/home/meta-llama/Meta-Llama-3.1-405B-Instruct-BNB-NF4, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 07-27 13:06:51 model_runner.py:680] Starting to load model /home/meta-llama/Meta-Llama-3.1-405B-Instruct-BNB-NF4...
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/eldar/miniconda3/envs/lmeval_llama31/bin/lm_eval", line 8, in <module>
[rank0]:     sys.exit(cli_evaluate())
[rank0]:              ^^^^^^^^^^^^^^
[rank0]:   File "/home/eldar/github/neuralmagic/lm-evaluation-harness/lm_eval/__main__.py", line 382, in cli_evaluate
[rank0]:     results = evaluator.simple_evaluate(
[rank0]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/eldar/github/neuralmagic/lm-evaluation-harness/lm_eval/utils.py", line 397, in _wrapper
[rank0]:     return fn(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/eldar/github/neuralmagic/lm-evaluation-harness/lm_eval/evaluator.py", line 198, in simple_evaluate
[rank0]:     lm = lm_eval.api.registry.get_model(model).create_from_arg_string(
[rank0]:          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/eldar/github/neuralmagic/lm-evaluation-harness/lm_eval/api/model.py", line 147, in create_from_arg_string
[rank0]:     return cls(**args, **args2)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/eldar/github/neuralmagic/lm-evaluation-harness/lm_eval/models/vllm_causallms.py", line 103, in __init__
[rank0]:     self.model = LLM(**self.model_args)
[rank0]:                  ^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/eldar/miniconda3/envs/lmeval_llama31/lib/python3.11/site-packages/vllm/entrypoints/llm.py", line 155, in __init__
[rank0]:     self.llm_engine = LLMEngine.from_engine_args(
[rank0]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/eldar/miniconda3/envs/lmeval_llama31/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 441, in from_engine_args
[rank0]:     engine = cls(
[rank0]:              ^^^^
[rank0]:   File "/home/eldar/miniconda3/envs/lmeval_llama31/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 251, in __init__
[rank0]:     self.model_executor = executor_class(
[rank0]:                           ^^^^^^^^^^^^^^^
[rank0]:   File "/home/eldar/miniconda3/envs/lmeval_llama31/lib/python3.11/site-packages/vllm/executor/executor_base.py", line 47, in __init__
[rank0]:     self._init_executor()
[rank0]:   File "/home/eldar/miniconda3/envs/lmeval_llama31/lib/python3.11/site-packages/vllm/executor/gpu_executor.py", line 36, in _init_executor
[rank0]:     self.driver_worker.load_model()
[rank0]:   File "/home/eldar/miniconda3/envs/lmeval_llama31/lib/python3.11/site-packages/vllm/worker/worker.py", line 139, in load_model
[rank0]:     self.model_runner.load_model()
[rank0]:   File "/home/eldar/miniconda3/envs/lmeval_llama31/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 682, in load_model
[rank0]:     self.model = get_model(model_config=self.model_config,
[rank0]:                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/eldar/miniconda3/envs/lmeval_llama31/lib/python3.11/site-packages/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
[rank0]:     return loader.load_model(model_config=model_config,
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/eldar/miniconda3/envs/lmeval_llama31/lib/python3.11/site-packages/vllm/model_executor/model_loader/loader.py", line 280, in load_model
[rank0]:     model = _initialize_model(model_config, self.load_config,
[rank0]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/eldar/miniconda3/envs/lmeval_llama31/lib/python3.11/site-packages/vllm/model_executor/model_loader/loader.py", line 109, in _initialize_model
[rank0]:     quant_config = _get_quantization_config(model_config, load_config)
[rank0]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/eldar/miniconda3/envs/lmeval_llama31/lib/python3.11/site-packages/vllm/model_executor/model_loader/loader.py", line 50, in _get_quantization_config
[rank0]:     quant_config = get_quant_config(model_config, load_config)
[rank0]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/eldar/miniconda3/envs/lmeval_llama31/lib/python3.11/site-packages/vllm/model_executor/model_loader/weight_utils.py", line 130, in get_quant_config
[rank0]:     return quant_cls.from_config(hf_quant_config)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/eldar/miniconda3/envs/lmeval_llama31/lib/python3.11/site-packages/vllm/model_executor/layers/quantization/bitsandbytes.py", line 52, in from_config
[rank0]:     adapter_name = cls.get_from_keys(config, ["adapter_name_or_path"])
[rank0]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/eldar/miniconda3/envs/lmeval_llama31/lib/python3.11/site-packages/vllm/model_executor/layers/quantization/base_config.py", line 87, in get_from_keys
[rank0]:     raise ValueError(f"Cannot find any of {keys} in the model's "
[rank0]: ValueError: Cannot find any of ['adapter_name_or_path'] in the model's quantization config.
```
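For what it's worth, the traceback enters vllm at `LLM(**self.model_args)`, so the failure can presumably be reproduced without the harness. A minimal sketch (the model name is from this issue; the remaining arguments mirror the command above):

```python
# Minimal repro sketch that bypasses lm-evaluation-harness. vllm auto-detects
# quantization=bitsandbytes from the checkpoint's config (see the INFO log
# above), so this should hit the same load path and fail the same way.
from vllm import LLM

llm = LLM(
    model="hugging-quants/Meta-Llama-3.1-405B-Instruct-BNB-NF4",
    dtype="auto",
    max_model_len=4096,
    tensor_parallel_size=1,
    gpu_memory_utilization=0.9,
)
# Expected failure during engine init:
# ValueError: Cannot find any of ['adapter_name_or_path'] in the model's quantization config.
```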
I am not sure why vllm looks for `adapter_name_or_path` when the model is simply BNB-quantized to NF4.
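For reference, the quantization config that vllm reads can be inspected directly from the repo. A sketch, assuming the checkpoint carries the usual transformers `BitsAndBytesConfig` serialization in its `config.json`:

```python
# Print the quantization_config dict that vllm's get_quant_config() hands to
# BitsAndBytesConfig.from_config(). The keys in the comment below are what a
# typical transformers BNB NF4 export contains (an assumption, not verified
# against this exact repo).
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("hugging-quants/Meta-Llama-3.1-405B-Instruct-BNB-NF4")
print(cfg.quantization_config)
# Typically something like:
#   {'quant_method': 'bitsandbytes', 'load_in_4bit': True,
#    'bnb_4bit_quant_type': 'nf4', 'bnb_4bit_compute_dtype': 'bfloat16', ...}
# There is no 'adapter_name_or_path' key, which is exactly what from_config()
# in vllm/model_executor/layers/quantization/bitsandbytes.py fails to find.
```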
Same problem
Most likely QLoRA checkpoints are supported, whereas standard BNB quantization is not?
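If that is the case, one possible workaround (a sketch, untested; it assumes this vllm version's in-flight bitsandbytes path, i.e. `quantization="bitsandbytes"` combined with `load_format="bitsandbytes"`) would be to let vllm quantize the unquantized base weights at load time instead of reading the pre-quantized BNB checkpoint:

```python
# Workaround sketch (untested assumption): ask vllm to do in-flight
# bitsandbytes quantization of the original bf16 weights rather than
# loading the pre-quantized BNB-NF4 repo.
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-405B-Instruct",  # unquantized base model
    quantization="bitsandbytes",
    load_format="bitsandbytes",
    max_model_len=4096,
)
```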