vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Loading GPTQ-quantized GPTBigCode fails in weight_loader_v2 of gptq_marlin #8116

Closed: maxdebayser closed this issue 2 months ago

maxdebayser commented 2 months ago

Your current environment

The output of `python collect_env.py`:

```text
Collecting environment information...
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Red Hat Enterprise Linux 9.4 (Plow) (x86_64)
GCC version: (GCC) 11.4.1 20231218 (Red Hat 11.4.1-3)
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.34

Python version: 3.11.7 (main, Jul 4 2024, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)] (64-bit runtime)
Python platform: Linux-4.18.0-372.46.1.el8_6.x86_64-x86_64-with-glibc2.34
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA A100-SXM4-80GB
Nvidia driver version: 535.104.05
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 80
On-line CPU(s) list: 0-79
Vendor ID: GenuineIntel
Model name: Intel Xeon Processor (Icelake)
CPU family: 6
Model: 134
Thread(s) per core: 2
Core(s) per socket: 20
Socket(s): 2
Stepping: 0
BogoMIPS: 5600.05
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves wbnoinvd arat avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid fsrm md_clear arch_capabilities
Virtualization: VT-x
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 2.5 MiB (80 instances)
L1i cache: 2.5 MiB (80 instances)
L2 cache: 160 MiB (40 instances)
L3 cache: 32 MiB (2 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0-39
NUMA node1 CPU(s): 40-79
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Retbleed: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] flashinfer==0.1.2+cu121torch2.4
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.6.20
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pyzmq==26.2.0
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.44.2
[pip3] triton==3.0.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.5@09c7792610ada9f88bbf87d32b472dd44bf23cc2
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
        GPU0    NIC0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      PIX     0-39            0               N/A
NIC0    PIX      X

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:
  NIC0: mlx5_0
```

🐛 Describe the bug

When loading a GPTBigCode model that has been quantized with GPTQ, loading fails and prints this stack trace:

  File "/home/develop/.local/lib/python3.11/site-packages/vllm/model_executor/models/gpt_bigcode.py", line 356, in load_weights
    weight_loader(param, loaded_weight)
  File "/home/develop/.local/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 779, in weight_loader_v2
    self._load_fused_module_from_checkpoint(param, loaded_weight)
  File "/home/develop/.local/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 762, in _load_fused_module_from_checkpoint
    loaded_weight_shard = loaded_weight.narrow(param.output_dim,

The problem is that the gptq_marlin kernel is used ("The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel." appears in the log), and this kernel has used the new vLLMParameter classes since https://github.com/vllm-project/vllm/pull/7281.

Forcing the use of gptq instead of gptq_marlin with --quantization gptq allows us to load and run the model correctly, because the equivalent change in GPTQ hasn't been merged yet (https://github.com/vllm-project/vllm/pull/7976). However, someone else on our team tested that PR and got similar results.
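
For completeness, the same workaround can be expressed through the offline API. This is only a sketch: quantization="gptq" is the documented equivalent of --quantization gptq, while the model name and prompt are just examples (the model is the one used for reproduction further down in this thread).

```python
# Sketch of the workaround: force the plain GPTQ kernel instead of gptq_marlin.
# This is the offline equivalent of `vllm serve <model> --quantization gptq`.
# The model name and prompt are examples only.
from vllm import LLM, SamplingParams

llm = LLM(model="TheBloke/sqlcoder2-GPTQ", quantization="gptq")
outputs = llm.generate(["SELECT"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```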

The first parameter that fails to load is transformer.h.0.attn.c_attn.g_idx.
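
A quick way to see why this particular tensor trips the fused loading path is to inspect the checkpoint shapes directly. The sketch below assumes a locally downloaded model.safetensors and the usual GPTQ tensor names; only the g_idx name comes from the error above.

```python
# Inspect checkpoint shapes: qweight and scales are 2-D (they have an output
# dimension that can be split into q/k/v shards), while g_idx is 1-D (one group
# index per input channel), so there is no output dimension to narrow along.
# The file path and the qweight/scales names are assumptions.
from safetensors import safe_open

with safe_open("model.safetensors", framework="pt") as f:
    for name in ("transformer.h.0.attn.c_attn.qweight",
                 "transformer.h.0.attn.c_attn.scales",
                 "transformer.h.0.attn.c_attn.g_idx"):
        print(name, tuple(f.get_tensor(name).shape))
```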

I've tried adding

```python
elif type(param) is RowvLLMParameter:
    param.load_merged_column_weight(loaded_weight=loaded_weight)
    return
```

in QKVParallelLinear.weight_loader_v2(), and that makes the problem go away, but I suspect that this isn't the correct fix. I'd appreciate some guidance so that I can open a proper PR for this problem.
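
To illustrate why that branch makes a difference, here is a minimal, self-contained sketch of the mechanism using stand-in classes (these are not vLLM's real parameter classes). A row-only parameter such as g_idx carries an input_dim but no output_dim, so the default fused path, which narrows along param.output_dim, raises an AttributeError; a type check like the one above loads the tensor directly instead.

```python
# Illustrative stand-ins only, not vLLM's real classes: the default fused path
# narrows the checkpoint tensor along param.output_dim, which a row-only
# parameter (e.g. g_idx) does not have; the type-check branch copies it instead.
import torch

class FakeRowParameter:
    """Stand-in for a row-sharded parameter such as g_idx (assumption)."""
    def __init__(self, data: torch.Tensor, input_dim: int):
        self.data = data
        self.input_dim = input_dim  # note: no output_dim attribute

    def load_merged_column_weight(self, loaded_weight: torch.Tensor) -> None:
        # g_idx is shared by the fused q/k/v projections, so copy it through.
        self.data.copy_(loaded_weight)

def load_fused(param, loaded_weight: torch.Tensor) -> None:
    if type(param) is FakeRowParameter:
        # Workaround branch: no output shards to split, load directly.
        param.load_merged_column_weight(loaded_weight)
        return
    # Default fused path: fails here for row-only parameters, because
    # param.output_dim does not exist (AttributeError).
    loaded_weight.narrow(param.output_dim, 0, loaded_weight.shape[0])

g_idx = FakeRowParameter(torch.zeros(4096, dtype=torch.int32), input_dim=0)
load_fused(g_idx, torch.arange(4096, dtype=torch.int32))  # succeeds with the branch
```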

maxdebayser commented 2 months ago

@dsikka @mgoin

mgoin commented 2 months ago

Thanks for reporting, @maxdebayser! We'll look into this.

For now, I was able to replicate this with a gpt_bigcode model:

```
vllm serve TheBloke/sqlcoder2-GPTQ
```

```text
INFO 09-03 13:51:54 api_server.py:440] vLLM API server version 0.5.5
INFO 09-03 13:51:54 api_server.py:441] args: Namespace(model_tag='TheBloke/sqlcoder2-GPTQ', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, model='TheBloke/sqlcoder2-GPTQ', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, engine_use_ray=False, disable_log_requests=False, max_log_len=None, dispatch_function=<function serve at 0x7fade64ef2e0>)
config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.44k/1.44k [00:00<00:00, 16.0MB/s]
INFO 09-03 13:51:54 gptq_marlin.py:102] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
INFO 09-03 13:51:54 api_server.py:144] Multiprocessing frontend to use ipc:///tmp/a4a90c49-10b8-4023-a6f8-758020d5e505 for RPC Path.
INFO 09-03 13:51:54 api_server.py:161] Started engine process with PID 3976355
INFO 09-03 13:51:57 gptq_marlin.py:102] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
INFO 09-03 13:51:57 llm_engine.py:210] Initializing an LLM engine (v0.5.5) with config: model='TheBloke/sqlcoder2-GPTQ', speculative_config=None, tokenizer='TheBloke/sqlcoder2-GPTQ', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq_marlin, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=TheBloke/sqlcoder2-GPTQ, use_v2_block_manager=False, num_scheduler_steps=1, enable_prefix_caching=False, use_async_output_proc=True)
tokenizer_config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.04k/4.04k [00:00<00:00, 52.0MB/s]
vocab.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 777k/777k [00:00<00:00, 6.19MB/s]
merges.txt: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 442k/442k [00:00<00:00, 4.53MB/s]
tokenizer.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.06M/2.06M [00:00<00:00, 13.0MB/s]
special_tokens_map.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 532/532 [00:00<00:00, 9.34MB/s]
generation_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 116/116 [00:00<00:00, 1.90MB/s]
INFO 09-03 13:52:00 model_runner.py:906] Starting to load model TheBloke/sqlcoder2-GPTQ...
INFO 09-03 13:52:00 weight_utils.py:236] Using model weights format ['*.safetensors']
model.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.20G/9.20G [02:55<00:00, 52.5MB/s]
INFO 09-03 13:54:56 weight_utils.py:280] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/mgoin/code/vllm/vllm/entrypoints/openai/rpc/server.py", line 230, in run_rpc_server
    server = AsyncEngineRPCServer(async_engine_args, usage_context, rpc_path)
  File "/home/mgoin/code/vllm/vllm/entrypoints/openai/rpc/server.py", line 31, in __init__
    self.engine = AsyncLLMEngine.from_engine_args(
  File "/home/mgoin/code/vllm/vllm/engine/async_llm_engine.py", line 726, in from_engine_args
    engine = cls(
  File "/home/mgoin/code/vllm/vllm/engine/async_llm_engine.py", line 617, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/home/mgoin/code/vllm/vllm/engine/async_llm_engine.py", line 826, in _init_engine
    return engine_class(*args, **kwargs)
  File "/home/mgoin/code/vllm/vllm/engine/async_llm_engine.py", line 261, in __init__
    super().__init__(*args, **kwargs)
  File "/home/mgoin/code/vllm/vllm/engine/llm_engine.py", line 300, in __init__
    self.model_executor = executor_class(
  File "/home/mgoin/code/vllm/vllm/executor/executor_base.py", line 46, in __init__
    self._init_executor()
  File "/home/mgoin/code/vllm/vllm/executor/gpu_executor.py", line 39, in _init_executor
    self.driver_worker.load_model()
  File "/home/mgoin/code/vllm/vllm/worker/worker.py", line 182, in load_model
    self.model_runner.load_model()
  File "/home/mgoin/code/vllm/vllm/worker/model_runner.py", line 908, in load_model
    self.model = get_model(model_config=self.model_config,
  File "/home/mgoin/code/vllm/vllm/model_executor/model_loader/__init__.py", line 19, in get_model
    return loader.load_model(model_config=model_config,
  File "/home/mgoin/code/vllm/vllm/model_executor/model_loader/loader.py", line 344, in load_model
    model.load_weights(
  File "/home/mgoin/code/vllm/vllm/model_executor/models/gpt_bigcode.py", line 323, in load_weights
    weight_loader(param, loaded_weight)
  File "/home/mgoin/code/vllm/vllm/model_executor/layers/linear.py", line 748, in weight_loader_v2
    self._load_fused_module_from_checkpoint(param, loaded_weight)
  File "/home/mgoin/code/vllm/vllm/model_executor/layers/linear.py", line 731, in _load_fused_module_from_checkpoint
    loaded_weight_shard = loaded_weight.narrow(param.output_dim,
AttributeError: 'RowvLLMParameter' object has no attribute 'output_dim'
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]

ERROR 09-03 13:54:59 api_server.py:171] RPCServer process died before responding to readiness probe
```

prashantgupta24 commented 2 months ago

It seems like this was fixed by https://github.com/vllm-project/vllm/pull/7976.

mgoin commented 2 months ago

I think this makes sense, and it's what we hope for during this weight-loading refactor! Thanks for testing.

maxdebayser commented 2 months ago

Thanks for fixing this so quickly, @dsikka