vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: issue with Phi3 mini GPTQ 4Bit/8Bit #6217

Closed · gm3000 closed this issue 4 months ago

gm3000 commented 4 months ago

Your current environment

PyTorch version: N/A
Is debug build: N/A
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: N/A

OS: Amazon Linux 2 (x86_64)
GCC version: (GCC) 7.3.1 20180712 (Red Hat 7.3.1-17)
Clang version: Could not collect
CMake version: version 3.29.0
Libc version: glibc-2.26

Python version: 3.10.8 | packaged by conda-forge | (main, Nov 22 2022, 08:23:14) [GCC 10.4.0] (64-bit runtime)
Python platform: Linux-5.10.215-203.850.amzn2.x86_64-x86_64-with-glibc2.26
Is CUDA available: N/A
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: GPU 0: NVIDIA L4
Nvidia driver version: 535.161.08
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: N/A

CPU:
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  2
Core(s) per socket:  2
Socket(s):           1
NUMA node(s):        1
Vendor ID:           AuthenticAMD
CPU family:          25
Model:               1
Model name:          AMD EPYC 7R13 Processor
Stepping:            1
CPU MHz:             3655.646
BogoMIPS:            5300.00
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            8192K
NUMA node0 CPU(s):   0-3
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save vaes vpclmulqdq rdpid

Versions of relevant libraries:
[pip3] numpy==1.26.4
[conda] numpy                     1.26.4          py310hb13e2d6_0    conda-forge
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: N/A
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      0-3     0               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

🐛 Describe the bug

I quantized Phi-3 mini to GPTQ 4-bit and 8-bit, but neither checkpoint worked with vLLM 0.5.1. The quantization was done with transformers:

from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig
import torch

model_id = "microsoft/Phi-3-mini-4k-instruct"

# 4-bit GPTQ config; the 8-bit checkpoint was produced the same way with bits=8.
quantization_config = GPTQConfig(
    bits=4,
    group_size=128,
    dataset="c4",
    desc_act=False,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
# Quantize during load, calibrating on the "c4" dataset.
quant_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto",
)
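
For completeness, a minimal sketch of how the quantized checkpoint can be persisted and reused (the output directory name is illustrative):

# Illustrative output directory; model and tokenizer are saved together so the
# checkpoint can later be loaded with transformers or vLLM.
save_dir = "phi-3-mini-4k-instruct-gptq-4bit"
quant_model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)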

If the quantized model is loaded with transformers, it works:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

torch.random.manual_seed(0)
# Load a pre-quantized GPTQ checkpoint from the Hub.
model = AutoModelForCausalLM.from_pretrained(
    "kaitchup/Phi-3-mini-4k-instruct-gptq-4bit",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True
)

tokenizer = AutoTokenizer.from_pretrained("kaitchup/Phi-3-mini-4k-instruct-gptq-4bit")
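
For reference, generation with this transformers-loaded model works; a minimal sketch (the prompt is illustrative):

# Minimal sketch: confirm generation works with the transformers-loaded model.
messages = [{"role": "user", "content": "What is the capital of France?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))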

If you take a look inside one of the quantized layers, the GPTQ buffers are all there:

# model.model.layers[0].self_attn.o_proj.__dict__   
{'training': False,
 '_parameters': OrderedDict(),
 '_buffers': OrderedDict([('qweight',
               tensor([[ 1773757815,  1768328279, -1464370249,  ..., -2039838327,
                         2022213767,  2040039560],
                       [-1821869179,  2055439289,  2022094682,  ..., -1734768743,
                        -2004252536, -2005370488],
                       [-1490392886,  1783199093, -1737979752,  ..., -1701213575,
                        -2005305208, -1736939382],
                       ...,
                       [ 2020075661,  1488582826,  1469745272,  ...,  2031857540,
                        -2056668821,  2006354234],
                       [-1468552842,  2011772828,  1251699099,  ..., -1431862614,
                        -2055685992,  1704302774],
                       [-1231591193, -1696096940, -1984251797,  ..., -1969973080,
                         1989630649,  1773565820]], device='cuda:0', dtype=torch.int32)),
              ('qzeros',
               tensor([[2004318071, 2004318071, 2004318071,  ..., 2004318071, 2004318071,
                        2004318071],
                       [2004318071, 2004318071, 2004318071,  ..., 2004318071, 2004318071,
                        2004318071],
                       [2004318071, 2004318071, 2004318071,  ..., 2004318071, 2004318071,
                        2004318071],
                       ...,
                       [2004318071, 2004318071, 2004318071,  ..., 2004318071, 2004318071,
                        2004318071],
                       [2004318071, 2004318071, 2004318071,  ..., 2004318071, 2004318071,
                        2004318071],
                       [2004318071, 2004318071, 2004318071,  ..., 2004318071, 2004318071,
                        2004318071]], device='cuda:0', dtype=torch.int32)),
              ('scales',
               tensor([[0.0036, 0.0041, 0.0069,  ..., 0.0090, 0.0193, 0.0094],
                       [0.0062, 0.0057, 0.0051,  ..., 0.0055, 0.0103, 0.0052],
                       [0.0079, 0.0066, 0.0082,  ..., 0.0058, 0.0221, 0.0192],
                       ...,
                       [0.0082, 0.0104, 0.0082,  ..., 0.0107, 0.0102, 0.0122],
                       [0.0087, 0.0071, 0.0105,  ..., 0.0082, 0.0081, 0.0068],
                       [0.0089, 0.0058, 0.0119,  ..., 0.0086, 0.0095, 0.0078]],
                      device='cuda:0', dtype=torch.float16)),
              ('g_idx',
               tensor([ 0,  0,  0,  ..., 23, 23, 23], device='cuda:0', dtype=torch.int32))]),
 '_non_persistent_buffers_set': set(),
 '_backward_pre_hooks': OrderedDict(),
 '_backward_hooks': OrderedDict(),
 '_is_full_backward_hook': None,
 '_forward_hooks': OrderedDict(),
 '_forward_hooks_with_kwargs': OrderedDict(),
 '_forward_hooks_always_called': OrderedDict(),
 '_forward_pre_hooks': OrderedDict(),
 '_forward_pre_hooks_with_kwargs': OrderedDict(),
 '_state_dict_hooks': OrderedDict(),
 '_state_dict_pre_hooks': OrderedDict(),
 '_load_state_dict_pre_hooks': OrderedDict(),
 '_load_state_dict_post_hooks': OrderedDict(),
 '_modules': OrderedDict(),
 'infeatures': 3072,
 'outfeatures': 3072,
 'bits': 4,
 'group_size': 128,
 'maxq': 15,
 'bias': None,
 'half_indim': 1536,
 'use_cuda_fp16': False,
 'wf': tensor([[ 0,  4,  8, 12, 16, 20, 24, 28]], dtype=torch.int32),
 'kernel_switch_threshold': 128,
 'autogptq_cuda_available': False,
 'autogptq_cuda': None,
 'trainable': False,
 'device': device(type='meta'),
 '_is_hf_initialized': True}

Generation works via model.generate, but loading the same checkpoint with vLLM fails during weight loading:
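
Minimal repro sketch (the 8-bit checkpoint matches the log below; the 4-bit one fails the same way):

from vllm import LLM

# Fails during weight loading with the AssertionError shown in the traceback below.
llm = LLM(model="StefanKrsteski/Phi-3-mini-4k-instruct-GPTQ-8bit", trust_remote_code=True)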

# llm = LLM(model="StefanKrsteski/Phi-3-mini-4k-instruct-GPTQ-8bit", trust_remote_code=True )
config.json:   0%|          | 0.00/1.58k [00:00<?, ?B/s]
INFO 07-08 15:25:48 gptq_marlin.py:141] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
INFO 07-08 15:25:48 llm_engine.py:169] Initializing an LLM engine (v0.5.1) with config: model='StefanKrsteski/Phi-3-mini-4k-instruct-GPTQ-8bit', speculative_config=None, tokenizer='StefanKrsteski/Phi-3-mini-4k-instruct-GPTQ-8bit', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq_marlin, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=StefanKrsteski/Phi-3-mini-4k-instruct-GPTQ-8bit, use_v2_block_manager=False, enable_prefix_caching=False)
tokenizer_config.json:   0%|          | 0.00/3.17k [00:00<?, ?B/s]
tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]
tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]
added_tokens.json:   0%|          | 0.00/293 [00:00<?, ?B/s]
special_tokens_map.json:   0%|          | 0.00/569 [00:00<?, ?B/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
generation_config.json:   0%|          | 0.00/172 [00:00<?, ?B/s]
INFO 07-08 15:25:50 weight_utils.py:218] Using model weights format ['*.safetensors']
model.safetensors:   0%|          | 0.00/4.11G [00:00<?, ?B/s]
INFO 07-08 15:27:30 weight_utils.py:261] No model.safetensors.index.json found in remote.
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[5], line 1
----> 1 llm = LLM(model="StefanKrsteski/Phi-3-mini-4k-instruct-GPTQ-8bit", trust_remote_code=True )

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/vllm/entrypoints/llm.py:149, in LLM.__init__(self, model, tokenizer, tokenizer_mode, skip_tokenizer_init, trust_remote_code, tensor_parallel_size, dtype, quantization, revision, tokenizer_revision, seed, gpu_memory_utilization, swap_space, enforce_eager, max_context_len_to_capture, max_seq_len_to_capture, disable_custom_all_reduce, **kwargs)
    127     raise TypeError(
    128         "There is no need to pass vision-related arguments anymore.")
    129 engine_args = EngineArgs(
    130     model=model,
    131     tokenizer=tokenizer,
   (...)
    147     **kwargs,
    148 )
--> 149 self.llm_engine = LLMEngine.from_engine_args(
    150     engine_args, usage_context=UsageContext.LLM_CLASS)
    151 self.request_counter = Counter()

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/vllm/engine/llm_engine.py:414, in LLMEngine.from_engine_args(cls, engine_args, usage_context)
    411     executor_class = GPUExecutor
    413 # Create the LLM engine.
--> 414 engine = cls(
    415     **engine_config.to_dict(),
    416     executor_class=executor_class,
    417     log_stats=not engine_args.disable_log_stats,
    418     usage_context=usage_context,
    419 )
    420 return engine

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/vllm/engine/llm_engine.py:243, in LLMEngine.__init__(self, model_config, cache_config, parallel_config, scheduler_config, device_config, load_config, lora_config, multimodal_config, speculative_config, decoding_config, observability_config, executor_class, log_stats, usage_context, stat_loggers)
    237 self.generation_config_fields = _load_generation_config_dict(
    238     model_config)
    240 self.input_processor = INPUT_REGISTRY.create_input_processor(
    241     self.model_config)
--> 243 self.model_executor = executor_class(
    244     model_config=model_config,
    245     cache_config=cache_config,
    246     parallel_config=parallel_config,
    247     scheduler_config=scheduler_config,
    248     device_config=device_config,
    249     lora_config=lora_config,
    250     multimodal_config=multimodal_config,
    251     speculative_config=speculative_config,
    252     load_config=load_config,
    253 )
    255 if not self.model_config.embedding_mode:
    256     self._initialize_kv_caches()

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/vllm/executor/executor_base.py:42, in ExecutorBase.__init__(self, model_config, cache_config, parallel_config, scheduler_config, device_config, load_config, lora_config, multimodal_config, speculative_config)
     39 self.multimodal_config = multimodal_config
     40 self.speculative_config = speculative_config
---> 42 self._init_executor()

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/vllm/executor/gpu_executor.py:24, in GPUExecutor._init_executor(self)
     22 self.driver_worker = self._create_worker()
     23 self.driver_worker.init_device()
---> 24 self.driver_worker.load_model()

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/vllm/worker/worker.py:133, in Worker.load_model(self)
    132 def load_model(self):
--> 133     self.model_runner.load_model()

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/vllm/worker/model_runner.py:243, in GPUModelRunnerBase.load_model(self)
    241 def load_model(self) -> None:
    242     with CudaMemoryProfiler() as m:
--> 243         self.model = get_model(
    244             model_config=self.model_config,
    245             device_config=self.device_config,
    246             load_config=self.load_config,
    247             lora_config=self.lora_config,
    248             multimodal_config=self.multimodal_config,
    249             parallel_config=self.parallel_config,
    250             scheduler_config=self.scheduler_config,
    251             cache_config=self.cache_config,
    252         )
    254     self.model_memory_usage = m.consumed_memory
    255     logger.info("Loading model weights took %.4f GB",
    256                 self.model_memory_usage / float(2**30))

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/vllm/model_executor/model_loader/__init__.py:21, in get_model(model_config, load_config, device_config, parallel_config, scheduler_config, lora_config, multimodal_config, cache_config)
     14 def get_model(*, model_config: ModelConfig, load_config: LoadConfig,
     15               device_config: DeviceConfig, parallel_config: ParallelConfig,
     16               scheduler_config: SchedulerConfig,
     17               lora_config: Optional[LoRAConfig],
     18               multimodal_config: Optional[MultiModalConfig],
     19               cache_config: CacheConfig) -> nn.Module:
     20     loader = get_model_loader(load_config)
---> 21     return loader.load_model(model_config=model_config,
     22                              device_config=device_config,
     23                              lora_config=lora_config,
     24                              multimodal_config=multimodal_config,
     25                              parallel_config=parallel_config,
     26                              scheduler_config=scheduler_config,
     27                              cache_config=cache_config)

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py:270, in DefaultModelLoader.load_model(self, model_config, device_config, lora_config, multimodal_config, parallel_config, scheduler_config, cache_config)
    266 with torch.device(device_config.device):
    267     model = _initialize_model(model_config, self.load_config,
    268                               lora_config, multimodal_config,
    269                               cache_config)
--> 270 model.load_weights(
    271     self._get_weights_iterator(model_config.model,
    272                                model_config.revision,
    273                                fall_back_to_pt=getattr(
    274                                    model,
    275                                    "fall_back_to_pt_during_load",
    276                                    True)), )
    278 for _, module in model.named_modules():
    279     quant_method = getattr(module, "quant_method", None)

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/vllm/model_executor/models/llama.py:486, in LlamaForCausalLM.load_weights(self, weights)
    483     param = params_dict[name]
    484     weight_loader = getattr(param, "weight_loader",
    485                             default_weight_loader)
--> 486     weight_loader(param, loaded_weight)
    487 except KeyError:
    488     pass

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/vllm/model_executor/layers/linear.py:391, in MergedColumnParallelLinear.weight_loader(self, param, loaded_weight, loaded_shard_id)
    389 if output_dim is None:
    390     if needs_scalar_to_array is not None:
--> 391         param_data, loaded_weight = adjust_scalar_to_fused_array(
    392             param_data, loaded_weight, 0)
    394     assert param_data.shape == loaded_weight.shape
    395     param_data.copy_(loaded_weight)

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/vllm/model_executor/layers/linear.py:61, in adjust_scalar_to_fused_array(param, loaded_weight, shard_id)
     58 # AutoFP8 scales do not have a shape
     59 # compressed-tensors scales do have a shape
     60 if len(loaded_weight.shape) != 0:
---> 61     assert loaded_weight.shape[0] == 1
     62     loaded_weight = loaded_weight[0]
     64 return param[shard_id], loaded_weight

AssertionError: 

Is this caused by the missing model.safetensors.index.json, is this a bug, or am I using it the wrong way?

robertgshaw2-neuralmagic commented 4 months ago

This is a bug - I will put up a patch

robertgshaw2-neuralmagic commented 4 months ago

Fixed by https://github.com/vllm-project/vllm/pull/6238