Closed · gm3000 closed 4 months ago
Your current environment

```
PyTorch version: N/A
Is debug build: N/A
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: N/A

OS: Amazon Linux 2 (x86_64)
GCC version: (GCC) 7.3.1 20180712 (Red Hat 7.3.1-17)
Clang version: Could not collect
CMake version: version 3.29.0
Libc version: glibc-2.26

Python version: 3.10.8 | packaged by conda-forge | (main, Nov 22 2022, 08:23:14) [GCC 10.4.0] (64-bit runtime)
Python platform: Linux-5.10.215-203.850.amzn2.x86_64-x86_64-with-glibc2.26
Is CUDA available: N/A
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: GPU 0: NVIDIA L4
Nvidia driver version: 535.161.08
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: N/A

CPU:
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  2
Core(s) per socket:  2
Socket(s):           1
NUMA node(s):        1
Vendor ID:           AuthenticAMD
CPU family:          25
Model:               1
Model name:          AMD EPYC 7R13 Processor
Stepping:            1
CPU MHz:             3655.646
BogoMIPS:            5300.00
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            8192K
NUMA node0 CPU(s):   0-3
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save vaes vpclmulqdq rdpid

Versions of relevant libraries:
[pip3] numpy==1.26.4
[conda] numpy 1.26.4 py310hb13e2d6_0 conda-forge
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: N/A
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
        GPU0  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0     X    0-3           0              N/A

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
```
🐛 Describe the bug

I quantized Phi-3 mini with GPTQ at both 4-bit and 8-bit, but neither checkpoint works with vLLM 0.5.1. Quantization code:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig
import torch

model_id = "microsoft/Phi-3-mini-4k-instruct"

quantization_config = GPTQConfig(
    bits=4,
    group_size=128,
    dataset="c4",
    desc_act=False,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
quant_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto",
)
```
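The quantized checkpoint would then typically be saved before loading it elsewhere. A minimal sketch (the output directory name is a placeholder, not from the original report):

```python
# Sketch: persist the quantized model and tokenizer so they can be loaded
# from a local path or pushed to the Hub. Directory name is a placeholder.
save_dir = "phi3-mini-4k-instruct-gptq-4bit"
quant_model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)
```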
If I load the quantized model with transformers instead:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

torch.random.manual_seed(0)

model = AutoModelForCausalLM.from_pretrained(
    "kaitchup/Phi-3-mini-4k-instruct-gptq-4bit",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("kaitchup/Phi-3-mini-4k-instruct-gptq-4bit")
```
If you take a look inside one of the quantized layers:
```python
# model.model.layers[0].self_attn.o_proj.__dict__
{'training': False,
 '_parameters': OrderedDict(),
 '_buffers': OrderedDict([
     ('qweight',
      tensor([[ 1773757815,  1768328279, -1464370249, ..., -2039838327,  2022213767,  2040039560],
              [-1821869179,  2055439289,  2022094682, ..., -1734768743, -2004252536, -2005370488],
              [-1490392886,  1783199093, -1737979752, ..., -1701213575, -2005305208, -1736939382],
              ...,
              [ 2020075661,  1488582826,  1469745272, ...,  2031857540, -2056668821,  2006354234],
              [-1468552842,  2011772828,  1251699099, ..., -1431862614, -2055685992,  1704302774],
              [-1231591193, -1696096940, -1984251797, ..., -1969973080,  1989630649,  1773565820]],
             device='cuda:0', dtype=torch.int32)),
     ('qzeros',
      tensor([[2004318071, 2004318071, 2004318071, ..., 2004318071, 2004318071, 2004318071],
              [2004318071, 2004318071, 2004318071, ..., 2004318071, 2004318071, 2004318071],
              [2004318071, 2004318071, 2004318071, ..., 2004318071, 2004318071, 2004318071],
              ...,
              [2004318071, 2004318071, 2004318071, ..., 2004318071, 2004318071, 2004318071],
              [2004318071, 2004318071, 2004318071, ..., 2004318071, 2004318071, 2004318071],
              [2004318071, 2004318071, 2004318071, ..., 2004318071, 2004318071, 2004318071]],
             device='cuda:0', dtype=torch.int32)),
     ('scales',
      tensor([[0.0036, 0.0041, 0.0069, ..., 0.0090, 0.0193, 0.0094],
              [0.0062, 0.0057, 0.0051, ..., 0.0055, 0.0103, 0.0052],
              [0.0079, 0.0066, 0.0082, ..., 0.0058, 0.0221, 0.0192],
              ...,
              [0.0082, 0.0104, 0.0082, ..., 0.0107, 0.0102, 0.0122],
              [0.0087, 0.0071, 0.0105, ..., 0.0082, 0.0081, 0.0068],
              [0.0089, 0.0058, 0.0119, ..., 0.0086, 0.0095, 0.0078]],
             device='cuda:0', dtype=torch.float16)),
     ('g_idx', tensor([ 0,  0,  0, ..., 23, 23, 23], device='cuda:0', dtype=torch.int32))]),
 '_non_persistent_buffers_set': set(),
 '_backward_pre_hooks': OrderedDict(),
 '_backward_hooks': OrderedDict(),
 '_is_full_backward_hook': None,
 '_forward_hooks': OrderedDict(),
 '_forward_hooks_with_kwargs': OrderedDict(),
 '_forward_hooks_always_called': OrderedDict(),
 '_forward_pre_hooks': OrderedDict(),
 '_forward_pre_hooks_with_kwargs': OrderedDict(),
 '_state_dict_hooks': OrderedDict(),
 '_state_dict_pre_hooks': OrderedDict(),
 '_load_state_dict_pre_hooks': OrderedDict(),
 '_load_state_dict_post_hooks': OrderedDict(),
 '_modules': OrderedDict(),
 'infeatures': 3072,
 'outfeatures': 3072,
 'bits': 4,
 'group_size': 128,
 'maxq': 15,
 'bias': None,
 'half_indim': 1536,
 'use_cuda_fp16': False,
 'wf': tensor([[ 0,  4,  8, 12, 16, 20, 24, 28]], dtype=torch.int32),
 'kernel_switch_threshold': 128,
 'autogptq_cuda_available': False,
 'autogptq_cuda': None,
 'trainable': False,
 'device': device(type='meta'),
 '_is_hf_initialized': True}
```
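For orientation, `qweight` here uses the usual GPTQ int32 packing: each 32-bit element holds eight 4-bit weights at the bit offsets listed in `wf` (0, 4, ..., 28). A minimal unpacking sketch, my own illustration rather than code from AutoGPTQ or vLLM:

```python
import torch

def unpack_int4(qweight: torch.Tensor) -> torch.Tensor:
    # Illustrative only: recover the eight 4-bit values packed into each
    # int32 of `qweight`, at the bit offsets 0, 4, ..., 28 seen in `wf`.
    shifts = torch.arange(0, 32, 4, dtype=torch.int32, device=qweight.device)
    # (rows, cols) -> (rows, 8, cols): shift every element by each offset,
    # then mask to the low 4 bits of each shifted copy.
    unpacked = (qweight.unsqueeze(1) >> shifts.view(1, 8, 1)) & 0xF
    # Flatten so dim 0 becomes infeatures (rows * 8 = 3072 for this layer).
    return unpacked.reshape(-1, qweight.shape[1])
```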
Generation via model.generate worked, but loading a GPTQ Phi-3 checkpoint with vLLM 0.5.1 fails during weight loading (full log below).
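For completeness, a generation call along these lines is what succeeded (a minimal sketch; the prompt is illustrative, not from the original report):

```python
# Sketch of the transformers path that worked, using the model and
# tokenizer loaded above. The prompt text is illustrative.
inputs = tokenizer("What is the capital of France?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```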
```python
llm = LLM(model="StefanKrsteski/Phi-3-mini-4k-instruct-GPTQ-8bit", trust_remote_code=True)
```

```
config.json: 0%| | 0.00/1.58k [00:00<?, ?B/s]
INFO 07-08 15:25:48 gptq_marlin.py:141] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
INFO 07-08 15:25:48 llm_engine.py:169] Initializing an LLM engine (v0.5.1) with config: model='StefanKrsteski/Phi-3-mini-4k-instruct-GPTQ-8bit', speculative_config=None, tokenizer='StefanKrsteski/Phi-3-mini-4k-instruct-GPTQ-8bit', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq_marlin, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=StefanKrsteski/Phi-3-mini-4k-instruct-GPTQ-8bit, use_v2_block_manager=False, enable_prefix_caching=False)
tokenizer_config.json: 0%| | 0.00/3.17k [00:00<?, ?B/s]
tokenizer.model: 0%| | 0.00/500k [00:00<?, ?B/s]
tokenizer.json: 0%| | 0.00/1.84M [00:00<?, ?B/s]
added_tokens.json: 0%| | 0.00/293 [00:00<?, ?B/s]
special_tokens_map.json: 0%| | 0.00/569 [00:00<?, ?B/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
generation_config.json: 0%| | 0.00/172 [00:00<?, ?B/s]
INFO 07-08 15:25:50 weight_utils.py:218] Using model weights format ['*.safetensors']
model.safetensors: 0%| | 0.00/4.11G [00:00<?, ?B/s]
INFO 07-08 15:27:30 weight_utils.py:261] No model.safetensors.index.json found in remote.
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[5], line 1
----> 1 llm = LLM(model="StefanKrsteski/Phi-3-mini-4k-instruct-GPTQ-8bit", trust_remote_code=True )

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/vllm/entrypoints/llm.py:149, in LLM.__init__(self, model, tokenizer, tokenizer_mode, skip_tokenizer_init, trust_remote_code, tensor_parallel_size, dtype, quantization, revision, tokenizer_revision, seed, gpu_memory_utilization, swap_space, enforce_eager, max_context_len_to_capture, max_seq_len_to_capture, disable_custom_all_reduce, **kwargs)
    127 raise TypeError(
    128     "There is no need to pass vision-related arguments anymore.")
    129 engine_args = EngineArgs(
    130     model=model,
    131     tokenizer=tokenizer,
   (...)
    147     **kwargs,
    148 )
--> 149 self.llm_engine = LLMEngine.from_engine_args(
    150     engine_args, usage_context=UsageContext.LLM_CLASS)
    151 self.request_counter = Counter()

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/vllm/engine/llm_engine.py:414, in LLMEngine.from_engine_args(cls, engine_args, usage_context)
    411     executor_class = GPUExecutor
    413 # Create the LLM engine.
--> 414 engine = cls(
    415     **engine_config.to_dict(),
    416     executor_class=executor_class,
    417     log_stats=not engine_args.disable_log_stats,
    418     usage_context=usage_context,
    419 )
    420 return engine

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/vllm/engine/llm_engine.py:243, in LLMEngine.__init__(self, model_config, cache_config, parallel_config, scheduler_config, device_config, load_config, lora_config, multimodal_config, speculative_config, decoding_config, observability_config, executor_class, log_stats, usage_context, stat_loggers)
    237 self.generation_config_fields = _load_generation_config_dict(
    238     model_config)
    240 self.input_processor = INPUT_REGISTRY.create_input_processor(
    241     self.model_config)
--> 243 self.model_executor = executor_class(
    244     model_config=model_config,
    245     cache_config=cache_config,
    246     parallel_config=parallel_config,
    247     scheduler_config=scheduler_config,
    248     device_config=device_config,
    249     lora_config=lora_config,
    250     multimodal_config=multimodal_config,
    251     speculative_config=speculative_config,
    252     load_config=load_config,
    253 )
    255 if not self.model_config.embedding_mode:
    256     self._initialize_kv_caches()

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/vllm/executor/executor_base.py:42, in ExecutorBase.__init__(self, model_config, cache_config, parallel_config, scheduler_config, device_config, load_config, lora_config, multimodal_config, speculative_config)
     39 self.multimodal_config = multimodal_config
     40 self.speculative_config = speculative_config
---> 42 self._init_executor()

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/vllm/executor/gpu_executor.py:24, in GPUExecutor._init_executor(self)
     22 self.driver_worker = self._create_worker()
     23 self.driver_worker.init_device()
---> 24 self.driver_worker.load_model()

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/vllm/worker/worker.py:133, in Worker.load_model(self)
    132 def load_model(self):
--> 133     self.model_runner.load_model()

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/vllm/worker/model_runner.py:243, in GPUModelRunnerBase.load_model(self)
    241 def load_model(self) -> None:
    242     with CudaMemoryProfiler() as m:
--> 243         self.model = get_model(
    244             model_config=self.model_config,
    245             device_config=self.device_config,
    246             load_config=self.load_config,
    247             lora_config=self.lora_config,
    248             multimodal_config=self.multimodal_config,
    249             parallel_config=self.parallel_config,
    250             scheduler_config=self.scheduler_config,
    251             cache_config=self.cache_config,
    252         )
    254     self.model_memory_usage = m.consumed_memory
    255     logger.info("Loading model weights took %.4f GB",
    256                 self.model_memory_usage / float(2**30))

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/vllm/model_executor/model_loader/__init__.py:21, in get_model(model_config, load_config, device_config, parallel_config, scheduler_config, lora_config, multimodal_config, cache_config)
     14 def get_model(*, model_config: ModelConfig, load_config: LoadConfig,
     15               device_config: DeviceConfig, parallel_config: ParallelConfig,
     16               scheduler_config: SchedulerConfig,
     17               lora_config: Optional[LoRAConfig],
     18               multimodal_config: Optional[MultiModalConfig],
     19               cache_config: CacheConfig) -> nn.Module:
     20     loader = get_model_loader(load_config)
---> 21     return loader.load_model(model_config=model_config,
     22                              device_config=device_config,
     23                              lora_config=lora_config,
     24                              multimodal_config=multimodal_config,
     25                              parallel_config=parallel_config,
     26                              scheduler_config=scheduler_config,
     27                              cache_config=cache_config)

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py:270, in DefaultModelLoader.load_model(self, model_config, device_config, lora_config, multimodal_config, parallel_config, scheduler_config, cache_config)
    266 with torch.device(device_config.device):
    267     model = _initialize_model(model_config, self.load_config,
    268                               lora_config, multimodal_config,
    269                               cache_config)
--> 270     model.load_weights(
    271         self._get_weights_iterator(model_config.model,
    272                                    model_config.revision,
    273                                    fall_back_to_pt=getattr(
    274                                        model,
    275                                        "fall_back_to_pt_during_load",
    276                                        True)), )
    278 for _, module in model.named_modules():
    279     quant_method = getattr(module, "quant_method", None)

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/vllm/model_executor/models/llama.py:486, in LlamaForCausalLM.load_weights(self, weights)
    483     param = params_dict[name]
    484     weight_loader = getattr(param, "weight_loader",
    485                             default_weight_loader)
--> 486     weight_loader(param, loaded_weight)
    487 except KeyError:
    488     pass

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/vllm/model_executor/layers/linear.py:391, in MergedColumnParallelLinear.weight_loader(self, param, loaded_weight, loaded_shard_id)
    389 if output_dim is None:
    390     if needs_scalar_to_array is not None:
--> 391         param_data, loaded_weight = adjust_scalar_to_fused_array(
    392             param_data, loaded_weight, 0)
    394 assert param_data.shape == loaded_weight.shape
    395 param_data.copy_(loaded_weight)

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/vllm/model_executor/layers/linear.py:61, in adjust_scalar_to_fused_array(param, loaded_weight, shard_id)
     58 # AutoFP8 scales do not have a shape
     59 # compressed-tensors scales do have a shape
     60 if len(loaded_weight.shape) != 0:
---> 61     assert loaded_weight.shape[0] == 1
     62     loaded_weight = loaded_weight[0]
     64 return param[shard_id], loaded_weight

AssertionError:
```
Is this because of the "No model.safetensors.index.json found in remote" message? Is it a bug, or am I using vLLM the wrong way?
This is a bug - I will put up a patch
Fixed by https://github.com/vllm-project/vllm/pull/6238
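With a build that includes that patch, loading should work again. A minimal sketch of the intended usage (model name taken from the report above; the prompt and sampling settings are illustrative):

```python
from vllm import LLM, SamplingParams

# Sketch of the expected usage once the fix has landed; prompt and
# sampling settings are illustrative, not from the original report.
llm = LLM(model="StefanKrsteski/Phi-3-mini-4k-instruct-GPTQ-8bit",
          trust_remote_code=True)
outputs = llm.generate(["What is the capital of France?"],
                       SamplingParams(temperature=0.0, max_tokens=64))
print(outputs[0].outputs[0].text)
```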