Open Maydaytyh opened 6 months ago
I have the same error.
```
ERROR 04-24 21:28:44 worker_base.py:157] KeyError: 'model.layers.55.mlp.down_proj.in_scale'
(RayWorkerWrapper pid=3766121) ERROR 04-24 21:28:45 worker_base.py:157] Error executing method load_model. This might cause deadlock in distributed execution.
(RayWorkerWrapper pid=3766121) ERROR 04-24 21:28:45 worker_base.py:157] Traceback (most recent call last):
(RayWorkerWrapper pid=3766121) ERROR 04-24 21:28:45 worker_base.py:157]   File "/home/d/anaconda3/envs/3.8/lib/python3.8/site-packages/vllm/worker/worker_base.py", line 149, in execute_method
(RayWorkerWrapper pid=3766121) ERROR 04-24 21:28:45 worker_base.py:157]     return executor(*args, **kwargs)
(RayWorkerWrapper pid=3766121) ERROR 04-24 21:28:45 worker_base.py:157]   File "/home/d/anaconda3/envs/3.8/lib/python3.8/site-packages/vllm/worker/worker.py", line 117, in load_model
(RayWorkerWrapper pid=3766121) ERROR 04-24 21:28:45 worker_base.py:157]     self.model_runner.load_model()
(RayWorkerWrapper pid=3766121) ERROR 04-24 21:28:45 worker_base.py:157]   File "/home/d/anaconda3/envs/3.8/lib/python3.8/site-packages/vllm/worker/model_runner.py", line 162, in load_model
(RayWorkerWrapper pid=3766121) ERROR 04-24 21:28:45 worker_base.py:157]     self.model = get_model(
(RayWorkerWrapper pid=3766121) ERROR 04-24 21:28:45 worker_base.py:157]   File "/home/d/anaconda3/envs/3.8/lib/python3.8/site-packages/vllm/model_executor/model_loader/__init__.py", line 19, in get_model
(RayWorkerWrapper pid=3766121) ERROR 04-24 21:28:45 worker_base.py:157]     return loader.load_model(model_config=model_config,
(RayWorkerWrapper pid=3766121) ERROR 04-24 21:28:45 worker_base.py:157]   File "/home/d/anaconda3/envs/3.8/lib/python3.8/site-packages/vllm/model_executor/model_loader/loader.py", line 224, in load_model
(RayWorkerWrapper pid=3766121) ERROR 04-24 21:28:45 worker_base.py:157]     model.load_weights(
(RayWorkerWrapper pid=3766121) ERROR 04-24 21:28:45 worker_base.py:157]   File "/home/d/anaconda3/envs/3.8/lib/python3.8/site-packages/vllm/model_executor/models/llama.py", line 411, in load_weights
(RayWorkerWrapper pid=3766121) ERROR 04-24 21:28:45 worker_base.py:157]     param = params_dict[name]
(RayWorkerWrapper pid=3766121) ERROR 04-24 21:28:45 worker_base.py:157] KeyError: 'model.layers.55.mlp.down_proj.in_scale'
```
Hi @Maydaytyh
```
GPU 1: NVIDIA RTX A6000
GPU 2: NVIDIA RTX A6000
[...]
ValueError: The quantization method fp8 is not supported for the current GPU. Minimum capability: 90. Current capability: 86.
```
FP8 is only supported on >=sm90, i.e. Hopper cards. (Per fp8.py, support for sm89 (Ada, e.g. the 4090) may come once vLLM upgrades to PyTorch 2.3.0.)
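The capability number in the error is just the CUDA compute capability with major and minor digits combined (sm86 on an A6000 → 86), compared against the method's minimum. A self-contained sketch of that check (the function name and constant are mine, not vLLM's actual API):

```python
# Minimal sketch of a compute-capability gate, assuming the same numbering
# scheme vLLM's error message uses (major * 10 + minor). Names are illustrative.
FP8_MIN_CAPABILITY = 90  # Hopper (sm90); Ada (sm89) is below this threshold.

def check_fp8_supported(major: int, minor: int) -> None:
    """Raise ValueError if the GPU's compute capability is below FP8's minimum."""
    capability = major * 10 + minor
    if capability < FP8_MIN_CAPABILITY:
        raise ValueError(
            f"The quantization method fp8 is not supported for the current GPU. "
            f"Minimum capability: {FP8_MIN_CAPABILITY}. "
            f"Current capability: {capability}.")

# An RTX A6000 reports compute capability (8, 6) -> 86, so this raises;
# an H100 reports (9, 0) -> 90 and passes.
```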
AWQ and GPTQ quantization are much less hardware-specific; you might try one of those instead.
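For example, an AWQ checkpoint can be served by passing the `--quantization` flag to vLLM's OpenAI-compatible server (the model ID below is just an illustration — substitute any AWQ checkpoint):

```shell
# Serve an AWQ-quantized model; AWQ kernels run on pre-Hopper GPUs
# such as the RTX A6000 (sm86). Model ID is illustrative.
python -m vllm.entrypoints.openai.api_server \
    --model TheBloke/Llama-2-7B-AWQ \
    --quantization awq
```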
Yes, this is intentional. At the moment, FP8 will only be supported where we have native hardware support.
Your current environment
🐛 Describe the bug
And the error is:

```
ValueError: The quantization method fp8 is not supported for the current GPU. Minimum capability: 90. Current capability: 86.
(RayWorkerWrapper pid=2202490) ERROR 04-24 16:06:40 worker_base.py:157] Error executing method load_model. This might cause deadlock in distributed execution.
(RayWorkerWrapper pid=2202490) ERROR 04-24 16:06:40 worker_base.py:157] Traceback (most recent call last):
(RayWorkerWrapper pid=2202490) ERROR 04-24 16:06:40 worker_base.py:157]   File "/data/tianyuhang/.conda/envs/llama3/lib/python3.9/site-packages/vllm/worker/worker_base.py", line 149, in execute_method
(RayWorkerWrapper pid=2202490) ERROR 04-24 16:06:40 worker_base.py:157]     return executor(*args, **kwargs)
(RayWorkerWrapper pid=2202490) ERROR 04-24 16:06:40 worker_base.py:157]   File "/data/tianyuhang/.conda/envs/llama3/lib/python3.9/site-packages/vllm/worker/worker.py", line 117, in load_model
(RayWorkerWrapper pid=2202490) ERROR 04-24 16:06:40 worker_base.py:157]     self.model_runner.load_model()
(RayWorkerWrapper pid=2202490) ERROR 04-24 16:06:40 worker_base.py:157]   File "/data/tianyuhang/.conda/envs/llama3/lib/python3.9/site-packages/vllm/worker/model_runner.py", line 162, in load_model
(RayWorkerWrapper pid=2202490) ERROR 04-24 16:06:40 worker_base.py:157]     self.model = get_model(
(RayWorkerWrapper pid=2202490) ERROR 04-24 16:06:40 worker_base.py:157]   File "/data/tianyuhang/.conda/envs/llama3/lib/python3.9/site-packages/vllm/model_executor/model_loader/__init__.py", line 19, in get_model
(RayWorkerWrapper pid=2202490) ERROR 04-24 16:06:40 worker_base.py:157]     return loader.load_model(model_config=model_config,
(RayWorkerWrapper pid=2202490) ERROR 04-24 16:06:40 worker_base.py:157]   File "/data/tianyuhang/.conda/envs/llama3/lib/python3.9/site-packages/vllm/model_executor/model_loader/loader.py", line 222, in load_model
(RayWorkerWrapper pid=2202490) ERROR 04-24 16:06:40 worker_base.py:157]     model = _initialize_model(model_config, self.load_config,
(RayWorkerWrapper pid=2202490) ERROR 04-24 16:06:40 worker_base.py:157]   File "/data/tianyuhang/.conda/envs/llama3/lib/python3.9/site-packages/vllm/model_executor/model_loader/loader.py", line 88, in _initialize_model
(RayWorkerWrapper pid=2202490) ERROR 04-24 16:06:40 worker_base.py:157]     linear_method = _get_linear_method(model_config, load_config)
(RayWorkerWrapper pid=2202490) ERROR 04-24 16:06:40 worker_base.py:157]   File "/data/tianyuhang/.conda/envs/llama3/lib/python3.9/site-packages/vllm/model_executor/model_loader/loader.py", line 47, in _get_linear_method
(RayWorkerWrapper pid=2202490) ERROR 04-24 16:06:40 worker_base.py:157]     raise ValueError(
(RayWorkerWrapper pid=2202490) ERROR 04-24 16:06:40 worker_base.py:157] ValueError: The quantization method fp8 is not supported for the current GPU. Minimum capability: 90. Current capability: 86.
(RayWorkerWrapper pid=2202821) INFO 04-24 16:06:36 pynccl_utils.py:43] vLLM is using nccl==2.18.1 [repeated 2x across cluster]
(RayWorkerWrapper pid=2202821) WARNING 04-24 16:06:39 custom_all_reduce.py:65] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly. [repeated 2x across cluster]
(RayWorkerWrapper pid=2202821) ERROR 04-24 16:06:40 worker_base.py:157] Error executing method load_model. This might cause deadlock in distributed execution. [repeated 2x across cluster]
[... identical traceback repeated 2x across cluster ...]
(RayWorkerWrapper pid=2202821) ERROR 04-24 16:06:40 worker_base.py:157] ValueError: The quantization method fp8 is not supported for the current GPU. Minimum capability: 90. Current capability: 86. [repeated 2x across cluster]
```