Closed: w013nad closed this issue 1 month ago
It appears to be broken for quantization in general, even without CPU offload.
python3 -m vllm.entrypoints.openai.api_server --model /home/ndurkee/ndurkee/Meta-Llama-3.1-70B-Instruct/ --max-model-len 90000 -tp 4 --gpu-memory-utilization 0.99 --dtype auto --distributed-executor-backend mp --port 15001 --served-model-name /home/ndurkee/temp/llama3_70b_fixed/ --max-log-len 10 --use-v2-block-manager --disable-custom-all-reduce --enable-prefix-caching --quantization='fp8'
(VllmWorkerProcess pid=144) ERROR 08-06 14:49:20 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method load_model: shape '[-1, 32]' is invalid for input of size 1, Traceback (most recent call last):
(VllmWorkerProcess pid=144) ERROR 08-06 14:49:20 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=144) ERROR 08-06 14:49:20 multiproc_worker_utils.py:226] output = executor(*args, **kwargs)
(VllmWorkerProcess pid=144) ERROR 08-06 14:49:20 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 139, in load_model
(VllmWorkerProcess pid=144) ERROR 08-06 14:49:20 multiproc_worker_utils.py:226] self.model_runner.load_model()
(VllmWorkerProcess pid=144) ERROR 08-06 14:49:20 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 722, in load_model
(VllmWorkerProcess pid=144) ERROR 08-06 14:49:20 multiproc_worker_utils.py:226] self.model = get_model(model_config=self.model_config,
(VllmWorkerProcess pid=144) ERROR 08-06 14:49:20 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
(VllmWorkerProcess pid=144) ERROR 08-06 14:49:20 multiproc_worker_utils.py:226] return loader.load_model(model_config=model_config,
(VllmWorkerProcess pid=144) ERROR 08-06 14:49:20 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 344, in load_model
(VllmWorkerProcess pid=144) ERROR 08-06 14:49:20 multiproc_worker_utils.py:226] quant_method.process_weights_after_loading(module)
(VllmWorkerProcess pid=144) ERROR 08-06 14:49:20 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/fp8.py", line 212, in process_weights_after_loading
(VllmWorkerProcess pid=144) ERROR 08-06 14:49:20 multiproc_worker_utils.py:226] prepare_fp8_layer_for_marlin(layer)
(VllmWorkerProcess pid=144) ERROR 08-06 14:49:20 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/utils/marlin_utils_fp8.py", line 80, in prepare_fp8_layer_for_marlin
(VllmWorkerProcess pid=144) ERROR 08-06 14:49:20 multiproc_worker_utils.py:226] marlin_scales = marlin_permute_scales(s=scales,
(VllmWorkerProcess pid=144) ERROR 08-06 14:49:20 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/utils/marlin_utils.py", line 172, in marlin_permute_scales
(VllmWorkerProcess pid=144) ERROR 08-06 14:49:20 multiproc_worker_utils.py:226] s = s.reshape((-1, len(scale_perm_single)))[:, scale_perm_single]
(VllmWorkerProcess pid=144) ERROR 08-06 14:49:20 multiproc_worker_utils.py:226] RuntimeError: shape '[-1, 32]' is invalid for input of size 1
(VllmWorkerProcess pid=144) ERROR 08-06 14:49:20 multiproc_worker_utils.py:226]
(VllmWorkerProcess pid=143) ERROR 08-06 14:49:20 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method load_model: shape '[-1, 32]' is invalid for input of size 1, Traceback (most recent call last):
(VllmWorkerProcess pid=143) ERROR 08-06 14:49:20 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=143) ERROR 08-06 14:49:20 multiproc_worker_utils.py:226] output = executor(*args, **kwargs)
(VllmWorkerProcess pid=143) ERROR 08-06 14:49:20 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 139, in load_model
(VllmWorkerProcess pid=143) ERROR 08-06 14:49:20 multiproc_worker_utils.py:226] self.model_runner.load_model()
(VllmWorkerProcess pid=143) ERROR 08-06 14:49:20 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 722, in load_model
(VllmWorkerProcess pid=143) ERROR 08-06 14:49:20 multiproc_worker_utils.py:226] self.model = get_model(model_config=self.model_config,
(VllmWorkerProcess pid=143) ERROR 08-06 14:49:20 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
(VllmWorkerProcess pid=143) ERROR 08-06 14:49:20 multiproc_worker_utils.py:226] return loader.load_model(model_config=model_config,
(VllmWorkerProcess pid=143) ERROR 08-06 14:49:20 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 344, in load_model
(VllmWorkerProcess pid=143) ERROR 08-06 14:49:20 multiproc_worker_utils.py:226] quant_method.process_weights_after_loading(module)
(VllmWorkerProcess pid=143) ERROR 08-06 14:49:20 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/fp8.py", line 212, in process_weights_after_loading
(VllmWorkerProcess pid=143) ERROR 08-06 14:49:20 multiproc_worker_utils.py:226] prepare_fp8_layer_for_marlin(layer)
(VllmWorkerProcess pid=143) ERROR 08-06 14:49:20 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/utils/marlin_utils_fp8.py", line 80, in prepare_fp8_layer_for_marlin
(VllmWorkerProcess pid=143) ERROR 08-06 14:49:20 multiproc_worker_utils.py:226] marlin_scales = marlin_permute_scales(s=scales,
(VllmWorkerProcess pid=143) ERROR 08-06 14:49:20 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/utils/marlin_utils.py", line 172, in marlin_permute_scales
(VllmWorkerProcess pid=143) ERROR 08-06 14:49:20 multiproc_worker_utils.py:226] s = s.reshape((-1, len(scale_perm_single)))[:, scale_perm_single]
(VllmWorkerProcess pid=143) ERROR 08-06 14:49:20 multiproc_worker_utils.py:226] RuntimeError: shape '[-1, 32]' is invalid for input of size 1
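For reference, the reshape failure is easy to reproduce in isolation: the dynamic FP8 path ends up with a single per-tensor scale, while the Marlin permutation expects per-channel scales it can fold into rows of 32. A minimal sketch of that mismatch (the shapes here are illustrative assumptions, not the actual layer shapes):

import torch

# Illustrative only: a single per-tensor scale cannot be reshaped into the
# (-1, 32) layout that marlin_permute_scales expects for per-channel scales.
scales = torch.ones(1)               # one scale for the whole weight tensor
scale_perm_single = list(range(32))  # Marlin permutes scales in groups of 32

try:
    scales.reshape((-1, len(scale_perm_single)))[:, scale_perm_single]
except RuntimeError as e:
    print(e)  # shape '[-1, 32]' is invalid for input of size 1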
cc @mgoin for quantization and CPU offloading. I suspect this is a quantization issue, and it might be related to your quantized model.
@w013nad do you have an HF link for the model you're trying to use?
https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct
I will look into this, but are you sure you are using 0.5.4? Your logs and collect_env output mention 0.5.3.post1:
vLLM Version: 0.5.3.post1
and
INFO 08-06 12:48:23 api_server.py:219] vLLM API server version 0.5.3.post1
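A quick way to confirm which wheel is actually active in an environment like this (a minimal check, assuming the standard vllm.__version__ attribute):

# Print the installed vLLM version from the current Python environment.
import vllm
print(vllm.__version__)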
Shoot, some of this was with a prerelease wheel. There seem to be two separate issues here:
1. FP8 quantization doesn't work
root@96aed4dedb69:/home/ndurkee# python3 -m vllm.entrypoints.openai.api_server --model /home/ndurkee/Llama-3-8B-Instruct/ -tp 4 --gpu-memory-utilization 0.79 --dtype auto --distributed-executor-backend mp --port 5006 --served-model-name /home/ndurkee/temp/llama3_70b_fixed/ --max-model-len 1000 --max-log-len 10 --use-v2-block-manager --disable-custom-all-reduce --enable-prefix-caching --quantization='fp8'
INFO 08-06 18:47:55 api_server.py:339] vLLM API server version 0.5.4
INFO 08-06 18:47:55 api_server.py:340] args: Namespace(host=None, port=5006, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, model='/home/ndurkee/Llama-3-8B-Instruct/', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=1000, guided_decoding_backend='outlines', distributed_executor_backend='mp', worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=4, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=True, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.79, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization='fp8', rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=True, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['/home/ndurkee/temp/llama3_70b_fixed/'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=10)
WARNING 08-06 18:47:56 config.py:1454] Casting torch.bfloat16 to torch.float16.
INFO 08-06 18:47:56 llm_engine.py:174] Initializing an LLM engine (v0.5.4) with config: model='/home/ndurkee/Llama-3-8B-Instruct/', speculative_config=None, tokenizer='/home/ndurkee/Llama-3-8B-Instruct/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=1000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=fp8, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/home/ndurkee/temp/llama3_70b_fixed/, use_v2_block_manager=True, enable_prefix_caching=True)
WARNING 08-06 18:47:56 multiproc_gpu_executor.py:59] Reducing Torch parallelism from 128 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 08-06 18:47:56 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
(VllmWorkerProcess pid=3133) INFO 08-06 18:47:56 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=3134) INFO 08-06 18:47:56 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=3135) INFO 08-06 18:47:56 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
INFO 08-06 18:47:58 utils.py:841] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=3135) INFO 08-06 18:47:58 utils.py:841] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=3134) INFO 08-06 18:47:58 utils.py:841] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=3133) INFO 08-06 18:47:58 utils.py:841] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=3135) INFO 08-06 18:47:58 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 08-06 18:47:58 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=3134) INFO 08-06 18:47:58 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=3133) INFO 08-06 18:47:58 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 08-06 18:47:59 shm_broadcast.py:235] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1, 2, 3], buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at 0x7f9612647eb0>, local_subscribe_port=47097, remote_subscribe_port=None)
INFO 08-06 18:47:59 model_runner.py:720] Starting to load model /home/ndurkee/Llama-3-8B-Instruct/...
(VllmWorkerProcess pid=3134) INFO 08-06 18:47:59 model_runner.py:720] Starting to load model /home/ndurkee/Llama-3-8B-Instruct/...
(VllmWorkerProcess pid=3135) INFO 08-06 18:47:59 model_runner.py:720] Starting to load model /home/ndurkee/Llama-3-8B-Instruct/...
(VllmWorkerProcess pid=3133) INFO 08-06 18:47:59 model_runner.py:720] Starting to load model /home/ndurkee/Llama-3-8B-Instruct/...
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:00<00:01, 2.62it/s]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:00<00:00, 2.63it/s]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:00<00:00, 3.67it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:01<00:00, 3.26it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:01<00:00, 3.16it/s]
WARNING 08-06 18:48:01 utils.py:578] Your GPU does not have native support for FP8 computation but FP8 quantization is being used. Weight-only FP8 compression will be used leveraging the Marlin kernel. This may degrade performance for compute-heavy workloads.
ERROR 08-06 18:48:01 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 3135 died, exit code: -15
INFO 08-06 18:48:01 multiproc_worker_utils.py:123] Killing local vLLM worker processes
Process Process-1:
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 217, in run_rpc_server
server = AsyncEngineRPCServer(async_engine_args, usage_context, port)
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 25, in init
self.engine = AsyncLLMEngine.from_engine_args(async_engine_args,
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 471, in from_engine_args
engine = cls(
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 381, in init
self.engine = self._init_engine(*args, *kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 552, in _init_engine
return engine_class(args, kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 249, in init
self.model_executor = executor_class(
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 215, in init
super().init(*args, kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 25, in init
super().init(*args, *kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 47, in init
self._init_executor()
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 138, in _init_executor
self._run_workers("load_model",
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 192, in _run_workers
driver_worker_output = driver_worker_method(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 139, in load_model
self.model_runner.load_model()
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 722, in load_model
self.model = get_model(model_config=self.model_config,
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/init.py", line 21, in get_model
return loader.load_model(model_config=model_config,
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 344, in load_model
quant_method.process_weights_after_loading(module)
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/fp8.py", line 212, in process_weights_after_loading
prepare_fp8_layer_for_marlin(layer)
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/utils/marlin_utils_fp8.py", line 80, in prepare_fp8_layer_for_marlin
marlin_scales = marlin_permute_scales(s=scales,
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/utils/marlin_utils.py", line 172, in marlin_permute_scales
s = s.reshape((-1, len(scale_perm_single)))[:, scale_perm_single]
RuntimeError: shape '[-1, 32]' is invalid for input of size 1
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
^CTraceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 370, in
2. GPTQ cpu offload doesn't work
root@96aed4dedb69:/home/ndurkee# python3 -m vllm.entrypoints.openai.api_server --model /home/ndurkee/temp/llama3_8b_gptq -tp 4 --gpu-memory-utilization 0.79 --dtype auto --distributed-executor-backend mp --port 5006 --served-model-name /home/ndurkee/temp/llama3_70b_fixed/ --max-model-len 1000 --max-log-len 10 --use-v2-block-manager --disable-custom-all-reduce --enable-prefix-caching --cpu-offload-gb 5
INFO 08-06 18:45:29 api_server.py:339] vLLM API server version 0.5.4
INFO 08-06 18:45:29 api_server.py:340] args: Namespace(host=None, port=5006, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, model='/home/ndurkee/temp/llama3_8b_gptq', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=1000, guided_decoding_backend='outlines', distributed_executor_backend='mp', worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=4, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=True, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=5.0, gpu_memory_utilization=0.79, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=True, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['/home/ndurkee/temp/llama3_70b_fixed/'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=10)
INFO 08-06 18:45:29 gptq_marlin.py:98] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
INFO 08-06 18:45:29 gptq_marlin.py:98] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
INFO 08-06 18:45:29 llm_engine.py:174] Initializing an LLM engine (v0.5.4) with config: model='/home/ndurkee/temp/llama3_8b_gptq', speculative_config=None, tokenizer='/home/ndurkee/temp/llama3_8b_gptq', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=1000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=gptq_marlin, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/home/ndurkee/temp/llama3_70b_fixed/, use_v2_block_manager=True, enable_prefix_caching=True)
WARNING 08-06 18:45:29 multiproc_gpu_executor.py:59] Reducing Torch parallelism from 128 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 08-06 18:45:29 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
(VllmWorkerProcess pid=2602) INFO 08-06 18:45:30 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=2603) INFO 08-06 18:45:30 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=2604) INFO 08-06 18:45:30 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=2602) INFO 08-06 18:45:31 utils.py:841] Found nccl from library libnccl.so.2
INFO 08-06 18:45:31 utils.py:841] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=2603) INFO 08-06 18:45:31 utils.py:841] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=2602) INFO 08-06 18:45:31 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=2604) INFO 08-06 18:45:31 utils.py:841] Found nccl from library libnccl.so.2
INFO 08-06 18:45:31 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=2603) INFO 08-06 18:45:31 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=2604) INFO 08-06 18:45:31 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 08-06 18:45:32 shm_broadcast.py:235] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1, 2, 3], buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at 0x7fbf8e91a590>, local_subscribe_port=57567, remote_subscribe_port=None)
INFO 08-06 18:45:32 model_runner.py:720] Starting to load model /home/ndurkee/temp/llama3_8b_gptq...
(VllmWorkerProcess pid=2602) INFO 08-06 18:45:32 model_runner.py:720] Starting to load model /home/ndurkee/temp/llama3_8b_gptq...
(VllmWorkerProcess pid=2603) INFO 08-06 18:45:32 model_runner.py:720] Starting to load model /home/ndurkee/temp/llama3_8b_gptq...
(VllmWorkerProcess pid=2604) INFO 08-06 18:45:32 model_runner.py:720] Starting to load model /home/ndurkee/temp/llama3_8b_gptq...
(VllmWorkerProcess pid=2604) ERROR 08-06 18:45:33 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method load_model: Cannot copy out of meta tensor; no data!, Traceback (most recent call last):
(VllmWorkerProcess pid=2604) ERROR 08-06 18:45:33 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=2604) ERROR 08-06 18:45:33 multiproc_worker_utils.py:226] output = executor(*args, **kwargs)
(VllmWorkerProcess pid=2604) ERROR 08-06 18:45:33 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 139, in load_model
(VllmWorkerProcess pid=2604) ERROR 08-06 18:45:33 multiproc_worker_utils.py:226] self.model_runner.load_model()
(VllmWorkerProcess pid=2604) ERROR 08-06 18:45:33 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 722, in load_model
(VllmWorkerProcess pid=2604) ERROR 08-06 18:45:33 multiproc_worker_utils.py:226] self.model = get_model(model_config=self.model_config,
(VllmWorkerProcess pid=2604) ERROR 08-06 18:45:33 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/init.py", line 21, in get_model
(VllmWorkerProcess pid=2604) ERROR 08-06 18:45:33 multiproc_worker_utils.py:226] return loader.load_model(model_config=model_config,
(VllmWorkerProcess pid=2604) ERROR 08-06 18:45:33 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 324, in load_model
(VllmWorkerProcess pid=2604) ERROR 08-06 18:45:33 multiproc_worker_utils.py:226] model = _initialize_model(model_config, self.load_config,
(VllmWorkerProcess pid=2604) ERROR 08-06 18:45:33 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 154, in _initialize_model
(VllmWorkerProcess pid=2604) ERROR 08-06 18:45:33 multiproc_worker_utils.py:226] return model_class(config=model_config.hf_config,
(VllmWorkerProcess pid=2604) ERROR 08-06 18:45:33 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 384, in init
(VllmWorkerProcess pid=2604) ERROR 08-06 18:45:33 multiproc_worker_utils.py:226] self.model = LlamaModel(config,
(VllmWorkerProcess pid=2604) ERROR 08-06 18:45:33 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 285, in init
(VllmWorkerProcess pid=2604) ERROR 08-06 18:45:33 multiproc_worker_utils.py:226] self.start_layer, self.end_layer, self.layers = make_layers(
(VllmWorkerProcess pid=2604) ERROR 08-06 18:45:33 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/utils.py", line 146, in make_layers
(VllmWorkerProcess pid=2604) ERROR 08-06 18:45:33 multiproc_worker_utils.py:226] [PPMissingLayer() for _ in range(start_layer)] + [
(VllmWorkerProcess pid=2604) ERROR 08-06 18:45:33 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/utils.py", line 147, in <listcomp>
GPTQ does work by itself. Note that this is on A100s.
Okay, I confirmed that dynamic FP8 works fine on H100 but fails on A100. This is an issue with the dynamic FP8 Marlin backend.
vllm serve meta-llama/Meta-Llama-3-8B-Instruct --quantization="fp8" --port 9000
...
File "/home/mgoin/venvs/vllm-rel/lib/python3.10/site-packages/vllm/model_executor/layers/quantization/utils/marlin_utils.py", line 172, in marlin_permute_scales
s = s.reshape((-1, len(scale_perm_single)))[:, scale_perm_single]
RuntimeError: shape '[-1, 32]' is invalid for input of size 1
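The H100 vs. A100 split lines up with hardware FP8 support; roughly, the decision looks like the sketch below (the 8.9/9.0 capability threshold is an assumption for illustration, matching Ada/Hopper, not quoted from the vLLM source):

import torch

# Illustration: native FP8 needs compute capability >= 8.9 (Ada) or 9.0 (Hopper).
# An A100 reports (8, 0), so vLLM falls back to the FP8 Marlin weight-only path,
# which is where the reshape error above is raised.
major, minor = torch.cuda.get_device_capability()
native_fp8 = (major, minor) >= (8, 9)
print(f"compute capability {major}.{minor} ->",
      "native FP8" if native_fp8 else "Marlin weight-only fallback")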
It does work fine with models that are already quantized to FP8 on A100:
vllm serve neuralmagic/Meta-Llama-3-8B-Instruct-FP8 --quantization="fp8" --port 9000
...
INFO: Uvicorn running on http://0.0.0.0:9000 (Press CTRL+C to quit)
I opened a tracking issue here: https://github.com/vllm-project/vllm/issues/7216. Looking into this first.
@w013nad If you can build and test from source, please try my PR to fix dynamic FP8 Marlin: https://github.com/vllm-project/vllm/pull/7219. It seems to fix the issue in my reproduction.
I will look into GPTQ CPU offloading now.
Verified that forcing GPTQ with cpu offload works:
vllm serve Qwen/Qwen2-0.5B-Instruct-GPTQ-Int4 --cpu-offload-gb 5 --quantization gptq
...
INFO 08-06 21:20:41 gptq_marlin.py:102] Detected that the model can run with gptq_marlin, however you specified quantization=gptq explicitly, so forcing gptq. Use quantization=gptq_marlin for faster inference
...
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
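The same workaround maps onto the offline Python API as well (a sketch, assuming the CLI flags above correspond to LLM constructor keyword arguments of the same names):

from vllm import LLM, SamplingParams

# Workaround sketch: force the plain gptq kernel instead of gptq_marlin when
# combining a GPTQ checkpoint with CPU offload, mirroring the CLI flags above.
llm = LLM(
    model="Qwen/Qwen2-0.5B-Instruct-GPTQ-Int4",
    quantization="gptq",   # skip the gptq_marlin auto-conversion
    cpu_offload_gb=5,      # offload part of the weights to CPU RAM
)

out = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)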
The issue is specifically with GPTQ Marlin:
vllm serve Qwen/Qwen2-0.5B-Instruct-GPTQ-Int4 --cpu-offload-gb 5
...
INFO 08-06 21:21:46 gptq_marlin.py:98] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
...
File "/home/mgoin/code/vllm/vllm/model_executor/models/utils.py", line 195, in <listcomp>
maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
File "/home/mgoin/code/vllm/vllm/model_executor/models/utils.py", line 152, in maybe_offload_to_cpu
cpu_data.copy_(p.data)
File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/torch/utils/_device.py", line 79, in __torch_function__
return func(*args, **kwargs)
NotImplementedError: Cannot copy out of meta tensor; no data!
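The underlying failure is PyTorch's generic restriction on meta tensors; a standalone illustration (shapes arbitrary):

import torch

# Illustrative only: a parameter still on the "meta" device has no storage,
# so copying it into a CPU buffer raises the same error as the trace above.
p = torch.empty(4, 4, device="meta")  # placeholder weight without data
cpu_buffer = torch.empty(4, 4)

try:
    cpu_buffer.copy_(p)
except NotImplementedError as e:
    print(e)  # Cannot copy out of meta tensor; no data!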
@w013nad ditto for the GPTQ Marlin fix linked above ^
Thank you very much for reporting these issues and my apologies for letting them slip through this release. I added explicit tests for both of these cases so they will be caught in automation going forward.
Sorry, I'm not able to build from source. I'm stuck using your nightly PyPI packages or Docker images because this is a closed environment.
Looking forward to seeing this fix released! (I am seeing the same problem.)
🐛 Describe the bug
I'm running vLLM 0.5.4. I was trying to run a GPTQ model with CPU offloading. This should have been fixed by #6960, but it appears not.