derpyhue closed this issue 4 weeks ago.
Can you try running with --disable-frontend-multiprocessing? I just want to rule out that this is a ZeroMQ issue.
Woah that was fast. Will try it now!
It does fix the error, but the problem still persists, so I think the error was related to something else. https://github.com/OpenGVLab/EfficientQAT is quite new, so I can imagine it needs some more time to develop.
Thanks for responding so fast, though!
For some more context, I'm using:
docker run --name vllm --runtime nvidia --gpus all -v ~/.cache/huggingface:/root/.cache/huggingface --env "HUGGING_FACE_HUB_TOKEN=" -p 8000:8000 --ipc=host vllm/vllm-openai --model ChenMnZ/Mistral-Large-Instruct-2407-EfficientQAT-w2g64-GPTQ --dtype auto --disable-custom-all-reduce --max-model-len 12288 --max-seq-len-to-capture 12288 -tp 4 --max_num_seqs 4 --use-v2-block-manager --gpu-memory-utilization 0.93 --swap-space 2 --enable-chunked-prefill --max_num_batched_tokens 256 --disable-frontend-multiprocessing
with 4 RTX 3060s.
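For reference, here is a minimal sanity check against the running container (this snippet is an illustration, not part of the original report; it assumes the -p 8000:8000 mapping from the command above and the requests package):

import requests

# Confirm the server is up and is serving the expected model
# (the /health and /v1/models routes are listed in the server log).
base = "http://localhost:8000"
print(requests.get(f"{base}/health").status_code)                 # expect 200
print(requests.get(f"{base}/v1/models").json()["data"][0]["id"])  # expect the served model name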
Can you post the stack trace with --disable-frontend-multiprocessing?
INFO 08-08 13:01:26 api_server.py:352] vLLM API server version 0.5.4
INFO 08-08 13:01:26 api_server.py:353] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=True, model='ChenMnZ/Mistral-Large-Instruct-2407-EfficientQAT-w2g64-GPTQ', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=12288, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=4, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=2, cpu_offload_gb=0, gpu_memory_utilization=0.93, num_gpu_blocks_override=None, max_num_batched_tokens=256, max_num_seqs=4, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=12288, disable_custom_all_reduce=True, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=True, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
WARNING 08-08 13:01:26 config.py:286] gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
WARNING 08-08 13:01:26 config.py:286] gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 08-08 13:01:26 config.py:762] Defaulting to use mp for distributed inference
INFO 08-08 13:01:26 config.py:853] Chunked prefill is enabled with max_num_batched_tokens=256.
INFO 08-08 13:01:26 llm_engine.py:176] Initializing an LLM engine (v0.5.4) with config: model='ChenMnZ/Mistral-Large-Instruct-2407-EfficientQAT-w2g64-GPTQ', speculative_config=None, tokenizer='ChenMnZ/Mistral-Large-Instruct-2407-EfficientQAT-w2g64-GPTQ', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=12288, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=gptq, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=ChenMnZ/Mistral-Large-Instruct-2407-EfficientQAT-w2g64-GPTQ, use_v2_block_manager=True, enable_prefix_caching=False)
WARNING 08-08 13:01:27 multiproc_gpu_executor.py:59] Reducing Torch parallelism from 6 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 08-08 13:01:27 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
(VllmWorkerProcess pid=36) INFO 08-08 13:01:27 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=35) INFO 08-08 13:01:27 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=37) INFO 08-08 13:01:27 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=35) INFO 08-08 13:01:28 utils.py:942] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=35) INFO 08-08 13:01:28 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=36) INFO 08-08 13:01:28 utils.py:942] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=37) INFO 08-08 13:01:28 utils.py:942] Found nccl from library libnccl.so.2
INFO 08-08 13:01:28 utils.py:942] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=36) INFO 08-08 13:01:28 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=37) INFO 08-08 13:01:28 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 08-08 13:01:28 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 08-08 13:01:28 shm_broadcast.py:235] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1, 2, 3], buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at 0x775fd6d78b30>, local_subscribe_port=35941, remote_subscribe_port=None)
INFO 08-08 13:01:28 model_runner.py:721] Starting to load model ChenMnZ/Mistral-Large-Instruct-2407-EfficientQAT-w2g64-GPTQ...
(VllmWorkerProcess pid=36) INFO 08-08 13:01:28 model_runner.py:721] Starting to load model ChenMnZ/Mistral-Large-Instruct-2407-EfficientQAT-w2g64-GPTQ...
(VllmWorkerProcess pid=35) INFO 08-08 13:01:28 model_runner.py:721] Starting to load model ChenMnZ/Mistral-Large-Instruct-2407-EfficientQAT-w2g64-GPTQ...
(VllmWorkerProcess pid=37) INFO 08-08 13:01:28 model_runner.py:721] Starting to load model ChenMnZ/Mistral-Large-Instruct-2407-EfficientQAT-w2g64-GPTQ...
INFO 08-08 13:01:29 weight_utils.py:231] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=35) INFO 08-08 13:01:29 weight_utils.py:231] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=37) INFO 08-08 13:01:29 weight_utils.py:231] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=36) INFO 08-08 13:01:29 weight_utils.py:231] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 20% Completed | 1/5 [00:06<00:27, 6.76s/it]
Loading safetensors checkpoint shards: 40% Completed | 2/5 [00:18<00:29, 9.74s/it]
Loading safetensors checkpoint shards: 60% Completed | 3/5 [00:28<00:19, 9.85s/it]
Loading safetensors checkpoint shards: 80% Completed | 4/5 [00:40<00:10, 10.71s/it]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:53<00:00, 11.40s/it]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:53<00:00, 10.64s/it]
INFO 08-08 13:02:23 model_runner.py:733] Loading model weights took 8.6544 GB
(VllmWorkerProcess pid=37) INFO 08-08 13:02:23 model_runner.py:733] Loading model weights took 8.6544 GB
(VllmWorkerProcess pid=36) INFO 08-08 13:02:23 model_runner.py:733] Loading model weights took 8.6544 GB
(VllmWorkerProcess pid=35) INFO 08-08 13:02:23 model_runner.py:733] Loading model weights took 8.6544 GB
INFO 08-08 13:02:27 distributed_gpu_executor.py:56] # GPU blocks: 1194, # CPU blocks: 1489
(VllmWorkerProcess pid=36) INFO 08-08 13:02:33 model_runner.py:1025] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(VllmWorkerProcess pid=36) INFO 08-08 13:02:33 model_runner.py:1029] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=35) INFO 08-08 13:02:33 model_runner.py:1025] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(VllmWorkerProcess pid=35) INFO 08-08 13:02:33 model_runner.py:1029] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=37) INFO 08-08 13:02:33 model_runner.py:1025] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 08-08 13:02:33 model_runner.py:1025] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 08-08 13:02:33 model_runner.py:1029] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=37) INFO 08-08 13:02:33 model_runner.py:1029] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 08-08 13:02:35 model_runner.py:1226] Graph capturing finished in 2 secs.
(VllmWorkerProcess pid=37) INFO 08-08 13:02:35 model_runner.py:1226] Graph capturing finished in 2 secs.
(VllmWorkerProcess pid=36) INFO 08-08 13:02:35 model_runner.py:1226] Graph capturing finished in 2 secs.
(VllmWorkerProcess pid=35) INFO 08-08 13:02:35 model_runner.py:1226] Graph capturing finished in 2 secs.
WARNING 08-08 13:02:35 serving_embedding.py:171] embedding_mode is False. Embedding API will not work.
INFO 08-08 13:02:35 launcher.py:14] Available routes are:
INFO 08-08 13:02:35 launcher.py:22] Route: /openapi.json, Methods: HEAD, GET
INFO 08-08 13:02:35 launcher.py:22] Route: /docs, Methods: HEAD, GET
INFO 08-08 13:02:35 launcher.py:22] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 08-08 13:02:35 launcher.py:22] Route: /redoc, Methods: HEAD, GET
INFO 08-08 13:02:35 launcher.py:22] Route: /health, Methods: GET
INFO 08-08 13:02:35 launcher.py:22] Route: /tokenize, Methods: POST
INFO 08-08 13:02:35 launcher.py:22] Route: /detokenize, Methods: POST
INFO 08-08 13:02:35 launcher.py:22] Route: /v1/models, Methods: GET
INFO 08-08 13:02:35 launcher.py:22] Route: /version, Methods: GET
INFO 08-08 13:02:35 launcher.py:22] Route: /v1/chat/completions, Methods: POST
INFO 08-08 13:02:35 launcher.py:22] Route: /v1/completions, Methods: POST
INFO 08-08 13:02:35 launcher.py:22] Route: /v1/embeddings, Methods: POST
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO: 172.17.0.1:38696 - "GET /v1/models HTTP/1.1" 200 OK
INFO 08-08 13:02:42 logger.py:36] Received request chat-3302c630be5f47a183541db925fdc83f: prompt: "<s>[INST] Show me a code snippet of a website's sticky header in CSS and JavaScript.[/INST]", params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=12266, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=8601), prompt_token_ids: [1, 3, 9378, 1296, 1032, 3464, 3270, 28351, 1070, 1032, 5168, 29510, 29481, 7674, 29492, 8503, 1065, 18690, 1072, 27049, 29491, 4], lora_request: None, prompt_adapter_request: None.
INFO: 172.17.0.1:38702 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 08-08 13:02:42 async_llm_engine.py:193] Added request chat-3302c630be5f47a183541db925fdc83f.
INFO 08-08 13:02:43 metrics.py:406] Avg prompt throughput: 2.9 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.2%, CPU KV cache usage: 0.0%.
INFO 08-08 13:02:48 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 22.7 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.8%, CPU KV cache usage: 0.0%.
INFO 08-08 13:02:53 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 22.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 1.3%, CPU KV cache usage: 0.0%.
INFO 08-08 13:02:58 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 22.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 1.9%, CPU KV cache usage: 0.0%.
INFO 08-08 13:02:59 logger.py:36] Received request chat-ac37d88686dd481a9f58b240c1e55f4e: prompt: "<s>[INST] Here is the query:\nShow me a code snippet of a website's sticky header in CSS and JavaScript.\n\nCreate a concise, 3-5 word phrase with an emoji as a title for the previous query. Suitable Emojis for the summary can be used to enhance understanding but avoid quotation marks or special formatting. RESPOND ONLY WITH THE TITLE TEXT.\n\nExamples of titles:\nš Stock Market Trends\nšŖ Perfect Chocolate Chip Recipe\nEvolution of Music Streaming\nRemote Work Productivity Tips\nArtificial Intelligence in Healthcare\nš® Video Game Development Insights[/INST]", params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=50, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=8601), prompt_token_ids: [1, 3, 4771, 1117, 1040, 6477, 29515, 781, 9166, 1296, 1032, 3464, 3270, 28351, 1070, 1032, 5168, 29510, 29481, 7674, 29492, 8503, 1065, 18690, 1072, 27049, 29491, 781, 781, 4766, 1032, 3846, 1632, 29493, 29473, 29538, 29501, 29550, 2475, 15572, 1163, 1164, 1645, 28581, 1158, 1032, 4709, 1122, 1040, 4222, 6477, 29491, 3442, 6147, 3697, 6822, 1046, 1122, 1040, 14828, 1309, 1115, 2075, 1066, 12744, 7167, 1330, 5229, 18296, 1120, 14959, 1210, 3609, 1989, 15526, 29491, 21076, 29521, 1600, 29525, 10456, 10648, 18742, 4567, 1088, 1921, 1948, 26543, 29491, 781, 781, 1734, 10642, 1070, 16541, 29515, 781, 1011, 930, 918, 908, 12316, 12411, 1088, 5850, 29481, 781, 1011, 930, 912, 941, 25211, 1457, 12727, 1457, 1276, 4291, 4753, 781, 9227, 2868, 1070, 8530, 16683, 1056, 781, 16246, 5834, 9836, 3342, 27174, 781, 11131, 15541, 23859, 1065, 7145, 8769, 781, 1011, 930, 913, 945, 13041, 9047, 11108, 10281, 3920, 4], lora_request: None, prompt_adapter_request: None.
INFO 08-08 13:02:59 async_llm_engine.py:193] Added request chat-ac37d88686dd481a9f58b240c1e55f4e.
INFO 08-08 13:02:59 async_llm_engine.py:204] Aborted request chat-3302c630be5f47a183541db925fdc83f.
INFO 08-08 13:03:02 async_llm_engine.py:160] Finished request chat-ac37d88686dd481a9f58b240c1e55f4e.
INFO: 172.17.0.1:52468 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 08-08 13:03:05 metrics.py:406] Avg prompt throughput: 19.8 tokens/s, Avg generation throughput: 10.7 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 08-08 13:03:15 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 08-08 13:03:25 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 08-08 13:03:35 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 08-08 13:03:45 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 08-08 13:03:55 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 08-08 13:04:05 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 08-08 13:04:12 launcher.py:45] Gracefully stopping http server
INFO: Shutting down
INFO: Waiting for application shutdown.
INFO: Application shutdown complete.
INFO 08-08 13:04:12 async_llm_engine.py:54] Engine is gracefully shutting down.
ERROR 08-08 13:04:12 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 36 died, exit code: -15
Very strange. Everything seems in order: it even outputs the message in the logs, but it does not output it to the frontend (Open WebUI). It does output garbage, though. However, Llama 3.1 70B AWQ 4-bit seems to do fine. I did see someone else saying, 'The model runs and processes tokens, however there're some issues with serving those from OAI vLLM API - so no luck'.
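One way to separate the frontend from the server is to hit the OpenAI-compatible endpoint directly and inspect the raw text. A minimal sketch (illustration only, assuming the same localhost:8000 mapping and the openai>=1.0 Python client):

from openai import OpenAI

# Talk to the vLLM OpenAI-compatible server directly, bypassing Open WebUI,
# to see what text (if any) actually comes back for this model.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="ChenMnZ/Mistral-Large-Instruct-2407-EfficientQAT-w2g64-GPTQ",
    messages=[{"role": "user", "content": "Show me a sticky header in CSS and JavaScript."}],
    max_tokens=256,
    temperature=0.7,
)
print(resp.choices[0].message.content)

If the content here is already garbled, the serving path is fine and the problem is the 2-bit GPTQ weights/kernel; if it is empty, the issue is in the API layer.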
I'm going to close it for now, as it seems that it is not an issue with vLLM. Thanks for the input, though!
Your current environment
🐛 Describe the bug
I was trying out ChenMnZ/Mistral-Large-Instruct-2407-EfficientQAT-w2g64-GPTQ in vLLM. It does load completely, but when you post a message it keeps running as normal while never outputting anything.
When closing vLLM, it does output the error above. The model uses this quantization: https://github.com/OpenGVLab/EfficientQAT
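One way to narrow this down further (not from the original report, just a debugging sketch assuming vLLM 0.5.4 and the same 4-GPU setup) is to run the model through the offline LLM entrypoint, which skips the OpenAI server entirely:

from vllm import LLM, SamplingParams

# Offline generation with the same model and parallelism as the server run;
# if the text is already garbage here, the problem is the quantized
# checkpoint/kernel rather than the OpenAI-compatible serving layer.
llm = LLM(
    model="ChenMnZ/Mistral-Large-Instruct-2407-EfficientQAT-w2g64-GPTQ",
    tensor_parallel_size=4,
    max_model_len=12288,
    gpu_memory_utilization=0.93,
)

outputs = llm.generate(
    ["[INST] Show me a sticky header in CSS and JavaScript. [/INST]"],
    SamplingParams(temperature=0.7, max_tokens=128),
)
print(outputs[0].outputs[0].text)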
Hope this is workable. I've been experimenting a lot with vLLM and really love it! Thank you for your time.