vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: EfficientQAT GPTQ does load but does not output through the API #7300

Closed: derpyhue closed this 4 weeks ago

derpyhue commented 1 month ago

Your current environment

ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 232, in generate
    message = await socket.recv()
              ^^^^^^^^^^^^^^^^^^^
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 72, in app
    response = await func(request)
               ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 202, in create_chat_completion
    generator = await openai_serving_chat.create_chat_completion(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 189, in create_chat_completion
    return await self.chat_completion_full_generator(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 439, in chat_completion_full_generator
    async for res in result_generator:
  File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 398, in iterate_with_cancellation
    item = await awaits[0]
           ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 242, in generate
    await self.abort(request_id)
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 192, in abort
    await self._send_one_way_rpc_request(
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 105, in _send_one_way_rpc_request
    with self.socket() as socket:
  File "/usr/lib/python3.12/contextlib.py", line 137, in __enter__
    return next(self.gen)
           ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 62, in socket
    socket = self.context.socket(zmq.constants.DEALER)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/zmq/sugar/context.py", line 350, in socket
    raise ZMQError(Errno.ENOTSUP)
zmq.error.ZMQError: Operation not supported
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
an error occurred during closing of asynchronous generator <async_generator object AsyncEngineRPCClient.generate at 0x7deafd97a820>
asyncgen: <async_generator object AsyncEngineRPCClient.generate at 0x7deafd97a820>
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 239, in generate
    yield request_output
GeneratorExit

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 242, in generate
    await self.abort(request_id)
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 192, in abort
    await self._send_one_way_rpc_request(
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 105, in _send_one_way_rpc_request
    with self.socket() as socket:
  File "/usr/lib/python3.12/contextlib.py", line 137, in __enter__
    return next(self.gen)
           ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 62, in socket
    socket = self.context.socket(zmq.constants.DEALER)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/zmq/sugar/context.py", line 350, in socket
    raise ZMQError(Errno.ENOTSUP)
zmq.error.ZMQError: Operation not supported

šŸ› Describe the bug

I was trying out ChenMnZ/Mistral-Large-Instruct-2407-EfficientQAT-w2g64-GPTQ in vLLM, and it does load completely. But when you post a message, it keeps running like normal yet never outputs anything.

When shutting vLLM down, it does print the error above. The model uses this quantization: https://github.com/OpenGVLab/EfficientQAT
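
For reference, the request is just a plain OpenAI-style chat completion. A minimal sketch of the kind of call that hangs (assuming the openai Python client here; the prompt and port are the same ones that show up in the logs below):

# Minimal sketch: one chat completion against the vLLM OpenAI-compatible server.
# Assumes the openai>=1.0 Python client; any OpenAI-compatible client should behave the same.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's OpenAI-compatible endpoint
    api_key="EMPTY",                      # vLLM ignores the key unless --api-key is set
)

response = client.chat.completions.create(
    model="ChenMnZ/Mistral-Large-Instruct-2407-EfficientQAT-w2g64-GPTQ",
    messages=[{"role": "user", "content": "Show me a code snippet of a website's sticky header in CSS and JavaScript."}],
    temperature=0.7,
)
# With this model, the call never produces any visible output (the behaviour described above).
print(response.choices[0].message.content)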

I hope this is workable. I've been experimenting a lot with vLLM and really love it! Thank you for your time.

robertgshaw2-neuralmagic commented 1 month ago

Can you try running with --disable-frontend-multiprocessing? I just want to rule out that this is a ZeroMQ issue.

derpyhue commented 1 month ago

Woah that was fast. Will try it now!

derpyhue commented 1 month ago

It does fix the error, but the problem still persists, so I think the error was related to something else. https://github.com/OpenGVLab/EfficientQAT is quite new, so I can imagine it needs some more time to develop.

However, thanks for responding so fast!

For some more context, I'm running:

docker run --name vllm --runtime nvidia --gpus all -v ~/.cache/huggingface:/root/.cache/huggingface --env "HUGGING_FACE_HUB_TOKEN=" -p 8000:8000 --ipc=host vllm/vllm-openai --model ChenMnZ/Mistral-Large-Instruct-2407-EfficientQAT-w2g64-GPTQ --dtype auto --disable-custom-all-reduce --max-model-len 12288 --max-seq-len-to-capture 12288 -tp 4 --max_num_seqs 4 --use-v2-block-manager --gpu-memory-utilization 0.93 --swap-space 2 --enable-chunked-prefill --max_num_batched_tokens 256 --disable-frontend-multiprocessing

with 4 RTX 3060s.

robertgshaw2-neuralmagic commented 1 month ago

Can you post the stack trace with --disable-frontend-multiprocessing?

derpyhue commented 1 month ago

INFO 08-08 13:01:26 api_server.py:352] vLLM API server version 0.5.4
INFO 08-08 13:01:26 api_server.py:353] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=True, model='ChenMnZ/Mistral-Large-Instruct-2407-EfficientQAT-w2g64-GPTQ', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=12288, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=4, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=2, cpu_offload_gb=0, gpu_memory_utilization=0.93, num_gpu_blocks_override=None, max_num_batched_tokens=256, max_num_seqs=4, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=12288, disable_custom_all_reduce=True, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=True, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
WARNING 08-08 13:01:26 config.py:286] gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
WARNING 08-08 13:01:26 config.py:286] gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 08-08 13:01:26 config.py:762] Defaulting to use mp for distributed inference
INFO 08-08 13:01:26 config.py:853] Chunked prefill is enabled with max_num_batched_tokens=256.
INFO 08-08 13:01:26 llm_engine.py:176] Initializing an LLM engine (v0.5.4) with config: model='ChenMnZ/Mistral-Large-Instruct-2407-EfficientQAT-w2g64-GPTQ', speculative_config=None, tokenizer='ChenMnZ/Mistral-Large-Instruct-2407-EfficientQAT-w2g64-GPTQ', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=12288, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=gptq, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=ChenMnZ/Mistral-Large-Instruct-2407-EfficientQAT-w2g64-GPTQ, use_v2_block_manager=True, enable_prefix_caching=False)
WARNING 08-08 13:01:27 multiproc_gpu_executor.py:59] Reducing Torch parallelism from 6 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 08-08 13:01:27 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
(VllmWorkerProcess pid=36) INFO 08-08 13:01:27 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=35) INFO 08-08 13:01:27 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=37) INFO 08-08 13:01:27 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=35) INFO 08-08 13:01:28 utils.py:942] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=35) INFO 08-08 13:01:28 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=36) INFO 08-08 13:01:28 utils.py:942] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=37) INFO 08-08 13:01:28 utils.py:942] Found nccl from library libnccl.so.2
INFO 08-08 13:01:28 utils.py:942] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=36) INFO 08-08 13:01:28 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=37) INFO 08-08 13:01:28 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 08-08 13:01:28 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 08-08 13:01:28 shm_broadcast.py:235] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1, 2, 3], buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at 0x775fd6d78b30>, local_subscribe_port=35941, remote_subscribe_port=None)
INFO 08-08 13:01:28 model_runner.py:721] Starting to load model ChenMnZ/Mistral-Large-Instruct-2407-EfficientQAT-w2g64-GPTQ...
(VllmWorkerProcess pid=36) INFO 08-08 13:01:28 model_runner.py:721] Starting to load model ChenMnZ/Mistral-Large-Instruct-2407-EfficientQAT-w2g64-GPTQ...
(VllmWorkerProcess pid=35) INFO 08-08 13:01:28 model_runner.py:721] Starting to load model ChenMnZ/Mistral-Large-Instruct-2407-EfficientQAT-w2g64-GPTQ...
(VllmWorkerProcess pid=37) INFO 08-08 13:01:28 model_runner.py:721] Starting to load model ChenMnZ/Mistral-Large-Instruct-2407-EfficientQAT-w2g64-GPTQ...
INFO 08-08 13:01:29 weight_utils.py:231] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=35) INFO 08-08 13:01:29 weight_utils.py:231] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=37) INFO 08-08 13:01:29 weight_utils.py:231] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=36) INFO 08-08 13:01:29 weight_utils.py:231] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  20% Completed | 1/5 [00:06<00:27,  6.76s/it]
Loading safetensors checkpoint shards:  40% Completed | 2/5 [00:18<00:29,  9.74s/it]
Loading safetensors checkpoint shards:  60% Completed | 3/5 [00:28<00:19,  9.85s/it]
Loading safetensors checkpoint shards:  80% Completed | 4/5 [00:40<00:10, 10.71s/it]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:53<00:00, 11.40s/it]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:53<00:00, 10.64s/it]

INFO 08-08 13:02:23 model_runner.py:733] Loading model weights took 8.6544 GB
(VllmWorkerProcess pid=37) INFO 08-08 13:02:23 model_runner.py:733] Loading model weights took 8.6544 GB
(VllmWorkerProcess pid=36) INFO 08-08 13:02:23 model_runner.py:733] Loading model weights took 8.6544 GB
(VllmWorkerProcess pid=35) INFO 08-08 13:02:23 model_runner.py:733] Loading model weights took 8.6544 GB
INFO 08-08 13:02:27 distributed_gpu_executor.py:56] # GPU blocks: 1194, # CPU blocks: 1489
(VllmWorkerProcess pid=36) INFO 08-08 13:02:33 model_runner.py:1025] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(VllmWorkerProcess pid=36) INFO 08-08 13:02:33 model_runner.py:1029] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=35) INFO 08-08 13:02:33 model_runner.py:1025] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(VllmWorkerProcess pid=35) INFO 08-08 13:02:33 model_runner.py:1029] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=37) INFO 08-08 13:02:33 model_runner.py:1025] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 08-08 13:02:33 model_runner.py:1025] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 08-08 13:02:33 model_runner.py:1029] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=37) INFO 08-08 13:02:33 model_runner.py:1029] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 08-08 13:02:35 model_runner.py:1226] Graph capturing finished in 2 secs.
(VllmWorkerProcess pid=37) INFO 08-08 13:02:35 model_runner.py:1226] Graph capturing finished in 2 secs.
(VllmWorkerProcess pid=36) INFO 08-08 13:02:35 model_runner.py:1226] Graph capturing finished in 2 secs.
(VllmWorkerProcess pid=35) INFO 08-08 13:02:35 model_runner.py:1226] Graph capturing finished in 2 secs.
WARNING 08-08 13:02:35 serving_embedding.py:171] embedding_mode is False. Embedding API will not work.
INFO 08-08 13:02:35 launcher.py:14] Available routes are:
INFO 08-08 13:02:35 launcher.py:22] Route: /openapi.json, Methods: HEAD, GET
INFO 08-08 13:02:35 launcher.py:22] Route: /docs, Methods: HEAD, GET
INFO 08-08 13:02:35 launcher.py:22] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 08-08 13:02:35 launcher.py:22] Route: /redoc, Methods: HEAD, GET
INFO 08-08 13:02:35 launcher.py:22] Route: /health, Methods: GET
INFO 08-08 13:02:35 launcher.py:22] Route: /tokenize, Methods: POST
INFO 08-08 13:02:35 launcher.py:22] Route: /detokenize, Methods: POST
INFO 08-08 13:02:35 launcher.py:22] Route: /v1/models, Methods: GET
INFO 08-08 13:02:35 launcher.py:22] Route: /version, Methods: GET
INFO 08-08 13:02:35 launcher.py:22] Route: /v1/chat/completions, Methods: POST
INFO 08-08 13:02:35 launcher.py:22] Route: /v1/completions, Methods: POST
INFO 08-08 13:02:35 launcher.py:22] Route: /v1/embeddings, Methods: POST
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO:     172.17.0.1:38696 - "GET /v1/models HTTP/1.1" 200 OK
INFO 08-08 13:02:42 logger.py:36] Received request chat-3302c630be5f47a183541db925fdc83f: prompt: "<s>[INST] Show me a code snippet of a website's sticky header in CSS and JavaScript.[/INST]", params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=12266, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=8601), prompt_token_ids: [1, 3, 9378, 1296, 1032, 3464, 3270, 28351, 1070, 1032, 5168, 29510, 29481, 7674, 29492, 8503, 1065, 18690, 1072, 27049, 29491, 4], lora_request: None, prompt_adapter_request: None.
INFO:     172.17.0.1:38702 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 08-08 13:02:42 async_llm_engine.py:193] Added request chat-3302c630be5f47a183541db925fdc83f.
INFO 08-08 13:02:43 metrics.py:406] Avg prompt throughput: 2.9 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.2%, CPU KV cache usage: 0.0%.
INFO 08-08 13:02:48 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 22.7 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.8%, CPU KV cache usage: 0.0%.
INFO 08-08 13:02:53 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 22.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 1.3%, CPU KV cache usage: 0.0%.
INFO 08-08 13:02:58 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 22.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 1.9%, CPU KV cache usage: 0.0%.
INFO 08-08 13:02:59 logger.py:36] Received request chat-ac37d88686dd481a9f58b240c1e55f4e: prompt: "<s>[INST] Here is the query:\nShow me a code snippet of a website's sticky header in CSS and JavaScript.\n\nCreate a concise, 3-5 word phrase with an emoji as a title for the previous query. Suitable Emojis for the summary can be used to enhance understanding but avoid quotation marks or special formatting. RESPOND ONLY WITH THE TITLE TEXT.\n\nExamples of titles:\nšŸ“‰ Stock Market Trends\nšŸŖ Perfect Chocolate Chip Recipe\nEvolution of Music Streaming\nRemote Work Productivity Tips\nArtificial Intelligence in Healthcare\nšŸŽ® Video Game Development Insights[/INST]", params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=50, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=8601), prompt_token_ids: [1, 3, 4771, 1117, 1040, 6477, 29515, 781, 9166, 1296, 1032, 3464, 3270, 28351, 1070, 1032, 5168, 29510, 29481, 7674, 29492, 8503, 1065, 18690, 1072, 27049, 29491, 781, 781, 4766, 1032, 3846, 1632, 29493, 29473, 29538, 29501, 29550, 2475, 15572, 1163, 1164, 1645, 28581, 1158, 1032, 4709, 1122, 1040, 4222, 6477, 29491, 3442, 6147, 3697, 6822, 1046, 1122, 1040, 14828, 1309, 1115, 2075, 1066, 12744, 7167, 1330, 5229, 18296, 1120, 14959, 1210, 3609, 1989, 15526, 29491, 21076, 29521, 1600, 29525, 10456, 10648, 18742, 4567, 1088, 1921, 1948, 26543, 29491, 781, 781, 1734, 10642, 1070, 16541, 29515, 781, 1011, 930, 918, 908, 12316, 12411, 1088, 5850, 29481, 781, 1011, 930, 912, 941, 25211, 1457, 12727, 1457, 1276, 4291, 4753, 781, 9227, 2868, 1070, 8530, 16683, 1056, 781, 16246, 5834, 9836, 3342, 27174, 781, 11131, 15541, 23859, 1065, 7145, 8769, 781, 1011, 930, 913, 945, 13041, 9047, 11108, 10281, 3920, 4], lora_request: None, prompt_adapter_request: None.
INFO 08-08 13:02:59 async_llm_engine.py:193] Added request chat-ac37d88686dd481a9f58b240c1e55f4e.
INFO 08-08 13:02:59 async_llm_engine.py:204] Aborted request chat-3302c630be5f47a183541db925fdc83f.
INFO 08-08 13:03:02 async_llm_engine.py:160] Finished request chat-ac37d88686dd481a9f58b240c1e55f4e.
INFO:     172.17.0.1:52468 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 08-08 13:03:05 metrics.py:406] Avg prompt throughput: 19.8 tokens/s, Avg generation throughput: 10.7 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 08-08 13:03:15 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 08-08 13:03:25 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 08-08 13:03:35 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 08-08 13:03:45 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 08-08 13:03:55 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 08-08 13:04:05 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 08-08 13:04:12 launcher.py:45] Gracefully stopping http server
INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO 08-08 13:04:12 async_llm_engine.py:54] Engine is gracefully shutting down.
ERROR 08-08 13:04:12 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 36 died, exit code: -15

Very strange: everything seems in order, and it even outputs the message in the logs, but it does not deliver it to the frontend (Open WebUI). It does output garbage, though. However, Llama 3.1 70B AWQ 4-bit seems to do fine. I did see someone else saying 'The model runs and processes tokens, however there're some issues with serving those from OAI vLLM API - so no luck'.
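
For anyone hitting the same thing: streaming straight from the server (a rough sketch, again assuming the openai Python client) is a quick way to check whether any tokens come back over the API at all, bypassing Open WebUI:

# Sketch: stream tokens directly from the vLLM server, bypassing Open WebUI,
# to see whether any (possibly garbage) tokens actually arrive over the API.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="ChenMnZ/Mistral-Large-Instruct-2407-EfficientQAT-w2g64-GPTQ",
    messages=[{"role": "user", "content": "Say hello."}],
    stream=True,
    max_tokens=64,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()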

derpyhue commented 4 weeks ago

I'm going to close this for now, as it seems it is not an issue with vLLM. Thanks for the input though!