vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0
26.09k stars 3.82k forks source link

[Bug]: vLLM server crashes when `echo=True` and `max_tokens=0` #6521

Open gabrielhuang opened 1 month ago

gabrielhuang commented 1 month ago

Your current environment

vLLM server latest, as of July 17th 2024: vllm/vllm-openai:latest

🐛 Describe the bug

I'm trying to get the logprobability of the last token (Yes or No) of the prompt.

In my client, I set:

import requests

generation_args = {
    "prompt": "Your answer: Yes",
    "temperature": self.temperature,
    "top_p": self.top_p if self.do_sample else 1.0,
    "top_k": self.top_k if self.do_sample else -1,
    "best_of": self.best_of,
    "repetition_penalty": self.repetition_penalty if self.do_sample else 1.0,
    "max_tokens": 0, # no additional tokens to generate (using 1 to not crash vllm server?)
    "n": 1,
    "model": self.model_alias,
    "echo": True,    # include prompt in response, must be true for getting prompt logprobs, see https://github.com/vllm-project/vllm/blob/5f0b9933e63839e816b9736a65a3c55005df2cfe/vllm/entrypoints/openai/protocol.py#L634
    "logprobs": 1
}

url = f"{self.base_url}/v1/completions"
r = requests.post(url, json=generation_args

However, as soon as i query the vLLM server, it crashes with the following message inthe logs

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 44, in _log_task_completion
    return_value = task.result()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 610, in run_engine_loop
INFO 07-17 21:39:42 async_llm_engine.py:169] Aborted request cmpl-7e558e6ae69b4b389dfc33990c172649-0.
    result = task.result()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 553, in engine_step
    request_outputs = await self.engine.step_async(virtual_engine)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 242, in step_async
    output = await self.model_executor.execute_model_async(
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 173, in execute_model_async
    return await self._driver_execute_model_async(execute_model_req)
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 165, in _driver_execute_model_async
    return await self.driver_exec_model(execute_model_req)
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
INFO:     10.210.207.136:51370 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 271, in execute_model
    output = self.model_runner.execute_model(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1361, in execute_model
    output: SamplerOutput = self.model.sample(
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 407, in sample
    next_tokens = self.sampler(logits, sampling_metadata)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/sampler.py", line 112, in forward
    prompt_logprobs, sample_logprobs = _get_logprobs(
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/sampler.py", line 763, in _get_logprobs
    assert len(next_token_ids) == len(query_indices)
AssertionError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 56, in _log_task_completion
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for theactual cause.
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app
    response = await func(request)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 139, in create_completion
    generator = await openai_serving_completion.create_completion(
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_completion.py", line 190, in create_completion
    async for i, res in result_generator:
  File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 319, in consumer
    raise e
  File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 310, in consumer
    raise item
  File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 294, in producer
    async for item in iterator:
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 765, in generate
    async for output in self._process_request(
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 881, in _process_request
    raise e
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 877, in _process_request
    async for request_output in stream:
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 91, in __anext__
    raise result
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 44, in _log_task_completion
    return_value = task.result()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 610, in run_engine_loop
    result = task.result()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 553, in engine_step
    request_outputs = await self.engine.step_async(virtual_engine)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 242, in step_async
    output = await self.model_executor.execute_model_async(
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 173, in execute_model_async
    return await self._driver_execute_model_async(execute_model_req)
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 165, in _driver_execute_model_async
    return await self.driver_exec_model(execute_model_req)
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 271, in execute_model
    output = self.model_runner.execute_model(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1361, in execute_model
    output: SamplerOutput = self.model.sample(
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 407, in sample
    next_tokens = self.sampler(logits, sampling_metadata)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/sampler.py", line 112, in forward
    prompt_logprobs, sample_logprobs = _get_logprobs(
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/sampler.py", line 763, in _get_logprobs
    assert len(next_token_ids) == len(query_indices)
AssertionError
INFO 07-17 21:39:48 metrics.py:295] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 8.0%, CPU KV cache usage: 0.0%.
INFO:     10.210.208.13:43112 - "GET /health HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app
    response = await func(request)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 83, in health
    await openai_serving_chat.engine.check_health()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 943, in check_health
    raise AsyncEngineDeadError("Background loop is stopped.")
vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop is stopped.
gabrielhuang commented 1 month ago

In case it helps others, using max_tokens=1 solved the issue

DarkLight1337 commented 1 month ago

Actually let me reopen this since it appears that we're supposed to specifically allow max_tokens=0 when echo=True.

https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/openai/protocol.py#L399