Drzhivago264 opened 8 months ago
Hello, I found a small bug that took me a whole morning to deal with. If a request is accidentally sent with `top_k = 0.01` (a float where an int is expected), that request fails, which is expected. However, the failure crashes the whole server: every subsequent request, even with a correct `top_k`, is no longer served, and I have to restart the server manually.
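For reference, a minimal reproduction sketch. The endpoint, port, and payload shape are assumptions based on the default `vllm.entrypoints.api_server` setup, not taken verbatim from my deployment:

```python
# Hypothetical reproduction client; assumes the default api_server started with
#   python -m vllm.entrypoints.api_server --model <model>
# is listening on localhost:8000.
import requests

payload = {
    "prompt": "Hello, my name is",
    "max_tokens": 16,
    "top_k": 0.01,  # float where an int is expected -- this request fails
}
print(requests.post("http://localhost:8000/generate", json=payload).status_code)

# The engine's background loop is now dead (AsyncEngineDeadError), so even a
# well-formed request fails until the server process is restarted.
payload["top_k"] = 10
print(requests.post("http://localhost:8000/generate", json=payload).status_code)
```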
It is possible to validate the input data types on the client side and so on; however, it would be nice if the server, once it is in this dead state, signaled something to supervisord so it could be rebooted automatically (see the watchdog sketch after the tracebacks below).
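As a client-side stopgap, one can reject mistyped sampling parameters before they ever reach the server. A minimal sketch; the helper name and its rules are mine, not vLLM's (vLLM's own `SamplingParams` treats `-1` as "top_k disabled"):

```python
# Hypothetical client-side guard, not part of vLLM.
def sanitize_sampling_params(params: dict) -> dict:
    """Reject sampling params that would be silently mistyped."""
    out = dict(params)
    top_k = out.get("top_k")
    if top_k is not None:
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(top_k, bool) or not isinstance(top_k, int):
            raise ValueError(f"top_k must be an int, got {type(top_k).__name__}")
        if top_k < -1 or top_k == 0:
            raise ValueError("top_k must be -1 (disabled) or a positive int")
    return out

sanitize_sampling_params({"prompt": "hi", "top_k": 0.01})  # raises ValueError
```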
This happened in the following environment:
```
vllm: 0.3.3 and 0.3.2
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0
```
```
Traceback (most recent call last):
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 38, in _raise_exception_on_finish
    raise exc
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 33, in _raise_exception_on_finish
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 29, in _raise_exception_on_finish
    task.result()
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 414, in run_engine_loop
    has_requests_in_progress = await self.engine_step()
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 393, in engine_step
    request_outputs = await self.engine.step_async()
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 189, in step_async
    all_outputs = await self._run_workers_async(
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 276, in _run_workers_async
    all_outputs = await asyncio.gather(*coros)
  File "/usr/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/vllm/worker/worker.py", line 223, in execute_model
    output = self.model_runner.execute_model(seq_group_metadata_list,
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/vllm/worker/model_runner.py", line 571, in execute_model
    lora_mapping) = self.prepare_input_tensors(seq_group_metadata_list)
  File "/usr/local/lib/python3.8/dist-packages/vllm/worker/model_runner.py", line 490, in prepare_input_tensors
    lora_requests) = self._prepare_prompt(seq_group_metadata_list)
  File "/usr/local/lib/python3.8/dist-packages/vllm/worker/model_runner.py", line 208, in _prepare_prompt
    input_tokens = _make_tensor_with_pad(input_tokens,
  File "/usr/local/lib/python3.8/dist-packages/vllm/worker/model_runner.py", line 874, in _make_tensor_with_pad
    return torch.tensor(padded_x, dtype=dtype, device=device)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 412, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.8/dist-packages/uvicorn/middleware/proxy_headers.py", line 69, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/usr/local/lib/python3.8/dist-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.8/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.8/dist-packages/starlette/routing.py", line 758, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/routing.py", line 778, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/routing.py", line 299, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/routing.py", line 79, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.8/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.8/dist-packages/starlette/routing.py", line 74, in app
    response = await func(request)
  File "/usr/local/lib/python3.8/dist-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
  File "/usr/local/lib/python3.8/dist-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/usr/local/lib/python3.8/dist-packages/vllm/entrypoints/api_server.py", line 67, in generate
    async for request_output in results_generator:
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 577, in generate
    raise e
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 571, in generate
    async for request_output in stream:
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 69, in __anext__
    raise result
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 38, in _raise_exception_on_finish
    raise exc
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 33, in _raise_exception_on_finish
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
```
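Until the engine can recover or exit on its own, a small external watchdog can approximate the supervisord handoff requested above. This is a sketch under stated assumptions: the server runs as a supervisord program named `vllm`, and a cheap `/generate` probe is an acceptable liveness check (both names are mine, not from the report):

```python
# Hypothetical watchdog, not part of vLLM: restart the server via supervisord
# once the AsyncLLMEngine appears dead.
import subprocess
import time

import requests

URL = "http://localhost:8000/generate"       # assumed default api_server
PROBE = {"prompt": "ping", "max_tokens": 1}  # cheap liveness request

failures = 0
while True:
    try:
        ok = requests.post(URL, json=PROBE, timeout=30).status_code == 200
    except requests.RequestException:
        ok = False
    failures = 0 if ok else failures + 1
    if failures >= 3:  # several consecutive failures -> assume the engine died
        subprocess.run(["supervisorctl", "restart", "vllm"], check=False)
        failures = 0
    time.sleep(10)
```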