frittentheke opened this issue 3 months ago
This is a good idea.
Do you have capacity to implement it?
@robertgshaw2-neuralmagic Note that a previous PR for this functionality was rejected, see https://github.com/vllm-project/vllm/pull/1244.
I'm not sure if the goal of api_server is just "demo ware you should fork", as suggested in that PR, or if it should be "simple, but production-ready".
entrypoints/api_server is demoware. entrypoints/openai/api_server is production.

In any case, a simple wiring of /health AND /ready also for the demoware doesn't hurt. It does not have to do much in the sense of actually determining some ready state. But having those "stub" endpoints and handlers implemented is somewhat of an interface or blueprint for derived implementations to #put some code here#.
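To illustrate, here is a minimal sketch of what such stubs could look like, assuming a FastAPI app like the demoware server uses (the handler bodies are placeholders, not vLLM's actual code):

```python
# Minimal sketch, not vLLM's actual code: stub /health and /ready
# handlers on a FastAPI app, as a blueprint for derived servers.
from fastapi import FastAPI, Response

app = FastAPI()

@app.get("/health")
async def health() -> Response:
    # Liveness stub: reaching this handler proves the process serves HTTP.
    return Response(status_code=200)

@app.get("/ready")
async def ready() -> Response:
    # Readiness stub: derived implementations would
    # "put some code here" to check actual engine state.
    return Response(status_code=200)
```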
Getting to the openai API server:
If we agree that this feature should be implemented somehow, what would be a good source / starting point to determine that vLLM is ready to receive requests? To follow my initial thoughts: how do I determine whether the model is downloaded, loaded / imported into the GPU, and the KV cache is set up?
Coming to your question (https://github.com/vllm-project/vllm/issues/6073#issuecomment-2205941456) @robertgshaw2-neuralmagic, I might be able to come up with a PR, but would love some more discussion on the implementation and the "quality" of the reported status (be it vLLM's health or readiness). As for health, I'd like to expose as much of a proper health check as possible. Round-tripping a static API endpoint is nice, but does not go much beyond taking the existence of a running process as a health check. What about running out of CUDA memory or other issues at runtime? Do they always ultimately cause the process to crash / end itself? Or would some more internal checking help to see whether vLLM is actually still alive?
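As one hedged sketch of a "deeper" liveness check, continuing the stub sketch above: round-trip the engine rather than just the HTTP process. The engine handle and its check_health() method are assumptions here, not confirmed vLLM API:

```python
# Sketch of a liveness check that round-trips the engine instead of
# only proving the HTTP process exists. `app` is the FastAPI app from
# the stub sketch above; `engine` and its `check_health()` method are
# assumed interfaces, not confirmed vLLM API.
import asyncio

from fastapi import Response

@app.get("/health")
async def health() -> Response:
    try:
        # Bound the check so a wedged engine turns into a failed probe
        # instead of a hanging request.
        await asyncio.wait_for(engine.check_health(), timeout=5.0)
    except Exception:
        # Covers timeouts as well as engine-side errors (e.g. dead workers).
        return Response(status_code=500)
    return Response(status_code=200)
```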
Regarding your question "So to follow my initial thoughts, how do I determine if the model is downloaded and loaded / imported into the GPU, KV is setup?": in PR https://github.com/vllm-project/vllm/pull/7078 I have tried to use the model_memory_usage variable of the model_runner object to determine readiness once the model weights are loaded into GPU memory.
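For context, a rough sketch of that idea, continuing the sketches above; the attribute path and its semantics are how I read the PR and should be treated as assumptions:

```python
# Sketch of a /ready handler keyed off loaded model weights, following
# the idea from PR #7078. `app` and `engine` are the assumed objects
# from the sketches above; the attribute path is an assumption.
from fastapi import Response

@app.get("/ready")
async def ready() -> Response:
    # model_memory_usage is assumed to be set once the weights have been
    # loaded onto the GPU; before that point we report "not ready".
    runner = getattr(engine, "model_runner", None)
    if runner is not None and getattr(runner, "model_memory_usage", 0) > 0:
        return Response(status_code=200)
    return Response(status_code=503)
```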
any updates on this? would love to see this go through!
🚀 The feature, motivation and pitch
I am running vLLM instances on Kubernetes, as likely are others. Currently there is only the /health endpoint: https://github.com/vllm-project/vllm/blob/15aba081f33e6d048422df6dcdb94301d08d13e6/vllm/entrypoints/openai/api_server.py#L88

When defining health checks for workloads on Kubernetes there are liveness and readiness probes (https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes). While the liveness check determines whether a process is still alive (extending on taking only the sheer existence of a process as health indication), the readiness check is used to determine whether a pod is ready to receive requests.
The first issue with the health endpoint in the case of vLLM is that it only becomes available and returns HTTP 200 after the API server has started up. If you look at the example in the additional context section, this is ~16 seconds after vLLM was spawned, even without a model download and with a relatively small model using only one small GPU.
A liveness check is hard to configure properly when the time it takes for a service to start providing the corresponding endpoint is unknown. If the timeouts are too tight (or the initialDelaySeconds are not high enough), the container will be restarted by the Kubelet. The Kubernetes way of dealing with this is the so-called startup probe (https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-startup-probes). This is a probe that can have different intervals and timeouts and is used to cover the initial startup phase of a process. While this works somewhat, in practice there still are timeouts to define, i.e. worst cases to consider, for vLLM pods to start up and then move to their next lifecycle phase in which the other checks become active.

My feature request therefore is twofold:
1. Introduce a /ready endpoint which indicates that vLLM is ready for requests. This avoids having to pick any particular timeouts and allows Kubernetes to natively determine when vLLM is ready. Since this check is then done continuously, vLLM can also indicate at runtime when it's not ready anymore, e.g. in case of full queues or other tasks.
2. Return the /health as early as possible and don't have it be influenced by a lengthy startup phase (such as larger model downloads, loading tensors into GPUs, ...). This enables Kubernetes to properly determine whether vLLM is alive or needs a respawn.
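A minimal sketch of how the two requests could fit together, with the HTTP server answering /health from process start and /ready flipping once engine initialization finishes; all names here (in particular init_engine) are illustrative placeholders, not vLLM's actual startup code:

```python
# Illustrative sketch: serve /health from process start, flip /ready
# only after the (slow) engine initialization finishes. `init_engine`
# is a placeholder, not vLLM's real startup code.
import asyncio

from fastapi import FastAPI, Response

app = FastAPI()
engine_ready = asyncio.Event()

@app.get("/health")
async def health() -> Response:
    # Liveness: available as soon as the HTTP server runs, unaffected
    # by model downloads or weight loading.
    return Response(status_code=200)

@app.get("/ready")
async def ready() -> Response:
    # Readiness: 503 until startup completed, 200 afterwards.
    return Response(status_code=200 if engine_ready.is_set() else 503)

@app.on_event("startup")
async def start_engine() -> None:
    async def _init() -> None:
        await init_engine()  # placeholder: model download + weight loading
        engine_ready.set()
    # Run initialization in the background so the probes respond meanwhile.
    asyncio.create_task(_init())
```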
Alternatives

No response
Additional context