vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Feature]: Add readiness endpoint /ready and return /health earlier (vLLM on Kubernetes) #6073

Open frittentheke opened 3 months ago

frittentheke commented 3 months ago

🚀 The feature, motivation and pitch

I am running vLLM instances on Kubernetes, as likely are others. Currently there is only the /health endpoint https://github.com/vllm-project/vllm/blob/15aba081f33e6d048422df6dcdb94301d08d13e6/vllm/entrypoints/openai/api_server.py#L88

When defining health checks for workloads on Kubernetes there are liveness and readiness probes (https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes). While the liveness check determines whether a process is still alive (going beyond taking the sheer existence of a process as a health indication), the readiness check is used to determine whether a pod is ready to receive requests.

The first issue with the health endpoint in the case of vLLM is that it only becomes available and returns HTTP 200 after the API server has started up. If you look at the example in the additional context section, this is ~16 seconds after vLLM was spawned, even without a model download and with a relatively small model on a single small GPU.

A liveness check is hard to configure properly when the time a service needs before it serves the corresponding endpoint is unknown. If the timeouts are too tight (or initialDelaySeconds is not high enough), the container will be restarted by the kubelet. The Kubernetes way of dealing with this is the so-called startup probe (https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-startup-probes). This is a probe that can have different intervals and timeouts and is used to cover the initial startup phase of a process. While this works somewhat, in practice there are still timeouts to define, i.e. worst cases to estimate for vLLM pods to start up, before they move to the next lifecycle phase in which the other probes become active.

My feature request therefore is twofold:

  1. Introduce a /ready endpoint which indicates vLLM is ready for requests (a minimal sketch follows this list). This avoids hard-coding any particular timeouts and lets Kubernetes natively determine when vLLM is ready. Since this check is done continuously, vLLM can also indicate at runtime when it is no longer ready, e.g. in case of full queues or other conditions.

  2. Have /health respond as early as possible and not be influenced by a lengthy startup phase (such as larger model downloads, loading tensors onto GPUs, ...). This enables Kubernetes to properly determine whether vLLM is alive or needs a respawn.
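
As a rough illustration of item 1 (hypothetical code only, not vLLM's actual api_server.py; the engine_ready flag is an assumed signal set elsewhere once the engine has finished loading), the wiring could be as small as:

    # Sketch: /health answers as soon as the HTTP server is up,
    # /ready answers 200 only once the engine has finished loading.
    import asyncio

    from fastapi import FastAPI
    from fastapi.responses import Response

    app = FastAPI()
    engine_ready = asyncio.Event()  # assumed: set once model loading completes

    @app.get("/health")
    async def health() -> Response:
        # Liveness: the process and its event loop are responsive.
        return Response(status_code=200)

    @app.get("/ready")
    async def ready() -> Response:
        # Readiness: 200 only while the engine can actually serve requests.
        return Response(status_code=200 if engine_ready.is_set() else 503)

A Kubernetes readinessProbe could then simply point at /ready, without any startup-time guesswork in the probe configuration.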

Alternatives

No response

Additional context

INFO 07-02 15:27:33 api_server.py:177] vLLM API server version 0.5.0.post1                                                                                                                                                                                                               
INFO 07-02 15:27:33 api_server.py:178] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, chat_template=None, response_role='assistant', ssl_
/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.               
  warnings.warn(                                                                                                                                                                                                                                                                         
INFO 07-02 15:27:34 config.py:1197] Casting torch.float32 to torch.float16.                                                                                                                                                                                                              
INFO 07-02 15:27:34 config.py:1218] Downcasting torch.float32 to torch.float16.                                                                                                                                                                                                          
INFO 07-02 15:27:34 llm_engine.py:161] Initializing an LLM engine (v0.5.0.post1) with config: model='REDACTED', speculative_config=None, tokenizer='REDACTED', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.                                                                                                                                                                    
INFO 07-02 15:27:37 weight_utils.py:218] Using model weights format ['*.safetensors']                                                                                                                                                                                                    
INFO 07-02 15:27:41 model_runner.py:160] Loading model weights took 123.4567 GB                                                                                                                                                                                                           
INFO 07-02 15:27:42 gpu_executor.py:83] # GPU blocks: 1234, # CPU blocks: 512                                                                                                                                                                                                            
INFO 07-02 15:27:43 model_runner.py:889] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.                                         
INFO 07-02 15:27:43 model_runner.py:893] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.      
INFO 07-02 15:27:49 model_runner.py:965] Graph capturing finished in 6 secs.                                                                                                                                                                                                             
[...]                                                                                                                                                                              
INFO:     Started server process [1]                                                                                                                                                                                                                                                     
INFO:     Waiting for application startup.                                                                                                                                                                                                                                               
INFO:     Application startup complete.                                                                                                                                                                                                                                                  
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)                                                                                                                                                                                                                  
INFO:     127.0.0.1:5000 - "GET /health HTTP/1.1" 200 OK                                                                                   
robertgshaw2-neuralmagic commented 3 months ago

This is a good idea.

Do you have capacity to implement it?

richarddli commented 2 months ago

@robertgshaw2-neuralmagic Note that a previous PR for this functionality was rejected, see https://github.com/vllm-project/vllm/pull/1244.

I'm not sure if the goal of api_server is just "demo ware you should fork", as suggested in that PR, or if it should be "simple, but production-ready".

robertgshaw2-neuralmagic commented 2 months ago
frittentheke commented 2 months ago

In any case, a simple wiring of /health AND /ready, even for the demoware, doesn't hurt. It does not have to do much in the sense of actually determining some ready state. But having those "stub" endpoints and handlers implemented provides an interface or blueprint for derived implementations to #put some code here#.

Getting to the openai API server:

If we agree that this feature should be implemented somehow, what would be a good source / starting point to determine that vLLM is ready to receive requests? To follow my initial thoughts: how do I determine that the model is downloaded and loaded / imported into the GPU, and that the KV cache is set up?

Coming to your question (https://github.com/vllm-project/vllm/issues/6073#issuecomment-2205941456) @robertgshaw2-neuralmagic, I might be able to come up with a PR, but would love some more discussion on the implementation and "quality" of the reported status (be it vLLM's health or readiness). As for health, I'd like to expose as proper a health check as possible. Round-tripping a static API endpoint is nice, but does not go much beyond taking the existence of a running process as a health check. What about running out of CUDA memory or other issues at runtime? Do they always ultimately cause the process to crash / end itself? Or would some more internal checking help to see whether vLLM is actually still alive?
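
One possible shape for such deeper internal checking, purely as a sketch (probe_engine is a hypothetical stand-in for whatever engine handle the real server holds, not an existing vLLM call):

    # Sketch: instead of returning a static 200, round-trip a tiny piece of
    # work through the engine with a timeout, so a wedged GPU worker
    # eventually turns the liveness probe red.
    import asyncio

    from fastapi import FastAPI
    from fastapi.responses import Response

    app = FastAPI()

    async def probe_engine() -> None:
        # Hypothetical placeholder: exercise the engine end to end, e.g.
        # tokenize a short string or generate a single token.
        await asyncio.sleep(0)

    @app.get("/health")
    async def health() -> Response:
        try:
            await asyncio.wait_for(probe_engine(), timeout=5.0)
        except Exception:
            return Response(status_code=503)
        return Response(status_code=200)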

mfournioux commented 2 months ago

Regarding your question on how to determine that the model is downloaded and loaded into the GPU and the KV cache is set up: in https://github.com/vllm-project/vllm/pull/7078 I have tried to use the "model_memory_usage" variable of the model_runner object to determine readiness once the model weights are loaded into GPU memory.
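
A rough sketch of that gating idea (assumed names only, not the actual PR code): a small watcher polls the load signal and flips a readiness event that a /ready handler can check:

    # Sketch: poll an assumed load signal (e.g. the model_memory_usage value
    # mentioned above becoming non-zero) and then mark the server as ready.
    import asyncio

    engine_ready = asyncio.Event()

    async def watch_model_load(get_model_memory_usage) -> None:
        # get_model_memory_usage is a hypothetical callable returning the
        # amount of model weights currently resident in GPU memory.
        while get_model_memory_usage() <= 0:
            await asyncio.sleep(1.0)
        engine_ready.set()  # from here on, /ready can return 200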

epark001 commented 1 week ago

any updates on this? would love to see this go through!