vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: Get time statistics with each request #4683

Open arunpatala opened 4 months ago

arunpatala commented 4 months ago

I would like to know if there is a way to get usage statistics with each request (perhaps gated by a flag parameter):

I would like queue wait time, num_prompt_tokens, num_generated_tokens, time for the prefill stage, time for the decode stage, etc. to be returned with each request.
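For concreteness, something along these lines is what I have in mind (the field names below are purely illustrative, not existing vLLM attributes):

```python
# Purely illustrative shape for per-request statistics; none of these
# field names are existing vLLM attributes.
from dataclasses import dataclass


@dataclass
class PerRequestStats:
    num_prompt_tokens: int
    num_generated_tokens: int
    queue_wait_time_s: float   # time spent waiting to be scheduled
    prefill_time_s: float      # time to process the prompt
    decode_time_s: float       # time spent generating tokens
```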

If this doesn't already exist, please point me to where I could add such a feature.

Thanks

simon-mo commented 4 months ago

This would indeed be useful. Ideally we should add it to both the offline LLM inference API (as part of RequestOutput) and the online API server (through response headers).
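For the online side, a minimal sketch of the headers idea, assuming a FastAPI/Starlette app like the OpenAI-compatible server (the header name is hypothetical, and only end-to-end wall time is measured here):

```python
# Hedged sketch only: surfacing timing as response headers via middleware.
# The header name is hypothetical; engine-level numbers (queue wait,
# prefill/decode split) would still need to be propagated from the engine.
import time

from fastapi import FastAPI, Request

app = FastAPI()


@app.middleware("http")
async def add_timing_headers(request: Request, call_next):
    start = time.monotonic()
    response = await call_next(request)
    # Only the end-to-end wall time is measurable at this layer.
    response.headers["X-Request-Duration-Seconds"] = f"{time.monotonic() - start:.4f}"
    return response
```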

I would recommend looking at the metrics code path in LLMEngine:

https://github.com/vllm-project/vllm/blob/f6a593093ac201c286e99a849091801a88d83622/vllm/engine/llm_engine.py#L525
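The idea behind that code path, roughly, is to record wall-clock timestamps as a request moves through the engine and derive the intervals when it finishes. A minimal sketch (class and field names are illustrative, not vLLM's actual implementation):

```python
# Illustrative sketch: record timestamps per request, derive intervals later.
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RequestTimestamps:
    arrival_time: float = field(default_factory=time.time)
    first_scheduled_time: Optional[float] = None   # left the waiting queue
    first_token_time: Optional[float] = None       # prefill finished
    finished_time: Optional[float] = None          # last token emitted

    def time_in_queue(self) -> Optional[float]:
        if self.first_scheduled_time is None:
            return None
        return self.first_scheduled_time - self.arrival_time

    def prefill_time(self) -> Optional[float]:
        if self.first_scheduled_time is None or self.first_token_time is None:
            return None
        return self.first_token_time - self.first_scheduled_time

    def decode_time(self) -> Optional[float]:
        if self.first_token_time is None or self.finished_time is None:
            return None
        return self.finished_time - self.first_token_time
```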

The ideal place to store this information would be inside RequestOutput.
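Assuming such a metrics object were attached to RequestOutput, offline usage could then look roughly like this (the `metrics` attribute and its fields are an assumption here, not a documented guarantee):

```python
# Hedged usage sketch: reading per-request metrics from RequestOutput,
# assuming a `metrics` attribute with timestamp fields exists.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))

for out in outputs:
    m = getattr(out, "metrics", None)  # assumed attribute
    if m is not None:
        print(out.request_id,
              "queued:", m.time_in_queue,      # assumed field
              "arrived:", m.arrival_time,      # assumed field
              "finished:", m.finished_time)    # assumed field
```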

arunpatala commented 4 months ago

Thanks. I will have a look and try to understand how to add the metrics.