arunpatala opened this issue 4 months ago
This can be useful indeed. Ideally we should add it to both the LLM offline inference API (as part of RequestOutput) and the online API server (through headers).
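For example, something roughly like this could be exposed (all names below are only a sketch, not existing vLLM APIs):

```python
# Rough sketch only -- `UsageStats` and all field/header names below are
# hypothetical, not part of the existing vLLM API.
from dataclasses import dataclass


@dataclass
class UsageStats:
    num_prompt_tokens: int
    num_generated_tokens: int
    queue_time: float    # seconds spent waiting before being scheduled
    prefill_time: float  # seconds spent in the prefill stage
    decode_time: float   # seconds spent in the decoding stage


# Offline API: attach the stats to each RequestOutput, e.g. output.usage -> UsageStats(...)
# Online API server: surface the same numbers as response headers, e.g.
#   X-Usage-Prompt-Tokens: 123
#   X-Usage-Generated-Tokens: 456
#   X-Usage-Queue-Time: 0.012
```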
I would recommend looking at the metrics code path in LLMEngine. The ideal place to store this information would be inside RequestOutput.
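Roughly, the bookkeeping would look like this (function and attribute names here are just illustrative; the real timestamps live in the engine's metrics/stats code):

```python
from dataclasses import dataclass
from typing import Optional


# Illustrative only -- these names are not real vLLM internals.
@dataclass
class RequestTimestamps:
    arrival_time: float                           # request received by the engine
    first_scheduled_time: Optional[float] = None  # left the waiting queue
    first_token_time: Optional[float] = None      # first output token produced
    finished_time: Optional[float] = None         # request completed


def derive_stats(ts: RequestTimestamps) -> dict:
    """Turn raw timestamps into per-request durations (assumes the request finished)."""
    return {
        "queue_time": ts.first_scheduled_time - ts.arrival_time,
        "prefill_time": ts.first_token_time - ts.first_scheduled_time,
        "decode_time": ts.finished_time - ts.first_token_time,
    }


# The engine would stamp these points (e.g. with time.monotonic()) as each
# request moves from waiting -> running -> finished, and copy the derived
# stats onto the RequestOutput when it is built.
```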
Thanks. I will have a look and try to understand how to add the metrics.
I would like to know if there is a way to get usage statistics with each request (maybe via a flag parameter).
Specifically, I would like queue wait time, num_prompt_tokens, num_generated_tokens, time for the prefill stage, time for the decoding stage, etc. returned with each request.
If this doesn't already exist, please point me to how I can add such a feature.
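For concreteness, something like this is what I have in mind (the flag and the `usage` attribute are purely hypothetical, not existing APIs):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))

for out in outputs:
    # Hypothetical per-request statistics I would like returned, perhaps
    # only when enabled via a flag:
    #   out.usage.queue_time
    #   out.usage.prefill_time
    #   out.usage.decode_time
    #   out.usage.num_prompt_tokens
    #   out.usage.num_generated_tokens
    print(out.outputs[0].text)
```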
Thanks