ray-project / ray-llm

RayLLM - LLMs on Ray
https://aviary.anyscale.com
Apache License 2.0

LLM Deployment Observability #90

Open roelschr opened 10 months ago

roelschr commented 10 months ago

I assume that, because RayLLM runs on top of Ray Serve, I can follow these steps to get observability for LLM deployments (KubeRay).

But how can we get custom metrics that are specific to LLMs, like the ones suggested by the Ray team itself (https://www.anyscale.com/blog/reproducible-performance-metrics-for-llm-inference#benchmarking-results-for-per-token-llm-products)?

akshay-anyscale commented 9 months ago

Yes, you should be able to set up observability using the general Ray Serve guides.

For the custom metrics, you can search for metric names starting with "ray_aviary" (e.g. if you're using Grafana); most of the metrics used in the blog post are available.
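As a starting point, here is a sketch of what that looks like in PromQL, using only the metric names mentioned in this thread; `$__rate_interval` is a Grafana dashboard variable, and exact names/labels may differ by RayLLM version:

```promql
# Browse everything the RayLLM deployment exposes (Prometheus UI or Grafana Explore)
{__name__=~"ray_aviary_.*"}

# Example: average per-token streaming latency over the dashboard window
rate(ray_aviary_router_get_response_stream_per_token_latency_ms_sum[$__rate_interval])
  / rate(ray_aviary_router_get_response_stream_per_token_latency_ms_count[$__rate_interval])
```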

roelschr commented 9 months ago

Thanks @akshay-anyscale, I found all of them very useful!

But I'm seeing a discrepancy between the times reported by the ray-llm Prometheus metrics and by llmperf. While evaluating a llama2-7b deployment, llmperf reports an inter-token latency (ITL) of 36.62 ms/token, but in Grafana the average, computed as `rate(ray_aviary_router_get_response_stream_per_token_latency_ms_sum[$__rate_interval]) / rate(ray_aviary_router_get_response_stream_per_token_latency_ms_count[$__rate_interval])`, is around 101 ms.

~I suspect per_token_latency includes the time for the first token. Do you have any idea why only this metric seems different?~ I see the token_latency clock being reset after the first token here. But I wonder whether the processing time spent while yielding the first token is what's causing this inconsistency. In any case, I've checked the timing with other tools besides llmperf and they all agree at around 35 ms.

akshay-anyscale commented 9 months ago

Hi @roelschr, I believe this is because we do some batching (up to 100 ms) to make the streaming more efficient. If you use "ray_aviary_tokens_generated" as the denominator instead, the result should be closer to the llmperf value, though it will be off by one token per request because of the first token.
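Something like the following should work as a starting point in Grafana; this assumes `ray_aviary_tokens_generated` is exported as a counter alongside the latency metrics, and the exact metric/label names may vary by version:

```promql
# Approximate inter-token latency: total streaming latency divided by tokens generated.
# Slightly off (one token per request) because the first, time-to-first-token chunk
# is counted in the denominator but its latency is tracked separately.
rate(ray_aviary_router_get_response_stream_per_token_latency_ms_sum[$__rate_interval])
  / rate(ray_aviary_tokens_generated[$__rate_interval])
```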