Open roelschr opened 1 year ago
Yes, you should be able to set up observability using the general Ray Serve guides.
For the custom metrics, you can search for the `ray_aviary` prefix (e.g. if you're using Grafana); we have most of the metrics used in the blog posts available.
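As a rough sketch, in Grafana's Explore view something like the queries below should get you started; the exact series names can vary between RayLLM versions, so check what your Prometheus instance actually exposes.

```promql
# Discover every series exported under the ray_aviary prefix
{__name__=~"ray_aviary_.+"}

# Example rate query once you know a metric name, e.g. tokens generated per second
# (name assumed; adjust to whatever the selector above returns)
rate(ray_aviary_tokens_generated[$__rate_interval])
```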
Thanks @akshay-anyscale, found all of them very useful!

But I'm seeing some discrepancy between the times reported by the ray-llm Prometheus metrics and by `llmperf`. While evaluating a llama2-7b deployment I see `llmperf` reporting `ITL (out): 36.62 ms/token`, but in Grafana the average (computed with `rate(ray_aviary_router_get_response_stream_per_token_latency_ms_sum[$__rate_interval]) / rate(ray_aviary_router_get_response_stream_per_token_latency_ms_count[$__rate_interval])`) is around 101 ms.

~I suspect `per_token_latency` includes the time to the first token. Do you have any idea why only this metric seems different?~ I see the `token_latency` clock being reset after the first token here. But I wonder whether the processing time spent while yielding the first token is what is causing this inconsistency. In any case, I've checked the time with other tools besides `llmperf` and they all match at around 35 ms.
Hi @roelschr, I believe this is because we do some batching (up to 100 ms) to make the streaming more efficient. If you make the denominator `ray_aviary_tokens_generated` instead, this should be closer to the llmperf value; the denominator would be off by one, though, because of the first token.
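Roughly something like this (exact metric names may differ in your deployment, so double-check against what Prometheus exposes):

```promql
# Per-token latency with tokens generated as the denominator; this sidesteps the
# stream-batching effect, but is off by one token per request (the first token).
rate(ray_aviary_router_get_response_stream_per_token_latency_ms_sum[$__rate_interval])
  / rate(ray_aviary_tokens_generated[$__rate_interval])
```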
I assume, because RayLLM runs on top of Ray Serve, I can follow these steps to get observability for LLM deployments (KubeRay).

But how can we get custom metrics that are specific to LLMs, like the ones suggested by the Ray team itself: https://www.anyscale.com/blog/reproducible-performance-metrics-for-llm-inference#benchmarking-results-for-per-token-llm-products