ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[serve] High cardinality for metrics that include the HTTP route #47999

Open edoakes opened 1 week ago

edoakes commented 1 week ago

In some of our metrics, we include the HTTP route as a tag. If users include high-cardinality data in their HTTP request paths, such as a per-user ID, this blows up the Prometheus metrics (and can render them unusable).

We need to reduce the cardinality here, perhaps by only exporting the Serve-level route_prefix instead of the full route.
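As a rough illustration of what tagging by route_prefix instead of the full route could look like, here is a minimal sketch. The function name `route_prefix_tag` and the `fallback` parameter are hypothetical and not part of Serve's API; it only assumes that each application is identified by a path prefix and that the longest matching prefix wins.

```python
from typing import Iterable


def route_prefix_tag(path: str, route_prefixes: Iterable[str], fallback: str = "") -> str:
    """Return the longest configured prefix that contains `path`.

    Hypothetical helper for illustration only; not Serve's implementation.
    """
    best = None
    for prefix in route_prefixes:
        base = prefix.rstrip("/")  # "/" becomes "", "/app/" becomes "/app"
        # Match on whole path segments so "/app" does not match "/apple".
        if path == base or path.startswith(base + "/"):
            if best is None or len(prefix) > len(best):
                best = prefix
    return best if best is not None else fallback


# Every per-user path collapses to the same tag value.
prefixes = ["/", "/app"]
tags = {route_prefix_tag(f"/app/users/{user_id}", prefixes) for user_id in range(1000)}
```

With this approach, `tags` contains a single value (`/app`) regardless of how many distinct user IDs appear in the request paths.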

antoniomdk commented 1 week ago

+1 to this. I think apart from worsening the UX, for users that ingest the metrics into other providers rather than hosting a Prometheus server, high-cardinality metrics can blow up costs, e.g. https://docs.datadoghq.com/account_management/billing/custom_metrics/?tab=countrate

edoakes commented 5 days ago

Ok, I did some prototyping here. We have metrics containing the route in two places: the proxy and the replica.

In the proxy, we don't have access to application-defined routes (by design) so we can't do anything too clever. We could try to do something like auto-detect the cardinality and cap the number of tags, but that seems excessively complex.

In the replica, we do have access to the underlying ASGI app, which we can use to identify the matched route string (e.g., /path/{wildcard}).
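To make the replica-side idea concrete, here is a minimal standard-library sketch of tagging a request with its matched route template rather than the raw path. The helpers `compile_template` and `route_tag` are hypothetical names for illustration; a real ASGI framework would do the matching with its own router, and this sketch only assumes simple `{param}` placeholders that match a single path segment.

```python
import re
from typing import Iterable


def compile_template(template: str) -> "re.Pattern[str]":
    """Turn a template such as "/path/{wildcard}" into a regex (illustrative only)."""
    # Split on "{param}" placeholders and escape the literal pieces.
    literal_parts = re.split(r"\{[^{}/]+\}", template)
    # Each placeholder matches exactly one non-empty path segment.
    pattern = "[^/]+".join(re.escape(part) for part in literal_parts)
    return re.compile(f"^{pattern}$")


def route_tag(path: str, templates: Iterable[str], fallback: str) -> str:
    """Return the first template matching `path`, else `fallback` (e.g. the route_prefix)."""
    for template in templates:
        if compile_template(template).match(path):
            return template
    return fallback


templates = ["/health", "/users/{user_id}"]
tags = {route_tag(f"/users/{user_id}", templates, fallback="/") for user_id in range(1000)}
```

Here `tags` holds only `/users/{user_id}`, so the metric gets one tag value per declared route instead of one per user. Paths that match no template fall back to a coarser value, which keeps the cardinality bounded by the number of declared routes.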

So I'd propose that we:

We could consider changing the metric tag for the proxy metrics to route_prefix for clarity, but that introduces a migration for what seems to me like a very minor improvement.