Rainbow request latency and size metrics cannot be aggregated

ipfs / rainbow

A specialized IPFS HTTP gateway

https://docs.ipfs.tech/reference/http/gateway/

Other

66 stars 11 forks source link

Rainbow request latency and size metrics cannot be aggregated #107

Closed 2color closed 3 months ago

2color commented 4 months ago

Problem

We instrument the code with Prometheus Summaries to measure:

duration/latency of requests
request and response byte size

https://github.com/ipfs/rainbow/blob/6832d41820c1793e273e34b3eee3197dd23598b4/metrics.go#L79-L92

Unfortunately, summaries cannot be aggregated which means we cannot calculate quantile calculations for all rainbow instances in the public waterworks infrastructure.

Suggested solution

Switch these over to use Histograms

https://www.robustperception.io/how-does-a-prometheus-histogram-work/ https://prometheus.io/docs/practices/histograms/

2color commented 4 months ago

I just discovered there's already a bunch of histograms for HTTP requests in Boxo

The only thing missing from them is the status code which would allow us to look at aggregate request latencies grouped by HTTP response code.

lidel commented 4 months ago

I think the boxo ones are not the best place, because afaik we update them only on success (example). It might be way too involved to try to add status code support to them, and also updating on error may change semantic meaning and the way people used them.

As a rule of thumb, we don't want to break monitoring people have set up with existing metrics.

Been a while I looked into metrics, but perhaps add new histograms here (in rainbow), under new names (e.g. request_size_bytes → hist_request_size_bytes) and keep old ones (request_size_bytes) as deprecated at least for one release, allowing users to migrate? cc @hacdias if any concerns

2color commented 4 months ago

I think the boxo ones are not the best place, because afaik we update them only on success (example). It might be way too involved to try to add status code support to them, and also updating on error may change semantic meaning and the way people used them.

Yeah I also just noticed that the ipfs_http_gw_get_duration_seconds metric you mentioned is indeed only for successful responses.

Sounds good to me to add new histograms here and remove summaries in a couple of releases.

It's worth noting that when Prometheus scrapes all metrics by default this will result in quite a few more metrics which can be heavy on Prometheus. But given the number of metrics we expose I don't think this will make a massive difference. 🤷‍♂️