Closed 2color closed 3 months ago
I just discovered there's already a bunch of histograms for HTTP requests in Boxo
The only thing missing from them is the status code which would allow us to look at aggregate request latencies grouped by HTTP response code.
I think the boxo ones are not the best place, because afaik we update them only on success (example). It might be way too involved to try to add status code support to them, and also updating on error may change semantic meaning and the way people used them.
As a rule of thumb, we don't want to break monitoring people have set up with existing metrics.
Been a while I looked into metrics, but perhaps add new histograms here (in rainbow), under new names (e.g. request_size_bytes
→ hist_request_size_bytes
) and keep old ones (request_size_bytes
) as deprecated at least for one release, allowing users to migrate? cc @hacdias if any concerns
I think the boxo ones are not the best place, because afaik we update them only on success (example). It might be way too involved to try to add status code support to them, and also updating on error may change semantic meaning and the way people used them.
Yeah I also just noticed that the ipfs_http_gw_get_duration_seconds
metric you mentioned is indeed only for successful responses.
Sounds good to me to add new histograms here and remove summaries in a couple of releases.
It's worth noting that when Prometheus scrapes all metrics by default this will result in quite a few more metrics which can be heavy on Prometheus. But given the number of metrics we expose I don't think this will make a massive difference. 🤷♂️
Problem
We instrument the code with Prometheus Summaries to measure:
https://github.com/ipfs/rainbow/blob/6832d41820c1793e273e34b3eee3197dd23598b4/metrics.go#L79-L92
Unfortunately, summaries cannot be aggregated which means we cannot calculate quantile calculations for all rainbow instances in the public waterworks infrastructure.
Suggested solution
Switch these over to use Histograms
https://www.robustperception.io/how-does-a-prometheus-histogram-work/ https://prometheus.io/docs/practices/histograms/