For the routers we'd also like to expose tenant metrics so we have a central gathering point. @ramr I would prefer to expose a stats endpoint via Prometheus so that we don't have two stats-gathering technologies at play. If you think there is something that would work better, let me know - raw stats are not as useful on their own, and I'd like to correlate service+namespace with the traffic metrics.
@ncdc On the registry, I think the answer is no, but I think we should add it.
We should be able to use Prometheus relabelling to add service & namespace as labels, as long as we can parse out that info from the backend/frontend config. Alternatively, it would be quite simple to write a custom bridge to add this metadata in.
With routers we need to be sensitive to scale - we may have a hundred thousand or more routes to a single router pair, and in HA setups we'll want to gather from both. We also need to be able to support metrics for other kinds of front ends like Apache and Nginx, even if we don't do that in the initial implementation. It seems like the router manager proc is going to sample the stats endpoint anyway.
Any alternative solution will have to be scalable and flexible in a similar way. I know there was a simple HAProxy scraper for Prometheus but I have no idea what its gaps would be.
Correct, the registry doesn't have any Prometheus integration at the moment. It currently has reporting integration points with bugsnag and newrelic. What is needed to add support for Prometheus? What sort of data are you looking for?
@ncdc Details of how to add Prometheus metrics & expose them are at https://godoc.org/github.com/prometheus/client_golang/prometheus. As for what data, I'm not really sure... anything that can be used to monitor the performance of the registry - that requires knowledge of the internals of the registry, I guess. Things like response times, number of images per namespace, storage used, etc. sound like good candidates, but as I said, anything that could be used to monitor the registry, both for alerting on issues & for building trends over time.
@jimmidyson ok, we'll want to ultimately turn this into an upstream proposal for docker/distribution. At the very least, we could probably wrap the main app http.Handler with https://godoc.org/github.com/prometheus/client_golang/prometheus#InstrumentHandler, similar to how they already are doing for bugsnag and newrelic.
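For illustration, a minimal sketch of that wrapping, assuming the client_golang API of the time (prometheus.InstrumentHandler & prometheus.Handler, both since superseded by promhttp); appHandler here is just a stand-in for the registry's real app handler:

package main

import (
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
)

func main() {
    // Stand-in for the registry's main app http.Handler (the real one is
    // built by the registry's handlers package).
    var appHandler http.Handler = http.NotFoundHandler()

    // InstrumentHandler records request counts, durations and sizes for the
    // wrapped handler, labelled with the given handler name.
    instrumented := prometheus.InstrumentHandler("registry", appHandler)

    mux := http.NewServeMux()
    mux.Handle("/", instrumented)
    // Expose the collected metrics for Prometheus to scrape.
    mux.Handle("/metrics", prometheus.Handler())

    http.ListenAndServe(":5000", mux)
}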
@ncdc That sounds like a quick (hopefully easy) win.
@smarterclayton Here's an example output from the Prometheus exporter for HAProxy. There's only one route in there - fabric8 - with only one endpoint - 172.17.0.5:9090. You can see that the metrics are labelled appropriately, e.g.:
haproxy_server_bytes_in_total{backend="be_http_default-fabric8",server="172.17.0.5:9090"} 22020
During Prometheus relabelling when ingesting metrics, we could roll stats up to namespace (default in this case) & service (fabric8 in this case), dropping labels we're not interested in, perhaps server (endpoint). We can also aggregate these metrics on ingestion so that we have stats per namespace, etc. as required.
What do you think? Adding the Prometheus haproxy_exporter as a sidecar container in the router pod would be simplest, although we could also scrape it remotely if need be.
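For illustration, a minimal Go sketch of the custom-bridge idea, assuming the be_<proto>_<namespace>-<service> backend naming visible in the output above (the function name is hypothetical, and the split is ambiguous if the namespace itself contains a "-"):

package main

import (
    "fmt"
    "strings"
)

// splitBackend derives namespace and service from an HAProxy backend name
// such as "be_http_default-fabric8".
func splitBackend(backend string) (namespace, service string, ok bool) {
    parts := strings.SplitN(backend, "_", 3)
    if len(parts) != 3 {
        return "", "", false
    }
    nsSvc := strings.SplitN(parts[2], "-", 2)
    if len(nsSvc) != 2 {
        return "", "", false
    }
    return nsSvc[0], nsSvc[1], true
}

func main() {
    ns, svc, ok := splitBackend("be_http_default-fabric8")
    fmt.Println(ns, svc, ok) // default fabric8 true
}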
Sidecar is a good place to start - because that decouples the router component from the Go code we use (that way you can switch to apache and you just need to get its own sidecar).
Just saw this - bad filter rules!! Yeah, given that there may be different router implementations, exposing the metrics via some standard interface a la Prometheus is definitely better. Just FYI, we do expose the stats host/port for haproxy today, so collecting the metrics is easy enough with a Prometheus ${router-type}.exporter sidecar container.
That said, the main router command code is sort of generic and creates the deployment configuration, so adding a sidecar container for one type of router (haproxy) and not for the others might be somewhat clunky. An alternative might be to run the infra router (which runs as the docker container watching for routes/endpoints and launches/reconfigures haproxy) with the collection sidecar code - for the specific plugin type - running in-process rather than outside as a sidecar. That might work better from a process management standpoint as well.
Ultimately the router command probably should just be a template. It was kind of a bridge until we had service accounts and some other tools.
Where possible, I would prefer not to have to have code plugins for the router, because it requires a much higher bar for 3rd parties.
@ramr Using the HAProxy stats endpoint & the Prometheus haproxy_exporter as a sidecar is exactly how I ingested metrics into Prometheus - it worked nicely & allows us to relabel metrics with namespace & service, which is nice.
I prefer the idea of running the exporter as a sidecar container - for one thing, it allows us to swap/upgrade impls if need be without affecting the core infra router code. Also, getting fixes/features into the exporters as required (and I'm sure there will be some) without vendoring & carrying them in the infra router is going to be simpler.
@ramr @smarterclayton Any news on this? I'd like to get this in, but with the current implementation of the oadm router cmd this is pretty tricky.
I could make the addition of the Prometheus exporter sidecar optional via a flag (defaulted to true?). We could also only add the sidecar if there's a compatible exporter for the router type, so only for haproxy & nginx to begin with.
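For illustration, a rough sketch of the flag + compatibility check (using the standard flag package for brevity; the flag name, map and image names are illustrative, not the actual oadm router code):

package main

import (
    "flag"
    "fmt"
)

// Router types for which a compatible Prometheus exporter image is known.
var exporterImages = map[string]string{
    "haproxy": "prom/haproxy-exporter",
    "nginx":   "nginx-exporter", // hypothetical image name
}

func main() {
    exposeMetrics := flag.Bool("expose-metrics", true, "add a Prometheus exporter sidecar when a compatible exporter exists")
    routerType := flag.String("type", "haproxy", "router implementation")
    flag.Parse()

    if image, ok := exporterImages[*routerType]; *exposeMetrics && ok {
        fmt.Printf("would add exporter sidecar %s to the router deployment config\n", image)
    } else {
        fmt.Println("no exporter sidecar added")
    }
}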
Thoughts?
@jimmidyson - I can only look at it sometime towards the end of next week. But that plan does sound good - doing it only for the compatible router types and adding a flag to turn it on (I'm on the fence about the default, but true should be ok I think).
We have metrics for the router now.
@ncdc Any thoughts on registry metrics?
@pweil- @miminar for registry metrics ideas
@danmcp the router bits are complete - I guess the registry bits are pending, so can you please assign to @pweil- or @miminar Thx
There is already an upstream request for providing Prometheus metrics, which was turned down. Upstream prefers to stay metrics-backend agnostic and suggests processing the registry log, which contains all the information needed.
The registry's logging framework supports a wide variety of logging sinks. We could use another sidecar container inside the registry pod to process the log and provide the metrics.
There are also webhooks that could be used to gather metrics. I would have to do a deeper analysis because I'm not sure they provide all the data needed.
Other ideas?
Using logs sounds fine - might be worth looking at https://github.com/google/mtail?
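For illustration, a rough Go sketch of what a hand-rolled log-processing sidecar could look like (mtail would replace this with a declarative program); the metric name and the assumption that the status code appears as a space-delimited token in each access-log line are illustrative only:

package main

import (
    "bufio"
    "net/http"
    "os"
    "strings"

    "github.com/prometheus/client_golang/prometheus"
)

var serverErrors = prometheus.NewCounter(prometheus.CounterOpts{
    Name: "registry_log_server_errors_total",
    Help: "Registry access-log lines that look like 5xx responses.",
})

func main() {
    prometheus.MustRegister(serverErrors)

    // Tail the registry log on stdin and bump counters per line.
    go func() {
        scanner := bufio.NewScanner(os.Stdin)
        for scanner.Scan() {
            line := scanner.Text()
            if strings.Contains(line, " 500 ") || strings.Contains(line, " 503 ") {
                serverErrors.Inc()
            }
        }
    }()

    // prometheus.Handler() was the exposition handler in client_golang at the time.
    http.Handle("/metrics", prometheus.Handler())
    http.ListenAndServe(":9102", nil)
}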
Can we easily convert expvar to prometheus? If not, let's just expose a simple prometheus endpoint and collect the metrics we do have.
Prometheus does have an expvar collector (https://godoc.org/github.com/prometheus/client_golang/prometheus#ExpvarCollector):
ExpvarCollector collects metrics from the expvar interface. It provides a quick way to expose numeric values that are already exported via expvar as Prometheus metrics. Note that the data models of expvar and Prometheus are fundamentally different, and that the ExpvarCollector is inherently slow. Thus, the ExpvarCollector is probably great for experiments and prototyping, but you should seriously consider a more direct implementation of Prometheus metrics for monitoring production systems.
I guess we'd need to quantify what slow means & what the impact is. It's a shame we can't do more direct instrumentation of course.
We could simply have the prometheus expvar collector shim inside of the registry code.
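For illustration, a minimal sketch of that shim, assuming the registry already publishes a counter via expvar (the expvar name and Prometheus metric name here are illustrative) and using prometheus.NewExpvarCollector from client_golang:

package main

import (
    "expvar"
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
)

func main() {
    // Stand-in for a value the registry already exports via expvar.
    requests := expvar.NewInt("registry.requests")
    requests.Add(1)

    // Map expvar names onto Prometheus metric descriptions.
    collector := prometheus.NewExpvarCollector(map[string]*prometheus.Desc{
        "registry.requests": prometheus.NewDesc(
            "registry_requests_total",
            "Requests handled by the registry (bridged from expvar).",
            nil, nil,
        ),
    })
    prometheus.MustRegister(collector)

    // Expose the bridged metrics for Prometheus to scrape.
    http.Handle("/metrics", prometheus.Handler())
    http.ListenAndServe(":5001", nil)
}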
Router now has metrics as of v3.6.0-alpha.1. Registry is in the process of getting some.
registry metrics implemented in https://github.com/openshift/origin/pull/12711
Most components expose metrics for ingestion into Prometheus. I'd like to see the same for haproxy & the docker registry.
HAProxy should be simple enough, running a container in the same pod using the Prometheus HAProxy exporter (https://github.com/prometheus/haproxy_exporter/).
Does the Docker registry expose Prometheus metrics natively?