prometheus / node_exporter

Exporter for machine metrics
https://prometheus.io/
Apache License 2.0

Endpoint Aggregator and Docker Stats Collector #257

Closed alexrudd closed 7 years ago

alexrudd commented 8 years ago

Hey, I wrote a couple of collectors for a work project and thought I'd check what interest there is for merging them back into the main repo.

I've seen it argued that both these features are technically provided by cAdvisor, but there are various reasons why I chose to implement them as node_exporter collectors instead, namely:

Prometheus Endpoint Aggregator/Forwarder:

This collector reads in a JSON file of Prometheus endpoints and associated labels, then scrapes all of those endpoints, applies the associated labels, and re-exposes them via the single node_exporter endpoint (with or without the "nodecollector..." namespace).

This was written to solve the case where a dynamic range of Dockerized apps run on a single host, each exposing its own Prometheus metrics, published by Docker to a random host port. A simple script queries the Docker API and builds the JSON file for the collector to read.
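As a rough illustration of the script side (the JSON layout and field names here are just an example, not necessarily the exact format my collector reads, and the Docker Go client API has shifted a bit between versions):

```go
// Illustrative sketch: list running containers via the Docker Engine API
// and write a JSON file of scrape targets with labels.
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"os"
	"strings"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/client"
)

// endpoint is a hypothetical entry for the aggregator collector to read.
type endpoint struct {
	Target string            `json:"target"` // host:port exposing /metrics
	Labels map[string]string `json:"labels"` // labels applied to the scraped series
}

func main() {
	cli, err := client.NewClientWithOpts(client.FromEnv)
	if err != nil {
		panic(err)
	}

	containers, err := cli.ContainerList(context.Background(), types.ContainerListOptions{})
	if err != nil {
		panic(err)
	}

	var endpoints []endpoint
	for _, c := range containers {
		for _, p := range c.Ports {
			if p.PublicPort == 0 {
				continue // port not published to the host
			}
			endpoints = append(endpoints, endpoint{
				Target: fmt.Sprintf("localhost:%d", p.PublicPort),
				Labels: map[string]string{"container_name": strings.TrimPrefix(c.Names[0], "/")},
			})
		}
	}

	f, err := os.Create("endpoints.json")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	json.NewEncoder(f).Encode(endpoints)
}
```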

cAdvisor does claim to be able to re-expose containers' Prometheus metrics, but the process for doing so is horribly involved, and it only makes them accessible via the REST API rather than the Prometheus /metrics page.

Docker Stats:

This collector uses the Docker Engine API to query the stats of all running containers. Currently it only exposes three of them (CPU total usage, memory usage, memory limit) but could easily be extended to expose all of the available stats.
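At its core the collector does something like this (a simplified sketch with error handling trimmed; the exact Go client API differs slightly between Docker versions):

```go
// Sketch: read per-container CPU and memory stats from the Docker Engine API.
package main

import (
	"context"
	"encoding/json"
	"fmt"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/client"
)

func main() {
	ctx := context.Background()
	cli, err := client.NewClientWithOpts(client.FromEnv)
	if err != nil {
		panic(err)
	}

	containers, err := cli.ContainerList(ctx, types.ContainerListOptions{})
	if err != nil {
		panic(err)
	}

	for _, c := range containers {
		// stream=false asks for a single stats sample rather than a stream.
		resp, err := cli.ContainerStats(ctx, c.ID, false)
		if err != nil {
			continue
		}
		var s types.StatsJSON
		json.NewDecoder(resp.Body).Decode(&s)
		resp.Body.Close()

		// The three values the collector currently exposes:
		fmt.Printf("%s cpu_total=%d mem_usage=%d mem_limit=%d\n",
			c.Names[0],
			s.CPUStats.CPUUsage.TotalUsage, // cumulative CPU time (nanoseconds)
			s.MemoryStats.Usage,            // current memory usage (bytes)
			s.MemoryStats.Limit)            // memory limit (bytes)
	}
}
```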

I haven't looked into exactly why, but this particular Docker stats API request is incredibly slow (1.9s). Luckily for my uses this isn't a problem, but some sort of pre-emptive caching could probably improve it.
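As a rough sketch of the caching idea (not something the collector does today; the types and the fetch function below are placeholders), a background goroutine could refresh the stats on an interval so scrapes only ever read an in-memory snapshot:

```go
package dockerstats

import (
	"sync"
	"time"
)

// containerStats holds the handful of values actually exposed.
type containerStats struct {
	CPUTotalUsage uint64
	MemoryUsage   uint64
	MemoryLimit   uint64
}

// fetchAllStats stands in for the slow Engine API calls (placeholder here).
func fetchAllStats() map[string]containerStats {
	return map[string]containerStats{}
}

// statsCache keeps the latest snapshot so scrapes never wait on the Docker API.
type statsCache struct {
	mu       sync.RWMutex
	snapshot map[string]containerStats // keyed by container ID
}

// refreshLoop refreshes the snapshot in the background at a fixed interval.
func (c *statsCache) refreshLoop(interval time.Duration) {
	for range time.Tick(interval) {
		fresh := fetchAllStats()
		c.mu.Lock()
		c.snapshot = fresh
		c.mu.Unlock()
	}
}

// current returns the cached snapshot; the collector reads this on each scrape.
func (c *statsCache) current() map[string]containerStats {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.snapshot
}
```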

Again, this is something cAdvisor already does very well, but we found the resource overhead of running cAdvisor to monitor numerous containers was unreasonable (>10% CPU usage if you're using the dashboard) for the few stats we actually cared about, whereas we found node_exporter sits at a steady 1%.

State of collector implementations:

Both these collectors were implemented very quickly and undoubtedly have many bugs and areas for improvement. I'm creating this issue to see what people think and to decide whether it's worth me putting any more time into this.

Thanks, Alex

brian-brazil commented 8 years ago

Prometheus Endpoint Aggregator/Forwarder

This is considered an anti-pattern in Prometheus; we recommend using service discovery at the Prometheus level instead.

Docker Stats

I think the main question here is if we want to start reinventing cAdvisor.

but we found the resource overhead of running cAdvisor to monitor numerous containers was unreasonable

That's a bit surprising to me. Have you filed a bug with cAdvisor?

we found node_exporter sits at a steady 1%.

That's high CPU usage for the node exporter; I'd expect nearer 0.1%.

alexrudd commented 8 years ago

Thanks for the feedback. What are the reasons against aggregating Prometheus metrics? We have a steady turnover of containers and don't use any centralised service discovery that could be shared with our Prometheus server.

Even if we did I don't like the idea of opening large ranges of ingress ports (32768-60999) just to make sure Prometheus can access any possible port which the metrics may have been published to.

For me, node_exporter can do things which cAdvisor can't, and cAdvisor can do things which node_exporter can't. I'd rather not be running both, and node_exporter was the easier to modify and maintain. Cgroups are in common use, and it seems an omission that node_exporter doesn't cover them; the Docker collector was a quick way to get most of what I wanted from cAdvisor.

The excessive resource usage of cAdvisor seemed to only be while we were viewing the dashboard, but as that was the only way to see custom application metrics, it meant we were using it a lot. If there were some centralised dashboard and cAdvisor was only collecting, then usage might have been much lower.

Just checked: node_exporter is using between 0.3% and 1.0% CPU during collection on one of our hosts, though these are quite small instance types.

brian-brazil commented 8 years ago

What are the reasons against aggregating Prometheus metrics?

We believe in doing service discovery in Prometheus, and handling aggregation also in Prometheus. A single on-host daemon is a bottleneck both technically and operationally.

and don't use any centralised service discovery that could be shared with our Prometheus server.

How do your services find each other?

I'd rather not be running both, and node_exporter was the easier to modify and maintain.

That cgroups aren't handled by the node exporter is mainly a historical accident. cAdvisor already had all that support, so we saw no need to duplicate it.

The excessive resource usage of cAdvisor seemed to only be while we were viewing the dashboard, but as that was the only way to see custom application metrics, it meant we were using it a lot. If there were some centralised dashboard and cAdvisor was only collecting, then usage might have been much lower.

That makes more sense, especially on small instances. Sticking with just Prometheus on top of cAdvisor sounds like it's an option then.

alexrudd commented 8 years ago

Okay, I see the reasoning behind that.

How do your services find each other?

They don't, as such. There is some state stored by a message router, but this doesn't include all services; there's also some state in an orchestrator, but this purposefully doesn't include container-level information. I could write something that discovers all the endpoints of the currently running containers and pushes that to the Prometheus server's config, but I'd still have to open up a large port range to the container hosts, and it would be a lot more work.

Sticking with just Prometheus on top of cAdvisor sounds like it's an option then.

That still wouldn't give us the custom application stats. We did try modifying cAdvisor to include the custom app metrics in the /metrics Prometheus endpoint, but we still had to deal with the weird way of telling cAdvisor how to collect those metrics from their respective containers.

It looks like we'll probably keep running both these collectors on our own fork for now, unless I can figure out a way (that works for us) of communicating all endpoints to Prometheus and dropping the aggregator.

Is there a particular reason a non-default Docker collector wouldn't be considered? It seems like it would fill a similar niche to the supervisord collector, and Prometheus monitoring does seem to be quite prevalent in the Docker + Go community.

brian-brazil commented 8 years ago

unless I can figure out a way (that works for us) of communicating all endpoints to Prometheus and dropping the aggregator.

Consul is popular.
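For example, something on each host could register every container's metrics endpoint with the local Consul agent, and Prometheus would then discover them via consul_sd_configs. A rough sketch using the Consul Go API, with a placeholder service name, ID, and port:

```go
// Sketch: register a container's metrics endpoint with the local Consul agent
// so Prometheus can discover it through consul_sd_configs.
package main

import (
	consul "github.com/hashicorp/consul/api"
)

func main() {
	client, err := consul.NewClient(consul.DefaultConfig()) // talks to the local agent
	if err != nil {
		panic(err)
	}

	// Placeholder values: in practice these would come from the Docker API
	// for each running container.
	reg := &consul.AgentServiceRegistration{
		ID:   "myapp-3f2a",    // unique per container
		Name: "myapp-metrics", // service name Prometheus selects on
		Port: 32771,           // the randomly published host port
		Tags: []string{"prometheus"},
	}

	if err := client.Agent().ServiceRegister(reg); err != nil {
		panic(err)
	}
}
```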

Is there a particular reason a non-default Docker collector wouldn't be considered?

I don't think we'd accept a Docker module in the node exporter, as that'd probably be better as a separate exporter. A cgroups module could come to pass if cAdvisor were to become non-viable.

alexrudd commented 8 years ago

Okay, will think more on this.

Thanks!

ZhenyangZhao commented 8 years ago

Hi @brian-brazil @AlexRudd, I have a question and hope you can help me. Why does Prometheus use cAdvisor instead of container_exporter? Is there another way to use Prometheus to monitor Docker containers without using cAdvisor?

Thanks all

alexrudd commented 8 years ago

Hey, I think I tried container_exporter but it couldn't handle monitoring more than a handful of running containers without failing. Also, it was written using fsouza/go-dockerclient instead of Docker's own Engine API (not a huge problem, but I'd rather use the official client when available).

cAdvisor provides a lot (maybe all?) of the same stats and isn't limited to just containers being run by Docker. Also it's being actively developed and maintained.

ZhenyangZhao commented 8 years ago

@AlexRudd Thanks a lot.

jescarri commented 7 years ago

So what would be the best solution if I only have one port available for Prometheus exporters? Is there any good way of using a single port while running multiple exporters? Is aggregating all the exporters into a single page against Prometheus design/usage patterns?

brian-brazil commented 7 years ago

Is aggregating all the exporters into a single page against Prometheus design/usage patterns?

Yes, you need a port each or some form of reverse proxy.
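As a rough sketch of the reverse-proxy option (the ports and paths below are placeholders), a small path-based proxy on the one open port can forward to exporters listening locally; each path is still scraped as its own target rather than aggregated into one page:

```go
// Sketch: expose several local exporters through one port by routing on path.
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

// proxyTo returns a handler that forwards requests to a local exporter.
func proxyTo(target string) http.Handler {
	u, err := url.Parse(target)
	if err != nil {
		log.Fatal(err)
	}
	return httputil.NewSingleHostReverseProxy(u)
}

func main() {
	mux := http.NewServeMux()
	// Placeholder exporter ports; Prometheus scrapes /node/metrics and
	// /docker/metrics as separate targets on the same host port.
	mux.Handle("/node/", http.StripPrefix("/node", proxyTo("http://localhost:9100")))
	mux.Handle("/docker/", http.StripPrefix("/docker", proxyTo("http://localhost:9323")))

	log.Fatal(http.ListenAndServe(":9999", mux))
}
```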

jescarri commented 7 years ago

Thanks!