influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

How to use Telegraf in a Docker cluster environment #462

Closed johnrengelman closed 8 years ago

johnrengelman commented 8 years ago

I'm considering using Telegraf for gathering metrics in my Docker cluster (using Rancher), but I'm running into a roadblock on one item in particular.

Many of our applications are Java apps that will be packaged up into containers and deployed. However, the support for the JVM in Telegraf is via the Jolokia plugin, which uses a "pull" method (i.e. it needs to know which endpoints to connect to in order to gather data).

In a cluster environment, I don't have a way of knowing this up front. Even if I could dynamically provision the Telegraf config when a new container is scheduled, it would require bouncing the Telegraf container, and that could result in dropped metrics.

I'm wondering what would be the best way to support this. Is there a mechanism that I could use to run Telegraf in more of a "push" mode, where individual applications (specifically JVM ones) could push their metrics into it?

johnrengelman commented 8 years ago

A possible solution that I'm going to explore is using the statsd plugin for Telegraf so that it can accept that input, and then running a "sidekick" container alongside each of my JVM containers that uses JMXTrans to push the JMX data to Telegraf, which can then forward it on.
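To illustrate the "push" direction described above, here is a minimal sketch of what a sidekick process would send to a StatsD listener. It is not JMXTrans, just the wire format in plain Python. The metric name, host, and port 8125 (the conventional StatsD port) are assumptions for illustration, and the function names are hypothetical.

```python
import socket


def format_statsd_gauge(name: str, value: float) -> bytes:
    """Build a StatsD gauge datagram of the form <name>:<value>|g."""
    return f"{name}:{value}|g".encode("ascii")


def send_statsd(payload: bytes, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Send one datagram over UDP (fire-and-forget, no delivery guarantee).

    host and port are assumptions; point them at wherever the
    StatsD listener is actually bound.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
```

A sidekick could then call something like send_statsd(format_statsd_gauge("jvm.heap.used", 123456)) on a timer. Because the application pushes to a known address, the collector never needs to discover the container's endpoint.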

sparrc commented 8 years ago

@johnrengelman Running the Telegraf Statsd server is one way to solve this problem. Another way would be to use a message queue, such as Kafka. Your apps could push line-protocol into Kafka and then you can use Telegraf's kafka_consumer plugin to read those messages from Kafka.

Another thing you might consider is whether Telegraf really needs to be a part of your stack. Looking at JMXTrans, it appears that it already supports graphite output, and InfluxDB supports a graphite input, so you could just forward your metrics directly from JMXTrans -> InfluxDB.

johnrengelman commented 8 years ago

All good points. I'll look into those. My initial reaction to sending metrics straight to InfluxDB is that I want to utilize Telegraf in the middle to apply some consistent tags to the data (datacenter, region, instance ID, etc). I don't want to have to replicate this data over and over. I was also thinking it would be a good central place to do any filtering, so instead of having the developers decide on which metrics to send, they could just blast everything toward Telegraf and we could drop metrics there if needed.

johnrengelman commented 8 years ago

I also want to reduce the complexity for my developers: I want to enforce that certain standard JVM metrics are sent, so they only need to worry about custom application metrics for their app. So that's a big piece of the puzzle as well.

zstyblik commented 8 years ago

@johnrengelman Telegraf should work just fine in a Docker container. You likely want to mount the host's /proc and /sys and tell Telegraf where to look for them via the HOST_PROC and HOST_SYS environment variables. Then, you want to mount the Docker socket. The only thing that's not going to work, or so I think, is lsof for collecting info about TCP/IP state.

johnrengelman commented 8 years ago

Running Telegraf in a Docker container wasn't what I was concerned about. I'm more interested in the best ways to collect metrics from containers in the dynamic nature of a cluster where I won't know which hosts are running which containers or even how many instances of each container might be running (scaling).

sparrc commented 8 years ago

@johnrengelman Makes sense. It would be reasonable for Telegraf to have a generic TCP & UDP line-protocol listener for this sort of aggregation.

sparrc commented 8 years ago

I think this is resolved. @johnrengelman feel free to re-open if there is anything else.

johnrengelman commented 8 years ago

Should we create a ticket for a generic tcp/udp line-protocol listener?

sparrc commented 8 years ago

@johnrengelman sure, please do!

pkid commented 8 years ago

@johnrengelman do you have an example of what you have done with the TCP/UDP line-protocol listener? We want to gather some Spring Boot Actuator metrics and then export them to InfluxDB. What we are doing now is just having a scheduled task write the metrics (say every 5 seconds) to InfluxDB, but we want a more "advanced" approach, especially when we have multiple instances scaled up automatically. It would be great if you could share your solution. Thank you!

johnrengelman commented 7 years ago

@pkid what we do is run a Telegraf agent on every one of the hosts in our cluster (like a daemon-set in k8s) that has the UDP or TCP listener running. Then each of our containers is set up to output metrics to this host port. We use a forked copy of https://github.com/iZettle/dropwizard-metrics-influxdb to publish from dropwizard-metrics in JVM apps (Spring Boot and Ratpack) to the Telegraf agent in line protocol format.
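For readers who want to see the shape of this setup, here is a rough sketch of a container formatting a point in InfluxDB line protocol and sending it over UDP to the agent on its host. This is not the forked dropwizard reporter mentioned above. The function names are hypothetical, and the host address and port 8094 are assumptions that depend on how the listener is configured.

```python
import socket


def _escape(value: str, chars: str) -> str:
    """Backslash-escape each character in chars, as line protocol requires."""
    for ch in chars:
        value = value.replace(ch, "\\" + ch)
    return value


def _format_field(value) -> str:
    """Render a field value: booleans, integers (i suffix), floats, or quoted strings."""
    if isinstance(value, bool):  # bool is a subclass of int, so check it first
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_line_protocol(measurement, fields, tags=None, timestamp_ns=None) -> str:
    """Build one line: measurement[,tag=value...] field=value[,...] [timestamp]."""
    if not fields:
        raise ValueError("line protocol requires at least one field")
    line = _escape(measurement, ", ")
    for key, val in sorted((tags or {}).items()):
        line += f",{_escape(key, ',= ')}={_escape(str(val), ',= ')}"
    line += " " + ",".join(
        f"{_escape(key, ',= ')}={_format_field(val)}" for key, val in fields.items()
    )
    if timestamp_ns is not None:
        line += f" {timestamp_ns}"
    return line


def send_line(line: str, host: str = "127.0.0.1", port: int = 8094) -> None:
    """Send one line over UDP; host and port are assumptions for illustration."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto((line + "\n").encode("utf-8"), (host, port))
```

Each container only needs to know the host-local address of the agent, so scaling instances up or down requires no change on the Telegraf side. Shared tags such as datacenter or region can then be added by the agent rather than by every application.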