influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

Add integration for service discovery & kv config stores (dynamic config) #272

Open pauldix opened 8 years ago

pauldix commented 8 years ago

If there's some standard service discovery to connect to like Consul, it would be cool to have Telegraf connect to that and automatically start collecting data for services that Telegraf supports.

So when a new MySQL server comes on, Telegraf will automatically start collecting data from it.

Just an idea. Users could also get this by just having Telegraf part of their deploys when they create new servers.
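
A rough sketch of that idea, assuming Consul's catalog API from github.com/hashicorp/consul/api; the service name "mysql" and the generated `[[inputs.mysql]]` fragment are only illustrative, not something Telegraf does today:

```go
package main

import (
	"fmt"
	"log"
	"strconv"
	"strings"

	consul "github.com/hashicorp/consul/api"
)

// Ask Consul for every registered "mysql" instance and print a Telegraf
// [[inputs.mysql]] fragment for them. Purely illustrative; none of this
// exists in Telegraf itself.
func main() {
	client, err := consul.NewClient(consul.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// List all instances of the "mysql" service in the Consul catalog.
	instances, _, err := client.Catalog().Service("mysql", "", nil)
	if err != nil {
		log.Fatal(err)
	}

	// Build go-sql-driver style DSNs as used by the Telegraf mysql input
	// (credentials omitted here).
	dsns := make([]string, 0, len(instances))
	for _, inst := range instances {
		dsn := fmt.Sprintf("tcp(%s:%d)/", inst.ServiceAddress, inst.ServicePort)
		dsns = append(dsns, strconv.Quote(dsn))
	}

	// A config fragment that could be dropped into /etc/telegraf/telegraf.d/.
	fmt.Println("[[inputs.mysql]]")
	fmt.Printf("  servers = [%s]\n", strings.Join(dsns, ", "))
}
```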

sparrc commented 8 years ago

:+1:

rvrignaud commented 8 years ago

Hello @pauldix, what you suggest is, I think, what I tried to explain here: https://github.com/influxdb/telegraf/issues/193#issuecomment-140688471. Prometheus supports a wide range of discovery mechanisms (Consul included). I'm personally interested in Kubernetes discovery.

titilambert commented 8 years ago

@pauldix @rvrignaud see PR about etcd here : https://github.com/influxdata/telegraf/pull/651

chris-zen commented 8 years ago

Hi @titilambert, your PR is really useful for updating the Telegraf configuration dynamically, such as changing input and output configurations from time to time. But for service discovery in a system such as AWS, Mesos, or Kubernetes, where things scale dynamically, something like the service discovery features implemented in Prometheus would be really great.

@rvrignaud's explanation is here, and the Prometheus documentation shows the different possibilities supported.

Having this feature would definitely make me move to InfluxDB while keeping the Prometheus instrumentation library.

titilambert commented 8 years ago

@chris-zen that's very interesting! I agree with you, I would love to see that, but this kind of service discovery is more for scheduled (polling) monitoring systems (like Prometheus), isn't it? I don't know if a decentralized (pushing) system like Telegraf is suited to this...

What do you think?

chris-zen commented 8 years ago

Yes, I agree that it is especially important for polling. But Telegraf already supports polling inputs, such as the one for Prometheus. Right now the prometheus input only allows static config, but it would be very useful to support service discovery too. My understanding is that Telegraf is quite versatile and allows both pull and push models, but the pull model without service discovery is worthless in such dynamic environments.

sparrc commented 8 years ago

Just dropping this here for reference on what I think is a good service discovery model (from prometheus): https://prometheus.io/blog/2015/06/01/advanced-service-discovery/. Same as mentioned above but I think this blog post is a little more approachable than their documentation.

I think that the "file-based" custom service discovery will be easy to implement. Doing DNS-SRV, Consul, etc. will take a bit more work, but it's certainly doable.

I'm imagining some sort of plugin system for these, where notifications on config changes and additions could be sent down a channel, and whenever Telegraf detects one of these it would apply and reload the configuration.
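
A minimal sketch of what such a plugin channel could look like, using made-up names (`ConfigSource`, `runSources`); nothing here reflects an actual Telegraf API:

```go
package config

import "context"

// ConfigSource is a hypothetical plugin interface for the idea above: each
// source (file watcher, Consul, DNS-SRV, ...) pushes the full updated TOML
// configuration down a shared channel whenever it detects a change.
type ConfigSource interface {
	// Watch blocks until ctx is cancelled, sending updated config bytes
	// on the changes channel as they are detected.
	Watch(ctx context.Context, changes chan<- []byte) error
}

// runSources fans in change notifications from all sources and applies them.
func runSources(ctx context.Context, sources []ConfigSource, apply func([]byte) error) {
	changes := make(chan []byte)
	for _, src := range sources {
		go func(s ConfigSource) {
			_ = s.Watch(ctx, changes) // errors ignored in this sketch
		}(src)
	}
	for {
		select {
		case <-ctx.Done():
			return
		case cfg := <-changes:
			// On error, keep running with the previously loaded config.
			if err := apply(cfg); err != nil {
				continue
			}
		}
	}
}
```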

sparrc commented 8 years ago

My preference would be to start with simple file & directory service discovery. This would be an inotify goroutine that would basically send a service reload (SIGHUP) to the process when it detects a change in any config file, or when any config file is added to or removed from a config directory.

This could be extended using https://github.com/docker/libkv or something similar, launching a goroutine that overwrites the on-disk config file(s) when it detects a change (basically a very simple version of confd).

This would solve some of the issues that I have (and that @johnrengelman and @balboah raised) with integrating with a kv-store. In essence, we wouldn't be dependent on a kv-store, and we wouldn't have any confusion over the currently-loaded config, because the config would always also be on-disk.
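
A rough sketch of that libkv idea, assuming Consul as the backend; the KV key name and the config path are made up for illustration:

```go
package main

import (
	"log"
	"os"

	"github.com/docker/libkv"
	"github.com/docker/libkv/store"
	"github.com/docker/libkv/store/consul"
)

func main() {
	// Register the Consul backend with libkv.
	consul.Register()

	kv, err := libkv.NewStore(store.CONSUL, []string{"127.0.0.1:8500"}, &store.Config{})
	if err != nil {
		log.Fatal(err)
	}

	// Watch a single key that holds the entire Telegraf config (made-up key).
	stop := make(chan struct{})
	events, err := kv.Watch("telegraf/telegraf.conf", stop)
	if err != nil {
		log.Fatal(err)
	}

	// Whenever the key changes, overwrite the on-disk config; the file
	// watcher / SIGHUP reload described above then picks it up.
	for pair := range events {
		if err := os.WriteFile("/etc/telegraf/telegraf.conf", pair.Value, 0o644); err != nil {
			log.Println("writing config:", err)
		}
	}
}
```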

sparrc commented 8 years ago

Curious what others think of this design. I'm biased, but this is my view:

pros:

cons:

pauldix commented 8 years ago

I like it. That was one thing that used to be tricky with Redis: you could issue commands to alter the running config, but if you restarted your server without updating the on-disk config, you were hosed.

The file write isn't a big deal. It's not like they're going to be updating the config multiple times a second, minute, or even hour.

sofixa commented 8 years ago

@pauldix You might be updating your config multiple times per hour or more if you are in a highly dynamic environment, like an AWS Auto Scaling group or a Docker Swarm/Kubernetes/fleetd/LXD container setup. But even then, @sparrc's proposed implementation sounds very good, combining flexibility with resiliency (you aren't depending on your KV store/network always being up). +1

panda87 commented 8 years ago

Hi guys, any updates on this monitoring methodology? My company is starting to implement Mesos and Marathon as a scheduler, and we find monitoring services (MySQL, ES, etc.) very difficult with the current Telegraf architecture. It seems that the only way right now is to use Prometheus, as mentioned above, because of its support for dynamic service discovery.

@sparrc can you please share the current state design?

Thanks

toni-moreno commented 7 years ago

Hi everybody, I'm new to this discussion and I would like to add my point of view.

Everybody knows how important it now is to give our agents the ability to get their configuration, and to discover configuration changes, from a centralized configuration system.

As I have read in this thread (and others, e.g. https://github.com/influxdata/telegraf/pull/651), there are different ways to get remote configuration:

https://github.com/docker/libkv (for etcd or other KV store backends)
https://github.com/spf13/viper (for remote config storage)

Anyway, the most important thing (IMHO) is to add the ability to easily manage changes across all our distributed agents. When no solution is available yet, the easiest way should be the best, so yesterday I made a really simple proposal in https://github.com/influxdata/telegraf/issues/1496 that could easily be coded in a few lines (with the same behaviour if you switch to the https://github.com/spf13/viper library).

Once this simple feature is added, we can continue the discussion on other, more sophisticated ways to get configurations and to integrate with known centralized systems (like etcd and others).

I vote for adding a simple centralized way first, and an integrated solution afterwards. Both would cover the same functionality in different scenarios.

What do you think?

sparrc commented 7 years ago

@toni-moreno the simplest way to manage it is via files. Although fetching config over HTTP might be simple for your scenario, I can imagine ways in which it can get complicated (just see the httpjson plugin for examples). Like I said, this feature needs to first be coded as a file watcher, and then we can develop plugins around changing the on-disk file(s).

blaggacao commented 7 years ago

There is one commonly used abstraction pattern available; the only thing that would be needed is hot config reloading:

https://github.com/kelseyhightower/confd/ is a single binary which watches any (many) kind(s) of backend(s) and templates the configuration file upon detected changes.

I'm about to implement something for rancher catalogue items. https://github.com/influxdata/influxdata-docker/pull/9 is related.

The pattern is rather simple to manage with sidekicks and shared volumes.


One step further:

@sparrc I think this is almost a no-brainer, as only the signalling to the telegraf process would need some extra thought; the rest is taken care of.

sparrc commented 7 years ago

the signaling would simply be the file changing on disk, there is no need for confd to directly signal to Telegraf as far as I understand it.

blaggacao commented 7 years ago

Absolutely right.

panda87 commented 7 years ago

@sparrc Hi sparrc, any new updates on this?

3fr61n commented 7 years ago

Hi guys, very interesting discussion. I totally agree with keeping Telegraf 'separate' from etcd/viper/etc.; however, it needs to somehow track any file changes made by those apps and be able to apply those changes on the fly.

Does anyone know if this is going to be the way to go, and how it is going to be implemented?

sparrc commented 7 years ago

@3fr61n, yes, the initial implementation will be a file/directory watcher that will be able to dynamically reload the configuration any time that the file(s) change.

I'm not sure the "how" yet, maybe this: https://github.com/fsnotify/fsnotify
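
For reference, a minimal standalone sketch of that idea with fsnotify: watch a config directory and SIGHUP a running Telegraf process on any change (how you obtain the PID is up to your setup; here it's just a command-line argument, and the signalling is Unix-only):

```go
package main

import (
	"log"
	"os"
	"strconv"
	"syscall"

	"github.com/fsnotify/fsnotify"
)

// Usage (illustrative): watcher /etc/telegraf/telegraf.d <telegraf-pid>
func main() {
	dir := os.Args[1]
	pid, err := strconv.Atoi(os.Args[2])
	if err != nil {
		log.Fatal(err)
	}

	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		log.Fatal(err)
	}
	defer watcher.Close()

	if err := watcher.Add(dir); err != nil {
		log.Fatal(err)
	}

	for {
		select {
		case ev := <-watcher.Events:
			// Any create/write/rename/remove in the config directory
			// triggers a reload; Telegraf reloads its config on SIGHUP.
			log.Println("config change detected:", ev)
			if err := syscall.Kill(pid, syscall.SIGHUP); err != nil {
				log.Println("signal error:", err)
			}
		case err := <-watcher.Errors:
			log.Println("watch error:", err)
		}
	}
}
```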

chopj commented 7 years ago

Any updates on this feature? What is the recommended way to do service/dynamic config discovery today? I'm curious what third-party solutions people are using if this is not natively supported.

opsnull commented 7 years ago

any update?

danielnelson commented 6 years ago

This is something I am working on. The current plan is similar to what is described above, but instead of using inotify it will continue to require a signal to trigger config changes. Once this is done we should be able to work on creating more elaborate configuration plugins.

3fr61n commented 6 years ago

Hi

In the consul branch we have some proof of concept.

Each time any KV is modified in Consul, all containers are notified; they then render their config templates and reload their processes (not the containers).

We are still beta testing, because it's a huge change compared with the current infrastructure.

If you want to test it, feel free to use it.

itzg commented 6 years ago

This POC includes dynamic management of input plugin configurations; however, it went a different route than using KV stores and service discovery. I just wanted to share since the agent changes might be helpful in part or as a whole for a service discovery approach.

The README on that POC includes a demo write-up of how the "managed input" concept works in practice.

danielnelson commented 6 years ago

Thanks @itzg, I'll take a look at it. @3fr61n can you link to the code you are referring to?

abraithwaite commented 6 years ago

This would be incredibly useful to have for selfish reasons.

Prometheus kubernetes discovery using annotations is pure gold. I would love to have this in telegraf.

https://github.com/prometheus/prometheus/tree/master/discovery/kubernetes

opsnull commented 6 years ago

any update?

atzoum commented 6 years ago

#2846 appears to be blocked by this too.

danielnelson commented 6 years ago

I hope to have a pull request up soon for further discussion, it will contain a configuration plugin system in a style similar to the current input/output plugins.

danielnelson commented 6 years ago

@abraithwaite Can you take a look at the kubernetes_services option we added to the prometheus input and see if it works for your use case? It is only on the master branch, but you can use the nightly builds.

abraithwaite commented 6 years ago

Unfortunately not. The value that prometheus provides with Kubernetes is that you configure metrics collection via the service (with kubernetes annotations) and not through the metrics collection agent.

This enables users to configure everything they need without having to setup something outside the scope of their own services.

I can provide examples if needed, just lemme know.

danielnelson commented 6 years ago

@abraithwaite can you link me to the Kubernetes documentation for the method you are using?

abraithwaite commented 6 years ago

Haven't seen any official documentation, actually. Just pieced it together from code, examples and blog posts:

https://github.com/prometheus/prometheus/blob/master/documentation/examples/prometheus-kubernetes.yml
https://coreos.com/blog/prometheus-and-kubernetes-up-and-running.html
https://github.com/prometheus/prometheus/issues/2989
https://github.com/prometheus/prometheus/issues/2009
https://movio.co/en/blog/prometheus-service-discovery-kubernetes/

abraithwaite commented 6 years ago

FWIW, I don't use prometheus with Kubernetes but the concept is extremely valuable and I'd still love to see it here.

I looked at the Telegraf code, though, and I'm certain you'd need to add service discovery as a first-class configuration method.

danielnelson commented 6 years ago

Just to clarify, the kubernetes_services option allows you to use the Kubernetes DNS cluster add-on to find and scrape prometheus endpoints without needing to update your Telegraf configuration file when a service is started/stopped.

abraithwaite commented 6 years ago

Right, I understand that. It still requires an explicit dependency between the service and telegraf, instead of an implicit one.

When using annotations, there is no PR a user has to make to update the Telegraf config in order to start having metrics collected from their service.

tmedford commented 6 years ago

I can agree that "Prometheus kubernetes discovery using annotations is pure gold. I would love to have this in telegraf." We use this to have prometheus dynamically find new targets. Would love to move back to telegraf for collection of metrics and uptime if this was supported.

narayanprabhu commented 5 years ago

Hi, I'm pretty new to the TICK stack and still getting used to it. We are trying to set up the TICK stack as the monitoring platform for our organization. One question that keeps coming up is how we manage the configurations: for instance, if we need to monitor one service/process on a server, we would have to change the config on that server and restart Telegraf. After doing some research I found this page, and I think I'm posting my concern in the right place. Do we have a working model to manage configuration centrally?

voiprodrigo commented 5 years ago

@narayanprabhu I use Puppet to ease that kind of pain. It knows all the services that are “ensured” on each server, and that makes it easier to deploy a matching Telegraf config.

Sent with GitHawk

narayanprabhu commented 5 years ago

@voiprodrigo Yes, Puppet is a good option; unfortunately my organization does not have that solution. They mainly rely on SCCM for Windows deployment and Ansible for Linux. This thread says there is a UI option being built into Chronograf to manage agent configs. Is that option still being built? Wondering if it is coming anytime soon.

And there is something about etcd where we can have one config consumed by other Telegraf agents. Is this an option that would help my use case? Is it something that works for Windows as well?

Jaeyo commented 5 years ago

@danielnelson any update?

danielnelson commented 5 years ago

Work is on hold right now (for the first item here), but I'm tempted to break this issue up into several issues:

  1. Loading config data from a configuration store (zookeeper/etcd/consul/etc), prototype code for a plugin config loading system here: https://github.com/danielnelson/tgconfig
  2. General purpose discovery: still needs more thought into what precisely it will be.
  3. Prometheus endpoint discovery: this will be done in #3901, plus additional work if needed, but at least in the mid-term it should be done in the prometheus input.

Jaeyo commented 5 years ago

In the InfluxDB 2.0 alpha version there is a Telegraf config generation UI, and Telegraf is guided to take its config from InfluxDB, but InfluxDB seems to have no config editing feature yet. So here's the question: does Telegraf have any plan to synchronize its config from InfluxDB 2.0?

rdxmb commented 1 year ago

Hello, so reloading the (file) config without a restart has not been implemented yet? It's a pity. @blaggacao didn't you mention it was "almost a no-brainer"?

I'd like to use telegraf with a sidecar creating the configs...

EDIT: seems like there is a --watch-config, will try that immediately