hashicorp / consul-template

Template rendering, notifier, and supervisor for @HashiCorp Consul and Vault data.
https://www.hashicorp.com/
Mozilla Public License 2.0

Monitoring consul-template #570

Open rhoml opened 8 years ago

rhoml commented 8 years ago

Is there a way to have a status/health endpoint on consul-template for monitoring that everything is ok and it can connect to consul?

sethvargo commented 8 years ago

Hi @rhoml

This is not currently possible, but it's definitely something worth considering. Originally I thought a simple HTTP server with a status endpoint would be useful, but my fear is that many users run multiple instances of consul-template on a single machine, and that could cause port collisions, etc. I'm thinking, instead, of having CT respond to one of the user-defined signals (USR1 perhaps) and returning a result that way, but I don't think it's possible to return a result to a signal call.

/cc @slackpad

sean- commented 8 years ago

The ability to see the status of a consul-template instance via Consul would be interesting. How about having consul-template register as a "local service" (this doesn't exist yet, but it would be a service without a DNS entry) and have it post a status via a TTL check? Would that work for you?
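
Assuming today's Consul agent HTTP API (the proposed "local service" flag does not exist), a minimal sketch of the TTL-check half of this idea could look like the following; the service name, agent address, and TTL are illustrative only:

    # Sketch only: register consul-template as a service with a TTL check via
    # the local Consul agent. Without a "local" flag this still appears in the
    # catalog and DNS; names and the 30s TTL are placeholders.
    curl -s -X PUT http://127.0.0.1:8500/v1/agent/service/register \
      -d '{"Name": "consul-template", "Check": {"TTL": "30s"}}'

    # consul-template (or a wrapper) would then heartbeat the check before the
    # TTL expires; the check ID defaults to "service:<service-id>".
    curl -s -X PUT http://127.0.0.1:8500/v1/agent/check/pass/service:consul-template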

sethvargo commented 8 years ago

@sean- I like this! I think it would be a good pattern to establish for all "local" consul tooling (CT, envconsul, consul-replicate, etc). What do we need to do in Consul to make that possible?

sean- commented 8 years ago

We'd need a "local" flag for a service registration in consul. Services with the local flag wouldn't have an address and could not be looked up via DNS (but do show up via the http API). But, it solves the ACL issue and provides an endpoint to register checks through which we can post status.

We could use a magic address like 0.0.0.0 to specify this type, but I'm not wild about conflating addresses with reg types and think registration needs a flag. We'll talk it up tomorrow and see what's involved.

doublerebel commented 8 years ago

I like the idea of consul-template reporting health via TTL, and also being tagged local. Monitoring a service without an address definitely is valuable.

But I think many of the use-cases for local do require an address. I have services that are local to the Consul agent and only listen on one address. Not all my services support multiple listeners, and some are purposefully segregated by address/interface. I would still want to be able to filter these services as 'local'.

slackpad commented 8 years ago

It seems like it should be easy to attach a consul-template TTL check as one of the checks for whatever service consul-template is managing, not necessarily as a separate service of its own. If consul-template dies then your instance of the service is suspect because it's no longer getting configured properly. With that it will be clear what's affected vs. just knowing that one of the consul-template instances is down.
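
A minimal sketch of that pattern against the Consul agent API, assuming the managed service is already registered as "web" (the service ID, check name, and TTL are placeholders):

    # Sketch only: attach an extra TTL check to the service consul-template is
    # rendering for, instead of registering consul-template as its own service.
    curl -s -X PUT http://127.0.0.1:8500/v1/agent/check/register \
      -d '{"Name": "consul-template-alive", "ServiceID": "web", "TTL": "1m"}'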

slackpad commented 8 years ago

Talking to @sean- offline I'm coming around to some of the earlier suggestions. Perhaps a local service can have a pid defined and no address/port which would keep it out of DNS. Tools like consul-template could register under the consul-template service name and perhaps register some extra details like command line so operators could figure out which instance it was and what it was doing.

@doublerebel I don't think I fully understand your use case for a local service that still has an address/port. Are you thinking along the lines of https://github.com/hashicorp/consul/pull/1231#issuecomment-142059460 where you want to find the instances of a given service running locally on the box with a particular agent?

jippi commented 8 years ago

@slackpad it would be nice to have this kind of service registration in Consul as part of all the Vault, CT, and envconsul services.

rhoml commented 8 years ago

@jippi totally, that would be neat for monitoring all core services.

doublerebel commented 8 years ago

@slackpad thanks for the consideration. I have run into issues where consul-template dies and a long-running service doesn't discover it until much later, when it finally restarts. Then the cause (the dead consul-template) is difficult to correlate with the effect (the service in a bad state), especially when the service (without consul-template) falls back to a default value, so it's almost-but-not-quite right.

Re: local, you're correct in referencing consul#1231; it's just the semantics of how Consul defines "local". Perhaps services without an address could be called "internal" to differentiate them from "local"? i.e. I am implementing the Vault cubbyhole method, which requires my co-process to be able to find "local" services which may or may not be "internal". But now I fear I'm derailing this issue into the local topic.

ketzacoatl commented 7 years ago

When running consul-template as a core service in a cluster (e.g. not on Nomad, but as a service which is available irrespective of Nomad's status), it's difficult to properly register consul-template as a service and ensure the health checks are correct. It would be very helpful if consul-template were to register itself as a service in the Consul catalog.

simonvanderveldt commented 7 years ago

+1 for this! A simple HTTP endpoint would be enough in our case (we run everything in separate containers).

mikezh15 commented 7 years ago

+1 for this! It would be great to register consul-template as "local/internal" service with health check on consul. @sethvargo: is there any plan to implement this idea?

alileza commented 6 years ago

> Originally I thought a simple HTTP server with a status endpoint would be useful, but my fear is that many users run multiple instances of consul template on a single machine, and that could cause port collisions, etc.

@sethvargo what about having that endpoint as an option, such as:

consul-template -health=enabled -health.port=8080

I believe it would solve the problem for some people.
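
For illustration only: the -health flags above are a proposal, not an existing consul-template option. If something like them existed, an external liveness probe could be as simple as:

    # Hypothetical: assumes a -health.port=8080 style endpoint that does not
    # exist in consul-template today.
    curl -fsS http://127.0.0.1:8080/health || echo "consul-template unhealthy"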

lesinigo commented 6 years ago

In other similar situations, having something like a "checkpoint" status file has always been good enough for us, and it may be simpler to implement than a full-fledged HTTP server/endpoint.

I wouldn't keep constantly updating the destination file's timestamp, because it could cause nasty side effects with some software consuming that file. Instead, I'd add a configuration key that accepts a file path and keeps touching it (updating its modification time) at regular intervals to signal "consul-template is working correctly and we are sure that the destination file is up to date with what was in Consul at this time".

Bonus points if it allows the "checkpoint file" to be the same as the output file, so people can choose between leaving the output file's mtime unmodified and tracking status with a different file, or having everything in one file that keeps getting its mtime updated.

Common monitoring systems can check file "freshness", usually out of the box (e.g. check_file_age), and it is also really easy to check within shell scripts, either for "max age" (e.g. find -mmin) or by comparison with other files (e.g. if [ checkpoint_file -ot some_reference_file ]).
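
As a sketch of that kind of check, assuming a hypothetical checkpoint file at /var/run/consul-template.checkpoint (the path, the 5-minute threshold, and the rendered output file are placeholders, and the checkpoint feature itself is only proposed here):

    # Sketch: flag staleness with find's -mmin, Nagios/cron style.
    if [ -z "$(find /var/run/consul-template.checkpoint -mmin -5 2>/dev/null)" ]; then
        echo "CRITICAL: checkpoint older than 5 minutes (or missing)"
        exit 2
    fi

    # ...or compare it against another file with test's -ot (here the rendered
    # output is used as one possible reference file).
    if [ /var/run/consul-template.checkpoint -ot /etc/nginx/nginx.conf ]; then
        echo "WARNING: checkpoint is older than the rendered template"
    fi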

dfredell commented 6 years ago

I would love a good way for containerpilot to monitor consul-template's health.

For now I'm just using pgrep; this way I can chain jobs together via once-healthy. All it does is verify there is a process with the name consul-template running.

    {
      "name": "consul-template",
      "exec": [
        "consul-template",
        "-config",
        "/app.hcl"
      ],
      "when": {
        "source": "consul-agent",
        "once": "healthy"
      },
      "health":{
        "exec": [
          "test",
          "$(pgrep consul-template | wc -l) -eq '1'"
        ],
        "interval": 15,
        "ttl": 25,
        "timeout": "1s"
      },
      "restarts": "unlimited"
    }
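
One possible caveat, depending on how ContainerPilot executes the exec array: if the arguments are not passed through a shell, the $(...) command substitution above may never be expanded, in which case test would always see a non-empty string and succeed. A hedged alternative is to invoke a shell explicitly, e.g. an exec of ["/bin/sh", "-c", "..."] wrapping something like:

    # Hypothetical health command: exits 0 only if exactly one consul-template
    # process is running (pgrep -x matches the process name exactly).
    /bin/sh -c 'test "$(pgrep -x consul-template | wc -l)" -eq 1'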

drawks commented 3 years ago

This issue is pretty old, what is the current best practice for monitoring consul-template?

eikenb commented 3 years ago

Hey @drawks,

You might want to consider asking this on HashiCorp's Discuss forum; more community members would probably see it there and be able to relay their solutions.

I think the answer might just be that consul-template is designed to exit if anything bad enough to trigger a failed health check happens (or at least that's the idea), so the normal process management setups you get from systemd, etc. work to keep it running without needing an external health monitor. That, plus a monitor on the process consul-template is managing (which you'd need anyway), is probably enough for most cases. Though you should probably take this with a grain of salt, as I'm just the maintainer; I don't actively use consul-template in the field at the moment and can only base my answers on past experiences and what I hear from everyone else.
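
A minimal sketch of that approach, assuming consul-template runs under systemd and the unit name consul-template.service (the name is an assumption):

    # Sketch: lean on the process manager's own view of the unit as the
    # health signal instead of a dedicated endpoint.
    systemctl is-active --quiet consul-template.service \
      || echo "consul-template.service is not active"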

Thanks.

mrwacky42 commented 3 years ago

While I don't use consul-template anymore, at some point we had Prometheus's node_exporter monitoring the systemd unit for it, and had this alarm defined in Prometheus:

avg_over_time(node_systemd_unit_state{name="consul-template.service",state="active"}[5m]) < 1