hashicorp / consul-esm

External service monitoring for Consul
Mozilla Public License 2.0

ESM Check Never Passes #73

Open alievrouw opened 4 years ago

alievrouw commented 4 years ago

I'm working on an implementation of ESM. I was able to get it running and register an external service. However, the service never shows as passing. I was able to register the example external service from the Register External Services with Consul Service Discovery guide. I see it in Consul's UI as a service, and the ESM logs show it as registered. The service never transitions from critical to passing; it just stays critical permanently. I'm trying to understand why, and whether something about the way I have ESM configured is the problem. I tried switching ping_type from udp to socket, but it made no difference. Everything seems to be working except the actual status of the check (which is the whole point). If I deploy it with a status of passing, it just stays passing (obviously).

Here is a sanitized copy of the config I'm using:

"ca_file" = "/local/consul_ca.pem"
"cert_file" = "/local/consul_cert.pem"
"consul_kv_path" = "consul-esm/"
"consul_service" = "consul-esm"
"datacenter" = "dev-aws"
"enable_syslog" = false
"external_node_meta" = {
  "external-node" = "true"
}
"http_addr" = "https://consul.service.consul:8501"
"key_file" = "/local/consul_key.pem"
"log_level" = "INFO"
"node_probe_interval" = "10s"
"node_reconnect_timeout" = "72h"
"ping_type" = "socket"
"token" = "my-token-123"
"tls_server_name" = "consul.service.consul"

I also noticed when setting the log level to debug that I get entries with '[DEBUG] No nodes to probe', even though there's a node registered. I'm not sure what that indicates. I'm running the latest 0.4.0 release.

lornasong commented 4 years ago

Hi @alievrouw - thanks for reaching out! Thanks for sharing your esm config. I took a look and it looks ok to me, nothing stands out as being off.

I'd like to try to reproduce the issue you are seeing. Would you also be able to share the request and payload you are using to register the external service? It sounds like these might be different from the ones in the "Register External Services with Consul Service Discovery" guide, since those start off with the status passing rather than critical.

Thanks!

alievrouw commented 4 years ago

Here is the payload:

[user@server ~]$ cat hashi-esm.json 
{
  "Node": "hashicorp",
  "Address": "learn.hashicorp.com",
  "NodeMeta": {
    "external-node": "true",
    "external-probe": "true"
  },
  "Service": {
    "ID": "learn1",
    "Service": "learn",
    "Port": 80
  },
  "Checks": [
    {
      "Name": "http-check",
      "Definition": {
        "http": "https://learn.hashicorp.com/consul/",
        "interval": "30s"
      }
    }
  ]
}
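As a quick sanity check on a payload like the one above, here is a minimal Python sketch that reports anything that looks off before the payload is sent. The function name `review_payload` is hypothetical, and treating both `external-node` and `external-probe` as required is an assumption taken from the values this payload sets, not from ESM's documentation.

```python
from typing import List

# Node metadata the payload in this thread sets. Treating both keys as
# required is an assumption of this sketch.
EXPECTED_NODE_META = {"external-node": "true", "external-probe": "true"}


def review_payload(payload: dict) -> List[str]:
    """Return human-readable notes about a catalog registration payload."""
    notes = []
    meta = payload.get("NodeMeta", {})
    for key, expected in EXPECTED_NODE_META.items():
        if meta.get(key) != expected:
            notes.append(
                f"NodeMeta[{key!r}] is {meta.get(key)!r}, expected {expected!r}"
            )
    for check in payload.get("Checks", []):
        name = check.get("Name", "<unnamed>")
        # Compare keys case-insensitively so "http" and "HTTP" both match.
        definition_keys = {k.lower() for k in check.get("Definition", {})}
        if not definition_keys & {"http", "tcp"}:
            notes.append(f"check {name!r} has no http or tcp target")
        if "interval" not in definition_keys:
            notes.append(f"check {name!r} has no interval")
        if "status" not in {k.lower() for k in check}:
            notes.append(f"check {name!r} sets no initial Status")
    return notes
```

For the payload above, the only note produced is that the check sets no initial status.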

And here is the request:

curl \
  --silent \
  --insecure \
  --key /path/to/key.pem \
  --cert /path/to/cert.pem \
  --request PUT \
  --data @google-esm.json \
  --header "X-Consul-Token: TOKEN123" \
https://localhost:8501/v1/catalog/register
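For anyone scripting the same registration, a rough Python equivalent of the curl call might look like the sketch below. It only builds the request so the target URL, method, and token header can be inspected. Sending it would also need a TLS context carrying the client certificate and key, which is left out here. The helper name `build_register_request` is hypothetical.

```python
import json
import urllib.request


def build_register_request(
    base_url: str, token: str, payload: dict
) -> urllib.request.Request:
    """Build (but do not send) a PUT to Consul's /v1/catalog/register."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/catalog/register",
        data=body,
        method="PUT",
        headers={
            "X-Consul-Token": token,
            "Content-Type": "application/json",
        },
    )
```

Printing `request.full_url` before sending makes it easy to see exactly which Consul agent receives the registration.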

I removed the example config option of "status": "passing" as I wanted to see the check go from 'critical' to 'passing'. The example has you set up and test this locally. I already have production and development environments running Nomad, Consul and Vault, so I am testing this there. I was testing in my dev environment and didn't have a good target to disable to see the check switch to 'critical'. In the example this is accomplished by disconnecting the local machine (where this is supposed to be running) from the Internet. I can try deploying this to my prod cluster, where I could more easily simulate a failure. I guess I'm more concerned with the fundamentals of why the check never went from 'critical' to 'passing'. Is that simply something Consul checks don't do? Do they have to be deployed as passing and then fail? If that's the case, how do they go from 'critical' back to 'passing' once the failed resource has recovered? I'm trying to understand if I've done something wrong with ESM or if I need a better understanding of Consul checks in general. Thanks for the fast response!!

lornasong commented 4 years ago

Hi @alievrouw - thanks for the additional details!

One thing I noticed is that your curl is to https://localhost:8501/v1/catalog/register while in the ESM config, the address of the local Consul agent is set to: "http_addr" = "https://consul.service.consul:8501".

I’m not sure if this is just a typo in the comment, but I mention this because of your observation that you are seeing [DEBUG] No nodes to probe in the logs, which makes me wonder if ESM is successfully retrieving the checks from Consul to run.

To confirm this, when you start up ESM (having registered your check), you should expect to see something similar to:

[INFO] Trying to obtain leadership...
[INFO] Obtained leadership
[INFO] Updating external node list, set to 1 items
[INFO] Rebalanced 1 external nodes across 1 ESM instances
[INFO] Fetched 1 nodes from catalog
[INFO] Updated 1 checks, found 1, added 1, updated 0, removed 0
[INFO] Now managing 1 health checks across 1 nodes
[INFO] Updated coordinates for node "hashicorp" with distance...

Note how several of these log lines report that ESM found 1 node, 1 check, or 1 health check.

If ESM does not retrieve any checks from Consul, you’ll instead see logs similar to:

[INFO] Rebalanced 0 external nodes across 1 ESM instances
[INFO] Fetched 0 nodes from catalog
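To tell the two cases apart in a longer log, something like the following Python sketch could pull out the relevant counts. The patterns are taken from the sample lines quoted in this comment, and the exact wording may differ between ESM versions. The function name `summarize_esm_log` is hypothetical.

```python
import re

# Patterns copied from the sample log lines quoted in this thread.
FETCHED = re.compile(r"Fetched (\d+) nodes from catalog")
MANAGING = re.compile(r"Now managing (\d+) health checks across (\d+) nodes")


def summarize_esm_log(lines):
    """Return the most recent node and check counts seen in ESM log lines."""
    summary = {"fetched_nodes": None, "managed_checks": None}
    for line in lines:
        fetched = FETCHED.search(line)
        if fetched:
            summary["fetched_nodes"] = int(fetched.group(1))
            continue
        managing = MANAGING.search(line)
        if managing:
            summary["managed_checks"] = int(managing.group(1))
    return summary
```

A `fetched_nodes` value of 0, or a `managed_checks` value that stays `None`, suggests ESM is not seeing the registered node or its checks.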

Would you be able to confirm that the curl address that you used to register the check refers to the same Consul agent as ESM’s configured local Consul agent? If those are accurate, would you be able to share the logs that you see when you start up ESM, having registered your check already?
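One way to check this is to ask the Consul address from ESM's http_addr which nodes carry the external-node metadata. The Python sketch below only builds that catalog query URL, using Consul's node-meta query parameter on /v1/catalog/nodes. The helper name `external_nodes_url` is hypothetical, and the metadata pair is assumed to match the external_node_meta block in the ESM config shared earlier.

```python
from typing import Optional
from urllib.parse import urlencode


def external_nodes_url(http_addr: str, datacenter: Optional[str] = None) -> str:
    """Build a /v1/catalog/nodes query filtered by external-node metadata."""
    # The key:value pair mirrors external_node_meta in the ESM config above.
    params = [("node-meta", "external-node:true")]
    if datacenter:
        params.append(("dc", datacenter))
    return http_addr.rstrip("/") + "/v1/catalog/nodes?" + urlencode(params)
```

Fetching that URL with the same token and certificates ESM uses should list the "hashicorp" node if the registration reached the catalog that ESM reads from.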

To answer your question about what to expect from ESM, your understanding is correct that you should expect to see that the status goes from "critical" to "passing" in your check example since https://learn.hashicorp.com/consul/ is indeed up and running. ESM updates the status from a defined check and is expected to be able to toggle between "critical" and "passing" statuses. Let me know if you have more questions.

Thanks!