hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

service script health checks not working #6923

Open celesteking opened 4 years ago

celesteking commented 4 years ago

I'm into day 5 of messing with Consul and it's buggy as hell. I can't get so-called service script health checks working.

This is what I'm doing:

# cat /tmp/test.json 
{"service":
  {"name": "web",
    "port": 80,
    "check": {
      "args": ["curl", "localhost"],
      "interval": "10s"
    }
  }
}

# consul services register  /tmp/test.json
Registered service: web

# curl http://127.0.0.1:8500/v1/health/service/web?pretty
... returns only the "CheckID": "serfHealth" check, but NO service health check.

# curl  http://127.0.0.1:8500/v1/agent/checks?pretty
{}

# grep enable_scri /consul/config/client.hcl 
enable_script_checks  = true

Why on earth would it just silently skip the health check? What kind of torturing software is this? Setting the log level to trace doesn't show anything unusual.

celesteking commented 4 years ago

Taken literally from https://www.consul.io/api/agent/service.html#sample-request-2:

# curl -i  -XPUT  --data @/tmp/test.json http://127.0.0.1:8500/v1/agent/service/register
HTTP/1.1 400 Bad Request
Vary: Accept-Encoding
Date: Tue, 10 Dec 2019 20:29:24 GMT
Content-Length: 20
Content-Type: text/plain; charset=utf-8

Missing service name

This software is a complete and utter mess.

blake commented 4 years ago

Hi @celesteking,

I'm sorry you've had such a frustrating experience with Consul. It seems like the consul services register issue you referenced is a bug. I was able to reproduce the same confusing behavior on my end. The problem seems to be caused by two issues.

  1. The check parameter in the service definition is silently ignored unless a name parameter is also specified as part of the check. For example:

    {
      "service": {
        "name": "web",
        "port": 80,
        "check": {
          "name": "test",
          "args": [
            "curl",
            "localhost"
          ],
          "interval": "10s"
        }
      }
    }

    The docs for /agent/service/register/ state "…If you don't provide a name or ID for the check then they will be generated." However, this doesn't seem to be the case. Name appears to be a required field, as documented in /api/agent/check.html#register-check.

  2. The second issue is that, after the check is recognized, the Consul client does not correctly parse the corresponding script arguments from the check definition. This results in an incorrect payload being sent to the Consul server API, which generates an error.

    consul services register posted-example.json
    Error registering service "web": Unexpected response code: 400 (Invalid check: TTL must be > 0 for TTL checks)

    I was able to register the service successfully after renaming AgentServiceCheck.Args to AgentServiceCheck.ScriptArgs in Consul's api/agent.go and trying again with a newly built client. I have no idea whether this is the correct change, so I'll defer to Consul's engineering team.
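
Until the first issue is fixed, a small pre-flight check can catch unnamed checks before they are silently dropped. The following is only a rough sketch: the helper name `unnamed_checks` is hypothetical, and it assumes the definition uses either a single `check` object or a `checks` list.

```python
import json


def unnamed_checks(definition: dict) -> list:
    """Return a label for every check that has no "name" field.

    Hypothetical helper, not part of Consul. Accepts both the CLI layout
    ({"service": {...}}) and the bare HTTP API layout ({...}).
    """
    service = definition.get("service", definition)
    checks = []
    if "check" in service:
        checks.append(service["check"])
    checks.extend(service.get("checks", []))
    service_name = service.get("name", "<unnamed service>")
    return [
        f"{service_name}: check #{index}"
        for index, check in enumerate(checks)
        if not check.get("name")
    ]


# The definition from the original report: the check has no name.
original = json.loads(
    '{"service": {"name": "web", "port": 80,'
    ' "check": {"args": ["curl", "localhost"], "interval": "10s"}}}'
)
print(unnamed_checks(original))  # prints ['web: check #0']
```

An empty list means every check carries a name and should not hit the silent-skip behavior described in item 1.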

You can work around this by performing the service registration using cURL. I see you previously attempted this, and ran into a snag. It seems like you're using the same /tmp/test.json file from the consul services register request. You'll need to modify the JSON payload a bit and remove the parent service key, as shown in www.consul.io/api/agent/service.html#sample-payload, before this will succeed.

The resultant object structure should appear as follows.

# /tmp/test.json
{
  "name": "web",
  "port": 80,
  "check": {
    "name": "test",
    "args": [
      "curl",
      "localhost"
    ],
    "interval": "10s"
  }
}
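
If you'd rather not edit the file by hand, the same transformation can be scripted. This is a minimal sketch, assuming the file has the single-service layout shown earlier in this thread; `to_api_payload` and its `default_check_name` parameter are hypothetical names, and "test" is just the placeholder check name used above.

```python
import json


def to_api_payload(cli_definition: dict, default_check_name: str = "test") -> dict:
    """Convert a `consul services register` style file into an HTTP API payload.

    Hypothetical helper: drops the parent "service" key and gives the
    check a name when it does not already have one.
    """
    payload = dict(cli_definition.get("service", cli_definition))
    check = payload.get("check")
    if check is not None and not check.get("name"):
        payload["check"] = {"name": default_check_name, **check}
    return payload


cli_definition = {
    "service": {
        "name": "web",
        "port": 80,
        "check": {"args": ["curl", "localhost"], "interval": "10s"},
    }
}
print(json.dumps(to_api_payload(cli_definition), indent=2))
```

The input dictionary is left untouched, so the original file contents can still be used with the CLI once the bug is fixed.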

You should then be able to create the service using cURL.

$ curl -i -X PUT --data @/tmp/test.json http://127.0.0.1:8500/v1/agent/service/register
HTTP/1.1 200 OK
Vary: Accept-Encoding
Date: Wed, 11 Dec 2019 09:10:04 GMT
Content-Length: 0
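
If cURL isn't available, the same PUT can be issued from Python's standard library. This is a sketch under the same assumptions as the cURL call (a local agent listening on 127.0.0.1:8500); `build_register_request` is a hypothetical helper name, and the actual send is left commented out because it needs a running agent.

```python
import json
import urllib.request


def build_register_request(payload: dict, agent: str = "http://127.0.0.1:8500"):
    """Build (but do not send) the PUT request for /v1/agent/service/register."""
    return urllib.request.Request(
        url=f"{agent}/v1/agent/service/register",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = build_register_request(
    {
        "name": "web",
        "port": 80,
        "check": {"name": "test", "args": ["curl", "localhost"], "interval": "10s"},
    }
)
print(request.get_method(), request.full_url)

# To send it against a running agent:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```

Note that `urlopen` raises `urllib.error.HTTPError` on a 400 response, so a failure such as "Missing service name" surfaces as an exception rather than a return value.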

And view the corresponding health check using the health API endpoint.

$ curl http://127.0.0.1:8500/v1/health/service/web\?pretty
[
    {
        "Node": {
            "ID": "8232e63a-bb5a-49e6-0b4b-f3aded994b0e",
            "Node": "b1000.local",
            "Address": "127.0.0.1",
            "Datacenter": "dc1",
            "TaggedAddresses": {
                "lan": "127.0.0.1",
                "wan": "127.0.0.1"
            },
            "Meta": {
                "consul-network-segment": ""
            },
            "CreateIndex": 9,
            "ModifyIndex": 10
        },
        "Service": {
            "ID": "web",
            "Service": "web",
            "Tags": [],
            "Address": "",
            "Meta": null,
            "Port": 80,
            "Weights": {
                "Passing": 1,
                "Warning": 1
            },
            "EnableTagOverride": false,
            "Proxy": {
                "MeshGateway": {},
                "Expose": {}
            },
            "Connect": {},
            "CreateIndex": 27,
            "ModifyIndex": 27
        },
        "Checks": [
            {
                "Node": "b1000.local",
                "CheckID": "serfHealth",
                "Name": "Serf Health Status",
                "Status": "passing",
                "Notes": "",
                "Output": "Agent alive and reachable",
                "ServiceID": "",
                "ServiceName": "",
                "ServiceTags": [],
                "Type": "",
                "Definition": {},
                "CreateIndex": 9,
                "ModifyIndex": 9
            },
            {
                "Node": "b1000.local",
                "CheckID": "service:web",
                "Name": "Service 'web' check",
                "Status": "critical",
                "Notes": "",
                "Output": "  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n                                 Dload  Upload   Total   Spent    Left  Speed\n\r  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0\r  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0\ncurl: (7) Failed to connect to localhost port 80: Connection refused\n",
                "ServiceID": "web",
                "ServiceName": "web",
                "ServiceTags": [],
                "Type": "script",
                "Definition": {},
                "CreateIndex": 27,
                "ModifyIndex": 47
            }
        ]
    }
]
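
The response is long, so it can help to pull out just the service-level checks. The sketch below relies on what the output above shows: node-level checks such as serfHealth have an empty "ServiceID", while the script check carries the service ID. The function name `service_check_statuses` is hypothetical, and the sample is a trimmed copy of the response above.

```python
def service_check_statuses(health_response: list) -> dict:
    """Map CheckID to Status for checks that belong to a service.

    Expects the parsed JSON from /v1/health/service/<name>.
    """
    statuses = {}
    for entry in health_response:
        for check in entry.get("Checks", []):
            # Node-level checks (e.g. serfHealth) have an empty ServiceID.
            if check.get("ServiceID"):
                statuses[check["CheckID"]] = check["Status"]
    return statuses


# Trimmed copy of the response shown above.
sample = [
    {
        "Checks": [
            {"CheckID": "serfHealth", "Status": "passing", "ServiceID": ""},
            {"CheckID": "service:web", "Status": "critical", "ServiceID": "web"},
        ]
    }
]
print(service_check_statuses(sample))  # prints {'service:web': 'critical'}
```

An empty result for a service you just registered is the symptom from the original report: only serfHealth came back and the service check was missing.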

Thank you for providing such candid feedback. We clearly have some opportunities to improve the product and documentation to make it easier to use. Please let us know if you encounter additional issues.

simonctrlz commented 2 years ago

It boggles the mind that this is still an issue.

utdrmac commented 12 months ago

It boggles the mind that this is still an issue at the end of 2023.

porterctrlz commented 12 months ago

@blake what's the story here?

garry-t commented 7 months ago

It boggles the mind that this is still an issue in 2024 :(

sang-kevin commented 2 months ago

Why hasn't this defect been resolved?