hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

Inconsistent timing of checks #11310

Open lettore opened 3 years ago

lettore commented 3 years ago

Overview of the Issue

I'm using Consul for service discovery and health checking of Docker services. I register the services with Gliderlabs Registrator, and I use a custom script that reports successful execution via webhook to a Healthchecks.io instance. The problem is that Consul does not respect the configured interval when executing the check; it seems to ignore the timing entirely, running the checks at variable intervals ranging from 20 seconds to 10 minutes or more.

Expected behaviour

  1. Register a Service with script check with 30 seconds of interval
  2. Consul should trigger the check every 30 seconds

Actual behaviour

  1. Register a Service with script check with 30 seconds of interval
  2. Consul runs the check at random intervals, this varies from 20 sec to more than 10 minutes

This is wrong behaviour, as it affects the true availability of the service: the service could go down and Consul would miss the downtime because it is not executing the check.
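For reference, a minimal registration payload of the kind described above (the service name, port, and script path are placeholders) that should make the agent run the script check every 30 seconds:

```json
{
  "ID": "my-service-1",
  "Name": "my-service",
  "Port": 8080,
  "Check": {
    "ScriptArgs": ["/usr/local/bin/check.sh"],
    "Interval": "30s",
    "Timeout": "10s"
  }
}
```

(Script checks also require the agent to be started with script checks enabled, e.g. `-enable-local-script-checks`.)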

jkirschner-hashicorp commented 3 years ago

Hi @lettore,

Can you provide the following additional information?

lettore commented 3 years ago

Output of consul info

bash-5.1# consul info
agent:
        check_monitors = 3
        check_ttls = 1
        checks = 4
        services = 4
build:
        prerelease = 
        revision = c976ffd2
        version = 1.10.3
consul:
        acl = disabled
        bootstrap = true
        known_datacenters = 1
        leader = true
        leader_addr = 10.50.70.206:8300
        server = true
raft:
        applied_index = 15600730
        commit_index = 15600730
        fsm_pending = 0
        last_contact = 0
        last_log_index = 15600730
        last_log_term = 138
        last_snapshot_index = 15599367
        last_snapshot_term = 137
        latest_configuration = [{Suffrage:Voter ID:7daf862c-ee70-fb95-b499-2e3a9123cf30 Address:10.50.70.206:8300}]
        latest_configuration_index = 0
        num_peers = 0
        protocol_version = 3
        protocol_version_max = 3
        protocol_version_min = 0
        snapshot_version_max = 1
        snapshot_version_min = 0
        state = Leader
        term = 138
runtime:
        arch = amd64
        cpu_count = 4
        goroutines = 196
        max_procs = 4
        os = linux
        version = go1.16.7
serf_lan:
        coordinate_resets = 0
        encrypted = false
        event_queue = 0
        event_time = 137
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 74162
        members = 9
        query_queue = 0
        query_time = 5
serf_wan:
        coordinate_resets = 0
        encrypted = false
        event_queue = 0
        event_time = 1
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 1
        members = 1
        query_queue = 0
        query_time = 1

I'm running Consul on Docker Swarm in containers using the image consul:latest; the hosts are running Ubuntu 20.04 x64. I see no errors in the logs, but there are hundreds of lines because I use Gliderlabs/registrator to register services, so there is a lot of noise. Here are 10 minutes of logs for the Adminer service:

2021-10-14T18:42:14.500Z [TRACE] agent: Check output: check=service:NUC:Adminer_adminer.1.uieaf6gbdsuh2xve0hhiytj0v:8080 output="280 0 | 2021-10-14 18:42:14 [] Executing command curl -s -o /dev/null -I -w %{http_code} http://10.50.70.186:8080/

2021-10-14T18:42:39.558Z [TRACE] agent: Check output: check=service:NUC:Adminer_adminer.1.uieaf6gbdsuh2xve0hhiytj0v:8080 output="280 0 | 2021-10-14 18:42:39 [] Executing command curl -s -o /dev/null -I -w %{http_code} http://10.50.70.186:8080/

2021-10-14T18:43:25.518Z [TRACE] agent: Check output: check=service:NUC:Adminer_adminer.1.uieaf6gbdsuh2xve0hhiytj0v:8080 output="280 0 | 2021-10-14 18:43:24 [] Executing command curl -s -o /dev/null -I -w %{http_code} http://10.50.70.186:8080/

2021-10-14T18:46:12.055Z [TRACE] agent: Check output: check=service:NUC:Adminer_adminer.1.uieaf6gbdsuh2xve0hhiytj0v:8080 output="280 0 | 2021-10-14 18:46:11 [] Executing command curl -s -o /dev/null -I -w %{http_code} http://10.50.70.186:8080/

2021-10-14T18:46:29.692Z [TRACE] agent: Check output: check=service:NUC:Adminer_adminer.1.uieaf6gbdsuh2xve0hhiytj0v:8080 output="280 0 | 2021-10-14 18:46:29 [] Executing command curl -s -o /dev/null -I -w %{http_code} http://10.50.70.186:8080/

2021-10-14T18:47:58.970Z [TRACE] agent: Check output: check=service:NUC:Adminer_adminer.1.uieaf6gbdsuh2xve0hhiytj0v:8080 output="280 0 | 2021-10-14 18:47:58 [] Executing command curl -s -o /dev/null -I -w %{http_code} http://10.50.70.186:8080/

2021-10-14T18:51:14.440Z [TRACE] agent: Check output: check=service:NUC:Adminer_adminer.1.uieaf6gbdsuh2xve0hhiytj0v:8080 output="280 0 | 2021-10-14 18:51:14 [] Executing command curl -s -o /dev/null -I -w %{http_code} http://10.50.70.186:8080/

2021-10-14T18:53:16.229Z [TRACE] agent: Check output: check=service:NUC:Adminer_adminer.1.uieaf6gbdsuh2xve0hhiytj0v:8080 output="280 0 | 2021-10-14 18:53:15 [] Executing command curl -s -o /dev/null -I -w %{http_code} http://10.50.70.186:8080/

Look at the timestamps: there's no correlation with the interval provided. Here is the service definition:

  "service:NUC:Adminer_adminer.1.uieaf6gbdsuh2xve0hhiytj0v:8080":{
      "Node":"NUC",
      "CheckID":"service:NUC:Adminer_adminer.1.uieaf6gbdsuh2xve0hhiytj0v:8080",
      "Name":"Service 'adminer-ui' check",
      "Status":"passing",
      "Notes":"",
      "Output":"280 0 | 2021-10-14 19:01:27 [] Executing command curl -s -o /dev/null -I -w %{http_code} http://10.50.70.186:8080/\n283 0 | 2021-10-14 19:01:27 [] Command returned code: 0\n284 0 | 2021-10-14 19:01:27 [] Command result: 200\n",
      "ServiceID":"NUC:Adminer_adminer.1.uieaf6gbdsuh2xve0hhiytj0v:8080",
      "ServiceName":"adminer-ui",
      "ServiceTags":[
         "nuc",
         "int",
         "nuc-int"
      ],
      "Type":"script",
      "Interval":"",
      "Timeout":"",
      "ExposedPort":0,
      "Definition":{

      },
      "CreateIndex":0,
      "ModifyIndex":0
   },

From what I can see, Registrator is not setting up the interval correctly, so I guess this could be the problem. I will try to find the reason; maybe something changed in the API? It's also strange that neither Consul nor Registrator complains, since as far as I know Interval is required when registering an HTTP check.
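One way to spot-check this is to scan the `/v1/agent/checks` response for checks whose Interval came through empty. A sketch using only the Python stdlib (the agent address and helper name are my own; the parsing works on any response body from that endpoint):

```python
import json
from urllib.request import urlopen

def empty_interval_checks(checks: dict) -> list:
    """Return the CheckIDs of registered checks whose Interval field is empty."""
    return [c["CheckID"] for c in checks.values() if not c.get("Interval")]

if __name__ == "__main__":
    # Assumes a Consul agent reachable on localhost:8500.
    with urlopen("http://localhost:8500/v1/agent/checks") as resp:
        print(empty_interval_checks(json.load(resp)))
```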

jkirschner-hashicorp commented 3 years ago

@lettore : the exact format of that "service definition" you sent looks unfamiliar to me (e.g., there's no "Output" field in a service definition). Is that a health check response instead (rather than the definition of the service)? And is it coming from Consul directly, or from Registrator? If from Consul, which HTTP API endpoint?

lettore commented 3 years ago

It's from http://Consul:8500/v1/agent/checks. Here is the API call from Registrator that registers that service, captured with ngrep:

#########
T 10.50.71.6:47702 -> 10.50.0.43:8500 [AP]
  PUT /v1/agent/service/register HTTP/1.0..Host: Consul_consul-nuc:8500..Connection: close..Content-Length: 1078..User-Agent: Go-http-client/1.1..Accept-Encoding: gzip....{"ID":
  "NUC:Adminer_adminer.1.uieaf6gbdsuh2xve0hhiytj0v:8080","Name":"adminer-ui","Tags":["nuc","int","nuc-int"],"Port":8080,"Address":"10.50.70.186","Meta":{"check_deregister_after"
  :"10m","check_initial_status":"passing","check_interval":"120s","check_script":"mychecks.sh --a:*********l --n:adminer-ui --u:https://health.ecasa.one --
  t:600 --g:600 --tz:Europe/Rome --tg:adminer --ch:alerta_minor,repair --opt:hpfe --ip:$SERVICE_IP --port:$SERVICE_PORT --cmd:curl -s -o /dev/null -I -w %{http_code} http://$SER
  VICE_IP:$SERVICE_PORT/","network":"docker-external","repair_cmd":"docker service update --force Adminer_adminer"},"Check":{"ScriptArgs":["mychecks.sh","--a:*************l","--n:adminer-ui","--u:https://********","--t:600","--g:600","--tz:Europe/Rome","--tg:adminer","--ch:alerta_minor,repair","--opt:hpfe","--ip:10.50.70.186
  ","--port:8080","--cmd:curl","-s","-o","/dev/null","-I","-w","%{http_code}","http://10.50.70.186:8080/"],"Interval":"120s","Status":"passing","DeregisterCriticalServiceAfter":
  "10m"},"Checks":null}.                                                                                                                                                         
##
T 10.50.0.43:8500 -> 10.50.71.6:47702 [AP]
  HTTP/1.0 200 OK..Vary: Accept-Encoding..X-Consul-Default-Acl-Policy: allow..Date: Thu, 14 Oct 2021 21:25:45 GMT..Content-Length: 0....                                         
#########

The interval of 120s seems to be passed correctly. (Reformatting the output from above for clarity.)

{
  "ID": "NUC:Adminer_adminer.1.uieaf6gbdsuh2xve0hhiytj0v:8080",
  "Name": "adminer-ui",
  "Tags": [
    "nuc",
    "int",
    "nuc-int"
  ],
  "Port": 8080,
  "Address": "10.50.70.186",
  "Meta": {
    "check_deregister_after": "10m",
    "check_initial_status": "passing",
    "check_interval": "120s",
    "check_script": "mychecks.sh --a:*******************l --n:adminer-ui --u:https://health.ecasa.one --t:600 --g:600 --tz:Europe/Rome --tg:adminer --ch:alerta_minor,repair --opt:hpfe --ip:$SERVICE_IP --port:$SERVICE_PORT --cmd:curl -s -o /dev/null -I -w %{http_code} http://$SERVICE_IP:$SERVICE_PORT/",
    "network": "docker-external",
    "repair_cmd": "docker service update --force Adminer_adminer"
  },
  "Check": {
    "ScriptArgs": [
      "mychecks.sh",
      "--a:*************l",
      "--n:adminer-ui",
      "--u:https://********",
      "--t:600",
      "--g:600",
      "--tz:Europe/Rome",
      "--tg:adminer",
      "--ch:alerta_minor,repair",
      "--opt:hpfe",
      "--ip:10.50.70.186",
      "--port:8080",
      "--cmd:curl",
      "-s",
      "-o",
      "/dev/null",
      "-I",
      "-w",
      "%{http_code}",
      "http://10.50.70.186:8080/"
    ],
    "Interval": "120s",
    "Status": "passing",
    "DeregisterCriticalServiceAfter": "10m"
  },
  "Checks": null
}
lettore commented 2 years ago

I see no activity, so I ran some tests myself. Since Nomad seems to register checks correctly, I captured the messages between Nomad and Consul:

T 10.50.170.157:45252 -> 10.50.0.43:8500 [AP]
  PUT /v1/agent/service/register HTTP/1.0..Host: Consul_consul-nuc:8500..Connection: close..Content-Length: 934..User-Agent: Go-http-client/1.1..Content-Type: application/json
  ..Accept-Encoding: gzip....{"ID":"_nomad-task-ea3b12cc-2dd7-8376-22c3-7139a3b1803d-zoneminder-zoneminder-http-http","Name":"zoneminder-http","Tags":["server","traefik.enable
  =true","traefik.constraint-label=traefik-public","traefik.http.middlewares.https-redirect.redirectscheme.scheme=https","traefik.http.middlewares.https-redirect.redirectschem
  e.permanent=true","traefik.http.routers.zoneminder-http.rule=Host(`zoneminder.****`)","traefik.http.routers.zoneminder-http.entrypoints=http","traefik.http.routers.zone
  minder-http.middlewares=https-redirect","traefik.http.routers.zoneminder-https.rule=Host(`zoneminder.****`)","traefik.http.routers.zoneminder-https.entrypoints=https","
  traefik.http.routers.zoneminder-https.tls=true","traefik.http.routers.zoneminder-https.tls.certresolver=le","traefik.http.services.zoneminder.loadbalancer.server.port=80"],"
  Port":80,"Address":"10.32.3.200","Meta":{"external-source":"nomad"},"Check":null,"Checks":null}.                                                                             
##
T 10.50.0.43:8500 -> 10.50.170.157:45252 [AP]
  HTTP/1.0 200 OK..Vary: Accept-Encoding..X-Consul-Default-Acl-Policy: allow..Date: Fri, 05 Nov 2021 17:22:15 GMT..Content-Length: 0....

T 10.50.170.157:45266 -> 10.50.0.43:8500 [AP]
  PUT /v1/agent/check/register HTTP/1.0..Host: Consul_consul-nuc:8500..Connection: close..Content-Length: 223..User-Agent: Go-http-client/1.1..Content-Type: application/json..
  Accept-Encoding: gzip....{"ID":"_nomad-check-7c42ba6153e529e5208d4eba9f0a6b4f05feaf9d","Name":"alive","ServiceID":"_nomad-task-4559d1b8-a714-85f6-c898-a34f79f6ddb1-tvheadend
  -tvheadend-ui","Interval":"10s","Timeout":"2s","TCP":"192.168.1.100:9981"}.                                                                                                  
##
T 10.50.0.43:8500 -> 10.50.170.157:45266 [AP]
  HTTP/1.0 200 OK..Vary: Accept-Encoding..X-Consul-Default-Acl-Policy: allow..Date: Fri, 05 Nov 2021 17:22:15 GMT..Content-Length: 0.... 

It seems to me that Nomad first registers the service without a check, and then registers only the check for that service. Is that required? Here is the JSON registration of the service

{
   "ID":"_nomad-task-ea3b12cc-2dd7-8376-22c3-7139a3b1803d-zoneminder-zoneminder-http-http",
   "Name":"zoneminder-http",
   "Tags":[
      "server",
      "traefik.enable=true",
      "traefik.constraint-label=traefik-public",
      "traefik.http.middlewares.https-redirect.redirectscheme.scheme=https",
      "traefik.http.middlewares.https-redirect.redirectscheme.permanent=true",
      "traefik.http.routers.zoneminder-http.rule=Host(`zoneminder.****`)",
      "traefik.http.routers.zoneminder-http.entrypoints=http",
      "traefik.http.routers.zoneminder-http.middlewares=https-redirect",
      "traefik.http.routers.zoneminder-https.rule=Host(`zoneminder.****`)",
      "traefik.http.routers.zoneminder-https.entrypoints=https",
      "traefik.http.routers.zoneminder-https.tls=true",
      "traefik.http.routers.zoneminder-https.tls.certresolver=le",
      "traefik.http.services.zoneminder.loadbalancer.server.port=80"
   ],
   "Port":80,
   "Address":"10.32.3.200",
   "Meta":{
      "external-source":"nomad"
   },
   "Check":null,
   "Checks":null
}

and the check

{
   "ID":"_nomad-check-7c42ba6153e529e5208d4eba9f0a6b4f05feaf9d",
   "Name":"alive",
   "ServiceID":"_nomad-task-4559d1b8-a714-85f6-c898-a34f79f6ddb1-tvheadend-tvheadend-ui",
   "Interval":"10s",
   "Timeout":"2s",
   "TCP":"192.168.1.100:9981"
}
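The two-step flow above can be reproduced against the agent HTTP API. A sketch using only the Python stdlib (the agent address and the service/check names are placeholders, not taken from the captures above, and both requests send an explicit JSON content type):

```python
import json
from urllib.request import Request, urlopen

CONSUL = "http://localhost:8500"  # assumed agent address

def put_json(path, payload):
    """Build a PUT request carrying a JSON body with an explicit content type."""
    return Request(CONSUL + path,
                   data=json.dumps(payload).encode(),
                   headers={"Content-Type": "application/json"},
                   method="PUT")

# Step 1: register the service with no inline check, as Nomad does.
svc = put_json("/v1/agent/service/register",
               {"ID": "demo-svc", "Name": "demo", "Port": 80,
                "Check": None, "Checks": None})

# Step 2: register the check separately, tied to the service via ServiceID.
chk = put_json("/v1/agent/check/register",
               {"ID": "demo-check", "Name": "alive", "ServiceID": "demo-svc",
                "Interval": "10s", "Timeout": "2s", "TCP": "127.0.0.1:80"})

# Against a live agent you would then send both:
# urlopen(svc); urlopen(chk)
```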
lettore commented 2 years ago

Today I ran more tests and found that when I register the service with curl using the same JSON data, the check is correctly executed at the specified interval. Registrator uses the Go HTTP library to register the service:

PUT /v1/agent/service/register HTTP/1.0..Host: Consul_consul-nuc:8500..Connection: close..Content-Length: 629..User-Agent: Go-http-client/1.1..Accept-Encoding: gzip....

whereas curl sends:

PUT /v1/agent/service/register HTTP/1.0..Host: Consul_consul-nuc:8500..Connection: close..Content-Length: 246..User-Agent: curl/7.68.0..Accept: */*..Content-Type: application/x-www-form-urlencoded....

In any case, retrieving the check list under /agent/checks still lists the Interval as empty.