hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

health check fails during grace period #9431

Open kneufeld opened 3 years ago

kneufeld commented 3 years ago

Nomad version

Nomad v1.0.0-beta3 (fcb32ef7ba037b2b83e7a0fd1c53c7410a2990db)

Operating system and Environment details

Ubuntu 18.04.5 LTS

Issue

During startup the task is occasionally killed for being unhealthy despite:

  1. being in the check_restart { grace = "30s" } time
  2. being in migrate { healthy_deadline = "10m" } time

Reproduction steps

Only happens occasionally

Job file (if appropriate)

Complicated job with 5 tasks, 1 job per task, connect proxies between them. Can email if desired.
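
For readers without the emailed jobspec, here is a minimal, hypothetical sketch of where the two settings described above live in a jobspec; the job/group/task names, driver, image, and check type are placeholders and not taken from the real job:

job "example" {
  datacenters = ["dc1"]

  group "example" {
    # Allocations migrated off a draining node get 10 minutes to become healthy.
    migrate {
      healthy_deadline = "10m"
    }

    network {
      port "http" {}
    }

    task "example" {
      driver = "docker"

      config {
        image = "example/image:latest"   # placeholder image
        ports = ["http"]
      }

      service {
        name = "example"
        port = "http"

        check {
          type     = "tcp"
          interval = "10s"
          timeout  = "2s"

          # Check results should not count toward a restart for 30s after the task starts.
          check_restart {
            grace = "30s"
          }
        }
      }
    }
  }
}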

Nomad Client logs (if appropriate)

From hashi-ui

| WHEN | DURATION | TYPE | MESSAGE | SIGNAL | CODE |
| --- | --- | --- | --- | --- | --- |
| 1 minute | 0 seconds | Received | Task received by client | | 0 |
| 37 seconds | 35 seconds | Task Setup | Building Task Directory | | 0 |
| 37 seconds | 0 seconds | Downloading Artifacts | Client is downloading artifacts | | 0 |
| 34 seconds | 3 seconds | Driver | Downloading image | | 0 |
| 26 seconds | 8 seconds | Started | Task started by client | | 0 |
| 4 seconds | 21 seconds | Restart Signaled | healthcheck: check "director" unhealthy | | 0 |
| 0 seconds | 6 seconds | Terminated | Exit Code: 137, Exit Message: "Docker container exited with non-zero exit code: 137" | | 137 |
| 0 seconds | 0 seconds | Restarting | Task restarting in 10.989924836s | | 0 |
| 1 second | 13 seconds | Started | Task started by client | | |

Nomad Server logs (if appropriate)

Can't find anything useful. Hashi-ui's relative timestamps vs journald's absolute timestamps make correlating logs annoying. The events above are actually from Friday (3 days ago).

tgross commented 3 years ago

Hi @kneufeld! Is your check_restart.limit set to 1? From the documentation, it looks like checks shouldn't count toward a restart at all until the grace period has passed. Can you provide the jobspec?
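
For context, a check_restart block with limit = 1 looks like the following (illustrative values); with limit = 1, a single check result that fails outside the grace window is enough to trigger a restart:

check_restart {
  limit = 1       # restart the task after a single failing check result
  grace = "30s"   # ignore check results for 30s after the task starts or restarts
}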

kneufeld commented 3 years ago

Is your check_restart.limit set to 1?

@tgross As a matter of fact it is...

I'll email the jobspec.

tgross commented 3 years ago

Just wanted to circle back on this @kneufeld: I've been trying to reproduce the problem by building up a jobspec incrementally with all the features of the one you've provided, and I haven't yet hit on the source of the issue. I see you also have #9307 open, so hopefully folks in that thread will have more data for you.

kneufeld commented 3 years ago

Thank you for taking a look. This isn't a major issue, just something I felt should be documented.

#9307, on the other hand, is a major problem for us and will likely have us moving to Kubernetes.

spaulg commented 3 years ago

I'm seeing something similar with one of my jobs. I run a MySQL container, and MySQL goes through a temporary startup/shutdown cycle before starting properly. Based on the docs, the grace parameter in the check_restart stanza should keep failing health checks from counting during that window, but that's not what's happening: I'm seeing failed health check notices in Consul's log output even though the 60s grace period hasn't passed.

Nomad version: v1.0.4 (9294f35f9aa8dbb4acb6e85fa88e3e2534a3e41a)
Consul version: v1.9.4 (10bb6cb3b)

Here's a timeline assembled from the different data sources:

| Date/Time | Nomad | Consul | MySQL |
| --- | --- | --- | --- |
| 14:28:14 | Task received by client | | |
| 14:28:14 | Building task directory | | |
| 14:28:14 | Downloading image | | |
| 14:28:54 | Task started by client | | |
| 14:28:54 | | | [temporary] init server in progress |
| 14:29:00 | | Health check failed | |
| 14:29:03 | | | [temporary] ready for connections |
| 14:29:07 | | | [temporary] shutdown started |
| 14:29:09 | | | [temporary] shutdown complete |
| 14:29:10 | | Health check failed | |
| 14:29:10 | | | init server in progress |
| 14:29:11 | | | ready for connections |

job "mysql-server" {
  datacenters = ["dc1"]
  type        = "service"

  group "mysql-server" {
    count = 1

    network {
      port "db" {
        static = 3306
      }
    }

    task "mysql-server" {
      driver = "docker"

      config {
        image = "mysql:8.0.22"
        ports = ["db"]

        # Add the SYS_NICE capability to stop MySQL from logging a warning every minute
        cap_add = ["SYS_NICE"]
      }

      env {
        MYSQL_ROOT_PASSWORD = "password"
      }

      resources {
        cpu    = 100
        memory = 1000
      }

      service {
        name = "mysql-server"
        port = "db"

        check {
          type     = "tcp"
          interval = "10s"
          timeout  = "2s"

          check_restart {
            grace = "60s"
          }
        }
      }
    }
  }
}
pznamensky commented 3 years ago

I can confirm that in Nomad v1.0.6 (592cd4565bf726408a03482da2c9fd8a3a1015cf) updating a task leads to restarts because of a slow start. Our check_restart.grace is longer than the application needs to become ready, but we see the same problem and a similar timeline to what @spaulg described.

maxramqvist commented 3 years ago

We can confirm the same issue on Nomad v1.1.4.

maxramqvist commented 1 year ago

The same issue is still there with 1.5.1. The grace period is not honored; the task is marked as unhealthy just a few seconds after Nomad reports "task started".

Anything we can do to help with this?

Job spec:

{
  "ID": "application",
  "Name": "application",
  "Type": "service",
  "Datacenters": [
    "dev"
  ],
  "Meta": {
    "timestamp": "1681980389",
    "version": "cb84602"
  },
  "Constraints": [
    {
      "LTarget": "${node.class}",
      "RTarget": "build",
      "Operand": "!="
    }
  ],
  "Update": {
    "Stagger": 10000000000,
    "MaxParallel": 1,
    "HealthCheck": "checks",
    "MinHealthyTime": 25000000000,
    "HealthyDeadline": 100000000000,
    "AutoRevert": true,
    "AutoPromote": true,
    "Canary": 1
  },
  "TaskGroups": [
    {
      "Name": "application",
      "Count": 1,
      "Networks": [
        {
          "Mode": "bridge",
          "DynamicPorts": [
            {
              "Label": "healthcheck",
              "Value": 0,
              "To": -1,
              "HostNetwork": "default"
            },
            {
              "Label": "envoy-metrics",
              "Value": 0,
              "To": 9102,
              "HostNetwork": "default"
            }
          ]
        }
      ],
      "Services": [
        {
          "Name": "application",
          "Port": 80,
          "PortLabel": "80",
          "AddressMode": "auto",
          "Tags": [
            "namespace=default",
            "service=application",
            "awesomejobid=e27853c7-8a3b-4332-9efc-8aa835775c93",
            "external",
            "mesh",
            "prometheus_metrics_path=/metrics",
            "prometheus_metrics_address=${attr.unique.network.ip-address}:${NOMAD_PORT_metrics}",
            "prometheus_metrics=false"
          ],
          "Checks": [
            {
              "Type": "http",
              "Protocol": "http",
              "Port": null,
              "Path": "/api/health",
              "AddressMode": "",
              "Interval": 10000000000,
              "Timeout": 2000000000,
              "CheckRestart": {
                "Limit": 2,
                "Grace": 30000000000,
                "IgnoreWarnings": true
              },
              "PortLabel": "healthcheck",
              "Expose": true
            }
          ],
          "Connect": {
            "Native": false,
            "SidecarService": {
              "Tags": [
                "mesh",
                "ingress",
                "prometheus_metrics_address=${NOMAD_HOST_ADDR_envoy-metrics}",
                "prometheus_metrics_path=/metrics"
              ],
              "Proxy": {
                "ExposeConfig": {
                  "Path": [
                    {
                      "Path": "/api/health",
                      "Protocol": "http",
                      "LocalPathPort": 80,
                      "ListenerPort": "healthcheck"
                    }
                  ]
                },
                "Upstreams": [
                  {
                    "DestinationName": "backend-app",
                    "LocalBindPort": 13370
                  }
                ],
                "Config": {
                  "envoy_prometheus_bind_addr": "0.0.0.0:9102",
                  "local_request_timeout_ms": 0
                }
              }
            },
            "SidecarTask": {
              "ShutdownDelay": 10000000000,
              "KillTimeout": 10000000000,
              "Config": {
                "labels": [
                  {
                    "environment": "dev",
                    "namespace": "default",
                    "service": "application-connect-proxy"
                  }
                ]
              },
              "Resources": {
                "CPU": 50,
                "MemoryMB": 96
              }
            }
          }
        }
      ],
      "Tasks": [
        {
          "Name": "application",
          "Vault": null,
          "Driver": "docker",
          "Config": {
            "network_mode": null,
            "image": "the-image:123abc",
            "volumes": [
              "/usr/local/share/ca-certificates/:/certs/"
            ],
            "logging": [],
            "labels": []
          },
          "Services": null,
          "Env": {
            "NAMESPACE": "default"
          },
          "Templates": [],
          "Resources": {
            "CPU": 100,
            "MemoryMB": 256
          },
          "RestartPolicy": {
            "Attempts": 10,
            "Interval": 1800000000000,
            "Delay": 15000000000,
            "Mode": "fail"
          },
          "Artifacts": [],
          "KillTimeout": 10000000000,
          "ShutdownDelay": 10000000000
        }
      ],
      "EphemeralDisk": {
        "SizeMB": 300
      }
    }
  ]
}
jorgemarey commented 1 year ago

Hi, I noticed the same thing. I was expecting Nomad to wait for the grace period before restarting the allocation, but it didn't.

This happened not on the first start but after the application had been running for some time: it failed, and while it was restarting Nomad waited less than grace + (check interval * limit) before restarting it again.

Looking at the code and the logs, I noticed one thing that I don't know whether it's intentional (if so, I didn't see it documented) or a bug: in the case of a restart after the first boot, the grace interval includes the restart delay. So if the restart delay is 25s and the check_restart grace is 60s, the application gets a useful grace of only 35 seconds, and then the check_restart limits apply (check interval * (limit - 1)).
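
A sketch of the interaction described above, using the 25s delay and 60s grace from this comment (task-level fragments with placeholder port and a hypothetical limit; the arithmetic reflects the observed behavior, not documented semantics):

restart {
  delay = "25s"   # Nomad waits 25s before restarting the failed task
  mode  = "fail"
}

service {
  port = "http"   # placeholder port label

  check {
    type     = "tcp"
    interval = "10s"
    timeout  = "2s"

    check_restart {
      limit = 3      # hypothetical limit
      grace = "60s"  # intended: ignore check results for 60s after the task (re)starts
    }
  }
}

In other words, because the 60s grace starts counting before the 25s restart delay has elapsed, the task only gets roughly 60s - 25s = 35s of useful grace once it is running again, after which interval * (limit - 1) applies.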

mikenomitch commented 1 year ago

Hey @jorgemarey, thanks for the sleuthing regarding the restart delay and grace. I think we'll probably want to change the behavior, but at the very least document it clearly if not. Added to our todo list, but not in the immediate todo column.

It will realistically be at least a month before we could take this up, so if anybody wants to take a crack at making sure the grace period accounts for the delay, let us know, we'd accept a PR!

jorgemarey commented 1 year ago

Hi @mikenomitch, I'll try to make a PR for this if I find the time.

jorgemarey commented 9 months ago

Hi @mikenomitch, I was looking into this again. I saw that this only happens when the check is defined at the group level (if it's defined at the task level the grace is applied correctly). I think the problem is that the group service hook is executed in the prerun stage, so the workload is registered with the checkRestarter before the restart has completed.
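
To make the distinction concrete, here is a sketch of the two placements (names, image, and port are placeholders); the comments restate the behavior described in this thread, not documented semantics:

group "example" {
  network {
    port "http" {}
  }

  # Group-level service: registered by the group service hook during prerun,
  # i.e. before the restart has completed, which is where the grace/restart-delay
  # overlap reported above shows up.
  service {
    name = "example-group"
    port = "http"

    check {
      type     = "tcp"
      interval = "10s"
      timeout  = "2s"

      check_restart {
        grace = "60s"
      }
    }
  }

  task "example" {
    driver = "docker"

    config {
      image = "example/image:latest"  # placeholder image
      ports = ["http"]
    }

    # Task-level service: registered by the task service hook during poststart,
    # so the grace starts once the task is actually running.
    service {
      name = "example-task"
      port = "http"

      check {
        type     = "tcp"
        interval = "10s"
        timeout  = "2s"

        check_restart {
          grace = "60s"
        }
      }
    }
  }
}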

Could the group_service_hook's RegisterWorkload call be moved to the poststart phase, as it is in the service_hook? I don't know whether that conflicts with anything else, but it should solve this issue, or at least the part I'm referring to: that check_restart.grace includes restart.delay.