hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Nomad alloc_health_watcher marks alloc healthy before it even registered in consul #9175

Open pere3 opened 3 years ago

pere3 commented 3 years ago

Nomad version

Nomad v0.12.3 (2db8abd9620dd41cb7bfe399551ba0f7824b3f61)
Consul v1.0.1

Operating system and Environment details

CentOS Linux release 7.3.1611 (Core)

Issue

The issue is a relatively long delay between the Nomad client starting a new allocation's services during a rolling update and those services actually being synced to the local Consul agent. In some cases this means the rollout finishes with no live instances of the service in Consul at all. As the nomad monitor -log-level=TRACE output shows (core excerpts attached below), alloc_health_watcher fires for the two new allocations on this client and even marks them healthy, while the corresponding consul.sync entries only appear long after that has already happened.

We discussed this issue on hangops Slack with @nickethier and agreed to continue the discussion in an issue thread here: https://hangops.slack.com/archives/C1CAYV38T/p1603271878496000

We also sent an email to nomad-oss-debug@hashicorp.com with the subject "nomad/consul, service registration delay, slack thread" containing a full trace log of a rollout.
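For reference while this is investigated: the knobs that control when the health watcher declares an allocation healthy live in the job's update stanza (health_check, min_healthy_time, healthy_deadline). A more conservative variant of our stanza that we could try as a stopgap might look like the sketch below; the values are illustrative rather than what we run in production, and this only works around the sync delay, it does not fix it.

update {
  max_parallel     = 1
  # "checks" is already the default, stated here explicitly: the watcher
  # should follow Consul check status rather than task state alone
  health_check     = "checks"
  # illustrative value: keep the alloc under observation long enough for
  # consul.sync to actually register the service and its check
  min_healthy_time = "30s"
  # illustrative value: fail the deployment if the check never turns passing
  healthy_deadline = "3m"
}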

Reproduction steps

Run a simple rolling update of the job below.

Job file (before env substitution)

job "microservice-job" {
  datacenters = ["$DATACENTER"]

  type = "service"
  update {
    max_parallel = 1
    min_healthy_time = "10s"
  }

  group "microservice-job-web" {
    count = 3

    restart {
      mode = "delay"
      attempts = 3
      interval = "30s"
      delay = "10s"
    }

    network {
      port "http" {
        to = 80
      }
    }
    service {
      port = "http"
      task = "microservice-job-web"
      name = "microservice-job"
      tags = ["production", "containerized", "csharp"]

      check {
        type = "http"
        path = "/ping/?noresponse"
        interval = "5s"
        timeout = "2s"
        initial_status = "critical"
      }
    }

    task "microservice-job-web" {
      driver = "docker"

      shutdown_delay = "15s"

      env {
        CONSUL_HTTP_ADDR = "${meta.consul_http_addr}"
        HOST_IP = "${attr.unique.network.ip-address}"
        MCS_RUN_ENV = "PRODUCTION"
      }

      config {
        image = "docker.registry.com/microservice-job:$IMAGE_TAG"

        auth {
          username = "$DOCKER_REGISTRY_USERNAME"
          password = "$DOCKER_REGISTRY_PASSWORD"
        }

        cpu_cfs_period = 100000
        cpu_hard_limit = true

        labels = {
          application = "microservice-job"
          runtime = "csharp"
          type = "web"
        }

        args = ["WebApi.dll"]
        work_dir = "/app/web"
        ports = ["http"]
      }

      resources {
        cpu = 3072
        memory = 512
      }
    }
  }
}

Nomad Client logs

$ grep -E 'consul.sync|alloc_health_watcher' new_traces.out
2020-10-26T15:01:56.355+0300 [DEBUG] consul.sync: sync complete: registered_services=78 deregistered_services=0 registered_checks=0 deregistered_checks=0
2020-10-26T15:02:32.922+0300 [DEBUG] consul.sync: sync complete: registered_services=77 deregistered_services=1 registered_checks=0 deregistered_checks=1
2020-10-26T15:02:42.566+0300 [TRACE] client.alloc_runner.runner_hook.alloc_health_watcher: watching: alloc_id=da7072aa-6074-c068-ee05-0f65a3dd8f3f deadline="2020-10-26 15:07:42.566344723 +0300 MSK m=+444288.247218999" checks=true min_healthy_time=10s
2020-10-26T15:02:48.180+0300 [TRACE] client.alloc_runner.runner_hook.alloc_health_watcher: watching: alloc_id=63c0b3e3-e48f-52a9-df7b-f4ed161ee441 deadline="2020-10-26 15:07:48.180950848 +0300 MSK m=+444293.861825126" checks=true min_healthy_time=10s
2020-10-26T15:02:49.239+0300 [DEBUG] consul.sync: sync complete: registered_services=77 deregistered_services=0 registered_checks=0 deregistered_checks=0
2020-10-26T15:02:53.516+0300 [TRACE] client.alloc_runner.runner_hook.alloc_health_watcher: health set: alloc_id=da7072aa-6074-c068-ee05-0f65a3dd8f3f healthy=true
2020-10-26T15:02:58.952+0300 [TRACE] client.alloc_runner.runner_hook.alloc_health_watcher: health set: alloc_id=63c0b3e3-e48f-52a9-df7b-f4ed161ee441 healthy=true
2020-10-26T15:03:04.796+0300 [DEBUG] consul.sync: sync complete: registered_services=78 deregistered_services=0 registered_checks=1 deregistered_checks=0
2020-10-26T15:03:18.985+0300 [DEBUG] consul.sync: sync complete: registered_services=77 deregistered_services=1 registered_checks=0 deregistered_checks=1
2020-10-26T15:03:33.466+0300 [DEBUG] consul.sync: sync complete: registered_services=77 deregistered_services=0 registered_checks=0 deregistered_checks=0
2020-10-26T15:03:48.621+0300 [DEBUG] consul.sync: sync complete: registered_services=78 deregistered_services=0 registered_checks=1 deregistered_checks=0
2020-10-26T15:04:03.614+0300 [DEBUG] consul.sync: sync complete: registered_services=78 deregistered_services=0 registered_checks=0 deregistered_checks=0
2020-10-26T15:04:18.487+0300 [DEBUG] consul.sync: sync complete: registered_services=78 deregistered_services=0 registered_checks=0 deregistered_checks=0
pere3 commented 3 years ago

UP, any updates?

tgross commented 3 years ago

Doing some issue cleanup. The likely cause of this is https://github.com/hashicorp/nomad/issues/3935, but let's keep this open until that work is done.