hashicorp / nomad

consul service registration failed - invalid port / port label not found #9709

Closed MorphBonehunter closed 3 years ago

MorphBonehunter commented 3 years ago

Nomad version

Nomad v1.0.1 (c9c68aa55a7275f22d2338f2df53e67ebfcb9238)

Operating system and Environment details

ArchLinux 5.4.84-1-lts

Issue

Service registration fails every now and then when resubmitting a job (for example, updating the Docker image version).

Reproduction steps

I cannot reliably reproduce this in my environment, as it does not occur on every job update. It seems to have started after I moved the deprecated network stanza from the task level to the group level, and it is not bound to one job; when it appears, it happens on different jobs. It happened to the job below after I made the following change (which doesn't touch the service definition):

Job validation successful
+/- Job: "victoriametrics-job"
+/- Task Group: "victoriametrics-group" (1 in-place update)
  +/- Task: "victoriametrics-task" (forces in-place update)
    +/- Service {
        AddressMode:       "auto"
        EnableTagOverride: "false"
        Name:              "victoriametrics"
        PortLabel:         "http"
        TaskName:          ""
      + Tags {
        + Tags: "prometheus_monitoring"
          Tags: "traefik.enable=true"
          Tags: "traefik.http.routers.victoriametrics.entryPoints=https"
          Tags: "traefik.http.routers.victoriametrics.middlewares=compression@file,secureheaders@file,local-access-only@file"
          Tags: "traefik.http.routers.victoriametrics.rule=Host(`victoriametrics.<redacted>`)"
          Tags: "traefik.http.routers.victoriametrics.tls=true"
          Tags: "traefik.tags=host_${attr.unique.hostname}"
        }
        }

I can fix the problem with a stop/start cycle of the job.
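
For context, the migration mentioned above looks roughly like this (a minimal sketch; the pre-migration task-level stanza is not included in this report, so the deprecated form below is illustrative only):

# Deprecated: network defined inside the task's resources block
task "victoriametrics-task" {
  resources {
    network {
      port "http" {}
    }
  }
}

# Current: network defined at the group level (as in the job file below)
group "victoriametrics-group" {
  network {
    port "http" {
      to = 8428
    }
  }
}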

Job file (if appropriate)

job "victoriametrics-job" {
  region = "home"
  datacenters = ["underverse"]
  type = "service"
  constraint {
    attribute = "${attr.unique.hostname}"
    value     = "evelyn"
  }
  update {
    max_parallel = 1
    health_check = "checks"
    min_healthy_time = "45s"
    healthy_deadline = "300s"
    auto_revert = true
  }
  group "victoriametrics-group" {
    count = 1
    restart {
      attempts = 3
      interval = "10m"
      delay    = "60s"
      mode     = "fail"
    }
    network {
      port "http" {
        to = 8428
      }
    }
    task "victoriametrics-task" {
      driver = "docker"
      config {
        image = "<redacted>/victoriametrics:2020-12-28-11-21"
        args = [
          "-storageDataPath=/victoriametrics",
          "-retentionPeriod=1y"
        ]
        mount {
          type = "volume"
          source = "victoriametrics"
          target = "/victoriametrics"
        }
        ports = [
          "http"
        ]
      }
      resources {
        cpu    = 500  # MHz
        memory = 2048 # MB
      }
      service {
        name = "victoriametrics"
        tags = [
          "prometheus_monitoring",
          "traefik.enable=true",
          "traefik.http.routers.victoriametrics.entryPoints=https",
          "traefik.http.routers.victoriametrics.tls=true",
          "traefik.http.routers.victoriametrics.rule=Host(`victoriametrics.<redacted>`)",
          "traefik.http.routers.victoriametrics.middlewares=compression@file,secureheaders@file,local-access-only@file",
          "traefik.tags=host_${attr.unique.hostname}"
        ]
        port = "http"
        check {
          name     = "victoriametrics-check"
          type     = "http"
          method   = "GET"
          path     = "/health"
          interval = "20s"
          timeout  = "5s"
          check_restart {
            limit = 3
            grace = "10s"
          }
        }
      }
    }
  }
}

Nomad Client logs (if appropriate)

I can see this error in the GUI and in the logs:

Dez 30 11:29:29 evelyn nomad[3146]:     2020-12-30T11:29:29.308+0100 [ERROR] client.alloc_runner.task_runner: update hook failed: alloc_id=68a48255-bbea-7801-7ab0-22f49f05e97a task=victoriametrics-task name=consul_services error="error getting address for check "victoriametrics-check": invalid port "http": port label not found"
apollo13 commented 3 years ago

Can you check whether the errors go away if you also move the service stanza up to the group level?
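
A minimal sketch of that suggestion for this job, assuming the port and check stay the same (a group-level service registers against the group network port rather than a task):

group "victoriametrics-group" {
  network {
    port "http" {
      to = 8428
    }
  }

  # service moved from the task up to the group level
  service {
    name = "victoriametrics"
    port = "http"
    # tags omitted for brevity
    check {
      name     = "victoriametrics-check"
      type     = "http"
      method   = "GET"
      path     = "/health"
      interval = "20s"
      timeout  = "5s"
    }
  }

  task "victoriametrics-task" {
    # task config unchanged
  }
}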

shoenig commented 3 years ago

Hi @MorphBonehunter, I think this job is actually not quite valid - the group.network stanza is trying to map a port using "to", but that can only be used in bridge mode: https://www.nomadproject.io/docs/job-specification/network#to
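
In other words, per the linked docs, the "to" mapping would only be valid with the group network in bridge mode, e.g. (a sketch):

network {
  mode = "bridge"  # "to" port mapping requires bridge networking per the docs
  port "http" {
    to = 8428
  }
}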

MorphBonehunter commented 3 years ago

@shoenig that is the case. As you can see, I use the docker driver without a specific network mode setting, so this defaults to bridge. As the behavior only happens sporadically, I guess this is more a race condition than a configuration problem. Also, the Nomad + Docker part seems to work when this occurs; it's the Consul part that fails.

MorphBonehunter commented 3 years ago

Today I encountered the problem again. I deployed a new job and updated it after a while with a change to the http check path, and I got this again:

Jan  6 16:27:11 argos nomad[766]:     2021-01-06T16:27:11.226+0100 [ERROR] client.alloc_runner.task_runner: update hook failed: alloc_id=86bc96db-adf7-6923-ca5f-8cebb4f3b977 task=assets-task name=consul_services error="error getting address for check "assets-check": invalid port "http": port label not found"

This time it was also followed by some other messages which seem strange:

Jan  6 16:28:32 argos consul[596]:     2021-01-06T16:28:32.204+0100 [ERROR] agent.http: Request error: method=PUT url=/v1/agent/check/register from=127.0.0.1:41498 error="ServiceID "_nomad-task-86bc96db-adf7-6923-ca5f-8cebb4f3b977-assets-task-assets-http" does not exist"
Jan  6 16:28:32 argos nomad[766]:     2021-01-06T16:28:32.205+0100 [WARN]  consul.sync: failed to update services in Consul: error="Unexpected response code: 500 (ServiceID "_nomad-task-86bc96db-adf7-6923-ca5f-8cebb4f3b977-assets-task-assets-http" does not exist)"
Jan  6 16:28:32 argos consul[596]:     2021-01-06T16:28:32.213+0100 [ERROR] agent.http: Request error: method=PUT url=/v1/agent/check/register from=127.0.0.1:41498 error="ServiceID "_nomad-task-86bc96db-adf7-6923-ca5f-8cebb4f3b977-assets-task-assets-http" does not exist"

Which seems clear, as the registration doesn't work. After a while the health timeout hit and the allocation was restarted. This time it was broken completely:

Jan  6 16:28:32 argos nomad[766]:     2021-01-06T16:28:32.730+0100 [ERROR] client.driver_mgr.docker: failed to create container configuration: driver=docker image_name=registry.<redacted>/webserver:2021-01-06-15-54 image_id=sha256:8f1521bf9c9b23a0857dd5aea6fd9749717e0860e3b9c16ca3c8d90f41f58b7f error="Port "http" not found, check network stanza"
Jan  6 16:28:32 argos nomad[766]:     2021-01-06T16:28:32.732+0100 [ERROR] client.alloc_runner.task_runner: running driver failed: alloc_id=86bc96db-adf7-6923-ca5f-8cebb4f3b977 task=assets-task error="Failed to create container configuration for image "registry.<redacted>/webserver:2021-01-06-15-54" ("sha256:8f1521bf9c9b23a0857dd5aea6fd9749717e0860e3b9c16ca3c8d90f41f58b7f"): Port "http" not found, check network stanza"

Restarting the task didn't work, so I had to resubmit the job with a minor change (CPU MHz) to get it working again. But afterwards my log was full of messages like:

Jan 06 21:56:26 argos nomad[766]:     2021-01-06T21:56:26.879+0100 [WARN]  consul.health: watched check not found in Consul: check=assets-check check_id=_nomad-check-2faf5e73bef7ac597aa9ca8a240a13d901442082
Jan 06 21:56:27 argos nomad[766]:     2021-01-06T21:56:27.775+0100 [WARN]  consul.health: watched check not found in Consul: check=assets-check check_id=_nomad-check-2faf5e73bef7ac597aa9ca8a240a13d901442082
Jan 06 21:56:28 argos nomad[766]:     2021-01-06T21:56:28.676+0100 [WARN]  consul.health: watched check not found in Consul: check=assets-check check_id=_nomad-check-2faf5e73bef7ac597aa9ca8a240a13d901442082
Jan 06 21:56:29 argos nomad[766]:     2021-01-06T21:56:29.576+0100 [WARN]  consul.health: watched check not found in Consul: check=assets-check check_id=_nomad-check-2faf5e73bef7ac597aa9ca8a240a13d901442082
Jan 06 21:56:30 argos nomad[766]:     2021-01-06T21:56:30.476+0100 [WARN]  consul.health: watched check not found in Consul: check=assets-check check_id=_nomad-check-2faf5e73bef7ac597aa9ca8a240a13d901442082

Even after a node drain this was spammed, so I had to restart the drained node's Nomad agent. After the restart I found the following in the log (beware, the node was still drained and had 0 allocs):

Jan 06 21:56:57 argos nomad[263003]: ==> Nomad agent started! Log data will stream in below:
Jan 06 21:56:58 argos nomad[263003]:     2021-01-06T21:56:58.842+0100 [ERROR] client.alloc_runner.task_runner: poststart failed: alloc_id=86bc96db-adf7-6923-ca5f-8cebb4f3b977 task=assets-task error="1 error occurred:
Jan 06 21:56:58 argos nomad[263003]:         * poststart hook "consul_services" failed: unable to get address for service "assets": invalid port "http": port label not found
Jan 06 21:56:58 argos nomad[263003]: "
drewbailey commented 3 years ago

@MorphBonehunter I believe this issue is fixed and covered by https://github.com/hashicorp/nomad/pull/9736

You should be able to reliably reproduce this by causing an in-place update to occur (updating a service stanza value). A destructive update, like updating an environment variable, will create a new allocation and avoid the issue.

#9736 prevents in-place updates from nullifying the task's ports, which eventually causes the error logs you have shared.
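
Until that fix lands in a release, one workaround consistent with the above is to force a destructive update instead of an in-place one, for example by changing an environment variable (a sketch; the variable name is arbitrary):

task "victoriametrics-task" {
  env {
    # changing this value forces a new allocation instead of an in-place update
    DEPLOY_BUMP = "2021-01-07"
  }
  # rest of the task unchanged
}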

github-actions[bot] commented 2 years ago

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.