Consul service is never deregistered

aft2d commented 2 weeks ago

The Consul service for the API gateway is registered via the setup task and not via Nomad itself: https://github.com/hashicorp-guides/consul-api-gateway-on-nomad/blob/578de58653b4557bb50cc0bba3ba5b8fdf47ab70/api-gateway.nomad.hcl#L51

So Nomad doesn't handle deregistration when the allocation is stopped and moved to another node.
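For contrast, here is a minimal sketch of a service registered by Nomad itself through a `service` block. Nomad removes such a registration from Consul when the allocation stops. The job, group, task, and service names and the image are hypothetical, and this is only an illustration of the mechanism, not a drop-in replacement for the gateway registration done by the setup task.

```hcl
# Hypothetical job: all names below are placeholders for illustration.
job "example" {
  group "web" {
    network {
      # Map a dynamic host port to the container's listening port.
      port "http" {
        to = 5678
      }
    }

    # Registered in Consul by Nomad, and deregistered by Nomad
    # when the allocation is stopped or rescheduled.
    service {
      name     = "example-web"
      port     = "http"
      provider = "consul"

      check {
        type     = "tcp"
        interval = "10s"
        timeout  = "2s"
      }
    }

    task "server" {
      driver = "docker"

      config {
        image = "hashicorp/http-echo"
        args  = ["-text", "hello"]
        ports = ["http"]
      }
    }
  }
}
```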

While the command used sets the -deregister-after-critical flag, the flag has no effect: the registered service has no checks, so it is always considered healthy and never becomes critical.
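To illustrate why a check matters here: in a Consul service definition, `deregister_critical_service_after` is a setting on a check, so it can only fire once that check has been critical for the given time. The sketch below is a generic service definition with hypothetical names, address, and port. It is not the registration that `consul connect envoy -register` produces.

```hcl
# Hypothetical Consul service definition, for illustration only.
service {
  name = "example-service"
  id   = "example-service-1"
  port = 8080

  check {
    name     = "example-service listening"
    tcp      = "127.0.0.1:8080"
    interval = "10s"
    timeout  = "2s"

    # Remove the service once this check has been critical for this long.
    # Consul documents a minimum of 1m for this timeout.
    deregister_critical_service_after = "1m"
  }
}
```

Without a `check` block, nothing can mark the instance as critical, which matches the behavior described above.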

The result is many leftover service instances that are not tied to any running allocation, which in turn leaves downstream services unable to reach the gateway.

Lord-Y commented 1 day ago

I can confirm this too. With or without the option -deregister-after-critical 10s, the service never deregisters.

aft2d commented 1 day ago

My workaround was to create a post-stop container that deregisters the service. To accomplish this, you have to add -proxy-id to the prestart container that registers the service and set it to the allocation ID. In the post-stop container, you can then deregister the service by that same ID.

This is what I did:


job "api-gateway" {
...
    task "prestart" {
      driver = "docker"

      config {
        image   = "docker.io/hashicorp/consul:1.20.1"
        command = "/bin/sh"

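        # Register the gateway with the allocation ID as its service ID,
        # so the poststop task can deregister exactly this instance.
        args = [
          "-c",
          "consul connect envoy -proxy-id ${NOMAD_ALLOC_ID} -gateway api -register -service ${NOMAD_JOB_NAME} -admin-bind 0.0.0.0:19000 -ignore-envoy-compatibility -bootstrap > ${NOMAD_ALLOC_DIR}/envoy_bootstrap.json"
        ]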
      }

      lifecycle {
        hook    = "prestart"
        sidecar = false
      }

      identity {
        name = "consul_default"
        aud  = ["consul.io"]
        ttl  = "1h"
      }
    }

...
    task "poststop" {
      driver = "docker"

      config {
        image   = "docker.io/hashicorp/consul:1.20.1"
        command = "/bin/sh"

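        # Deregister the instance registered by the prestart task.
        # "exit 0" keeps a failed cleanup from failing the task.
        args = [
          "-c",
          "consul services deregister -id ${NOMAD_ALLOC_ID} ; exit 0"
        ]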
      }

      lifecycle {
        hook    = "poststop"
        sidecar = false
      }

      identity {
        name = "consul_default"
        aud  = ["consul.io"]
        ttl  = "1h"
      }
    }
  }
}

Lord-Y commented 1 day ago

@aft2d Thanks for sharing. I can confirm that it works with the poststop task.

hashicorp-guides / consul-api-gateway-on-nomad

Consul service is never deregistered #13