Open aft2d opened 2 weeks ago
I can confirm that too. With or without the option -deregister-after-critical 10s, the service never deregisters.
My workaround was to create a post-stop container that deregisters the service. To accomplish this, you have to add -proxy-id to the prestart container that registers the service and set it to the alloc ID. The post-stop container can then deregister the service by that same ID.
That's what I did:
job "api-gateway" {
  ...
    task "prestart" {
      driver = "docker"
      config {
        image   = "docker.io/hashicorp/consul:1.20.1"
        command = "/bin/sh"
        args = [
          "-c",
          "consul connect envoy -proxy-id ${NOMAD_ALLOC_ID} -gateway api -register -service ${NOMAD_JOB_NAME} -admin-bind 0.0.0.0:19000 -ignore-envoy-compatibility -bootstrap > ${NOMAD_ALLOC_DIR}/envoy_bootstrap.json"
        ]
      }
      lifecycle {
        hook    = "prestart"
        sidecar = false
      }
      identity {
        name = "consul_default"
        aud  = ["consul.io"]
        ttl  = "1h"
      }
    }
    ...
    task "poststop" {
      driver = "docker"
      config {
        image   = "docker.io/hashicorp/consul:1.20.1"
        command = "/bin/sh"
        args = [
          "-c",
          "consul services deregister -id ${NOMAD_ALLOC_ID} ; exit 0"
        ]
      }
      lifecycle {
        hook    = "poststop"
        sidecar = false
      }
      identity {
        name = "consul_default"
        aud  = ["consul.io"]
        ttl  = "1h"
      }
    }
  }
}
@aft2d Thanks for the share. I can confirm that it's working with the poststop.
The Consul service for the API gateway is registered via the setup task and not via Nomad itself: https://github.com/hashicorp-guides/consul-api-gateway-on-nomad/blob/578de58653b4557bb50cc0bba3ba5b8fdf47ab70/api-gateway.nomad.hcl#L51
So Nomad doesn't take care of de-registration if the allocation gets stopped and moved to another node.
While the command used has the -deregister-after-critical flag set, it has no effect, since the registered service has no checks and is therefore always considered healthy. The result is many leftover service instances not tied to any running alloc and, in turn, downstream services being unable to reach the gateway.
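For contrast, here is a rough sketch of a generic Consul service definition that does carry a check, which is the situation where a deregister-after-critical timeout can actually fire. This is not the registration that consul connect envoy -register produces and it is not a tested replacement for it. The service name, ID, port, and the choice of a TCP check against the Envoy admin port are all assumptions for illustration.

```hcl
# Hypothetical service definition; names, ID, and port are placeholders.
service {
  id   = "example-gateway-1"
  name = "example-gateway"
  port = 8443

  check = {
    id   = "example-gateway-admin"
    name = "Envoy admin port reachable"

    # Assumes the Consul agent can reach the admin listener
    # (the job above binds it with -admin-bind 0.0.0.0:19000).
    tcp      = "127.0.0.1:19000"
    interval = "10s"
    timeout  = "2s"

    # Only takes effect once this check has been critical for the given time.
    # As far as I know, Consul enforces a minimum of 1 minute here,
    # so a value like 10s would not be honored as written.
    deregister_critical_service_after = "1m"
  }
}
```

With a check like this, a stopped allocation would make the check go critical and Consul could eventually reap the instance. Without any check there is nothing that can ever become critical, which matches the behavior described above.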