hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Nomad Service Discovery unable to find service #16983

Open vincenthuynh opened 1 year ago

vincenthuynh commented 1 year ago

Nomad version

Nomad v1.4.7

Operating system and Environment details

Debian 4.19.260-1 (2022-09-29) x86_64 GNU/Linux

Issue

An allocation is unable to find a Nomad service even though the service exists. This seems to start happening on a client after an uptime of 2-3 days.

Reproduction steps

Able to list service:

$ nomad service list -namespace="*"
Service Name  Namespace  Tags
myservice     default    []

Expected Result

Able to discover a service consistently

Actual Result

Task log:

Template | Missing: nomad.service(myservice)

Job file (if appropriate)

Task 1:

    service {
      provider = "nomad"
      name     = "myservice"
      port     = "redis"
    }

Task 2:

      template {
        data = <<EOH
{{range nomadService "myservice"}}
spring.redis.host: {{ .Address }}
spring.redis.port: {{ .Port }}
{{end}}
EOH
        destination = "local/config/application.yml"
      }

Nomad Client logs

2023-04-25T16:10:02.354Z [WARN]  agent: (view) nomad.service(myservice): Get "http://127.0.0.1:4646/v1/service/myservice?namespace=default&stale=&wait=60000ms": closed (retry attempt 5 after "4s")
2023-04-25T16:10:06.355Z [WARN]  agent: (view) nomad.service(myservice): Get "http://127.0.0.1:4646/v1/service/myservice?namespace=default&stale=&wait=60000ms": closed (retry attempt 6 after "8s")
2023-04-25T16:10:14.356Z [WARN]  agent: (view) nomad.service(myservice): Get "http://127.0.0.1:4646/v1/service/myservice?namespace=default&stale=&wait=60000ms": closed (retry attempt 7 after "16s")
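
Hitting the same endpoint the template runner is polling can help show whether the local agent API is wedged; a quick manual check using the URL from the log above:

```sh
# Query the local agent directly for the service registration
curl -s "http://127.0.0.1:4646/v1/service/myservice?namespace=default"
```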
vincenthuynh commented 1 year ago

The workaround is to restart the Nomad service/agent on the client node.
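
On a systemd-managed client that comes down to something like this (a sketch; it assumes the agent runs under a unit named nomad):

```sh
# Restart the Nomad client agent so it re-registers its services
sudo systemctl restart nomad
```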

shoenig commented 1 year ago

Hi @vincenthuynh, so far I haven't been able to reproduce what you're seeing - in my case the template is always successfully rendered once the upstream task is started and its service is registered. Before I dig in further, could you post a complete job file you're using that experiences the issue? I want to make sure we're not missing something (e.g. using group vs. task services, etc.)
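
For reference, the group vs. task distinction looks roughly like this (a minimal sketch with placeholder names, not the job from this issue; both placements can register a provider = "nomad" service):

```hcl
job "example" {
  group "example" {
    network {
      port "http" {}
    }

    # Group-level service: registered for the whole group against the
    # group network's port label.
    service {
      provider = "nomad"
      name     = "myservice-group"
      port     = "http"
    }

    task "app" {
      driver = "raw_exec"

      config {
        command = "python3"
        args    = ["-m", "http.server", "8080"]
      }

      # Task-level service: registered by this task specifically; it can
      # still reference the group network's port label.
      service {
        provider = "nomad"
        name     = "myservice-task"
        port     = "http"
      }
    }
  }
}
```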

the test job file i've been using

bug.hcl

```hcl
job "bug" {
  group "group" {
    network {
      port "http" {
        to = 8080
      }
    }

    task "python" {
      driver = "raw_exec"

      config {
        command = "python3"
        args    = ["-m", "http.server", "8080"]
      }

      service {
        provider = "nomad"
        name     = "python"
        port     = "http"
      }

      resources {
        cpu    = 10
        memory = 32
      }
    }

    task "client" {
      driver = "raw_exec"

      template {
        data = <
```
vincenthuynh commented 1 year ago

Hi @shoenig,

We've noticed that it takes a few days (2-3 days) before it starts happening.

Here's another reproduction:

  • An old allocation was stopped and a new one was created, and it happened to be on the same node.
  • It's unable to find the service.
  • Applying the workaround: Simply restarting the nomad service on the client allows the task to discover the service again and start successfully.

Here's our job file:

myservice.hcl

```hcl
job "myservice" {
  group "myservice" {
    network {
      mode = "bridge"
    }

    service {
      name = "myservice"
      port = "8080"
      tags = ["env=${var.env}", "version=${var.version}"]

      connect {
        sidecar_service {}
      }
    }

    task "myservice" {
      driver = "docker"
      leader = true

      config {
        image    = "gcr.io/myservice"
        work_dir = "/local"
      }

      vault {
        policies    = ["myservice"]
        change_mode = "noop"
      }

      template {
        data = <
```

Hope that helps. Thanks!

gulducat commented 1 year ago

I encountered a similar issue caused by having NOMAD_ADDR set in the environment that the nomad agent was run in. That variable apparently got passed through to the Nomad API client that consul-template uses, and caused it to fail its API calls (in my case, for HTTP vs. HTTPS reasons) for the nomadService lookup.

My errors happened very consistently, so I think it's different from this case, but I wanted to mention it here for anyone else who finds this issue like I did. My solution was to ensure NOMAD_ADDR is not set in my nomad agent's environment.
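
One way to rule that out on a systemd-managed client is a drop-in that clears the variable for the agent process; a sketch, with an illustrative drop-in path and unit name:

```ini
# /etc/systemd/system/nomad.service.d/10-clear-nomad-addr.conf
# Keep the agent (and its embedded template runner) from inheriting
# NOMAD_ADDR from the surrounding environment.
[Service]
UnsetEnvironment=NOMAD_ADDR
```

followed by a systemctl daemon-reload and an agent restart.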

IamTheFij commented 1 year ago

This is happening occasionally to me as well (Nomad 1.5.3). It doesn't seem to be consistent as to which service or which host the service disappears from.

To add another odd detail rather than just bumping: the service shows up in the UI; however, it does not show up on any of the nodes via the CLI.
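
The CLI side of that check is something like this (a sketch; myservice stands in for whichever service disappeared):

```sh
# List Nomad-native services across all namespaces
nomad service list -namespace="*"

# Show the registrations (allocation, node, address, port) behind one service
nomad service info myservice
```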


Restarting the allocation seems to resolve the issue and force Nomad to re-register the service.
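
In CLI terms that is roughly the following (the allocation ID is a placeholder):

```sh
# Restart the affected allocation so its service gets registered again
nomad alloc restart 8a3b2c1d
```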

Unfortunately, this time it was my log aggregator that disappeared, so I don't have an easy way to pull logs from around the time of the issue. I'll try to grab them the next time it happens to a different service.

tfritzenwallner-private commented 9 months ago

This issue still consistently happens for us every 2-3 days. I can observe exactly the same as @IamTheFij; however, we run Nomad 1.6.3.

> This is happening occasionally to me as well (Nomad 1.5.3). It doesn't seem to be consistent as to which service or which host the service disappears from.
>
> To add another odd detail rather than just bumping: the service shows up in the UI; however, it does not show up on any of the nodes via the CLI.

mikedvinci90 commented 4 months ago

I have observed the same issue on Nomad 1.7.3 through 1.7.7.

benbourner commented 2 months ago

I observe the same problem in v1.8.1.

dmclf commented 2 months ago

Seeing the same problem: services are clearly visible in the Nomad UI, but cannot be used by templating.

Nomad 1.7.7 (multi-region, multi-dc and ACL enabled)

(Consul-based service templating works fine and reliably, as opposed to Nomad-based services.)
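
For context, the only difference on the template side is the lookup function; a minimal sketch with a placeholder service name:

```
{{/* Consul catalog lookup - the variant that works reliably here */}}
{{ range service "myservice" }}
host: {{ .Address }}:{{ .Port }}
{{ end }}

{{/* Nomad-native service lookup - the variant that intermittently comes back empty */}}
{{ range nomadService "myservice" }}
host: {{ .Address }}:{{ .Port }}
{{ end }}
```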

benbourner commented 2 months ago

Yet another example in Nomad 1.8.1; it's just happening randomly among my services. Because I have Traefik parsing the Nomad services, they just disappear from Traefik and are thus inaccessible. After running rock-solid for years, the Nomad deployments are now just unreliable... :(

Services are up, healthy, and reachable on the given ports...

But the service allocations have again disappeared, so Traefik no longer sees them and I can't access them via their proper URLs...

dannyhpy commented 2 months ago

I've seen this occurring frequently under poor network conditions where

  1. the client misses a heartbeat to the server
  2. the server unregisters the services managed by this client in response
  3. the client responds to the next heartbeat
  4. but the server does not seem to register the services back

I don't know if this is intended and if it is the same issue people are having here.

As a workaround, I was restarting the Nomad agent on the client every 20 mins. (I didn't need HA)
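
That stopgap can be as blunt as a cron entry; a sketch, assuming a systemd unit named nomad and an /etc/cron.d style crontab:

```
# /etc/cron.d/nomad-restart (illustrative path)
# Restart the Nomad client agent every 20 minutes so its services get re-registered
*/20 * * * * root /usr/bin/systemctl restart nomad
```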