hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Nomad Service Discovery unable to find service #16983

Open vincenthuynh opened 1 year ago

vincenthuynh commented 1 year ago

Nomad version

Nomad v1.4.7

Operating system and Environment details

Debian 4.19.260-1 (2022-09-29) x86_64 GNU/Linux

Issue

An allocation is unable to find a Nomad service even though the service exists. This seems to start happening on a client after an uptime of 2-3 days.

Reproduction steps

Able to list service:

$ nomad service list -namespace="*"
Service Name  Namespace  Tags
myservice     default    []

Expected Result

Able to discover a service consistently

Actual Result

Task log:

Template | Missing: nomad.service(myservice)

Job file (if appropriate)

Task 1:

    service {
      provider = "nomad"
      name     = "myservice"
      port     = "redis"
    }

Task 2:

      template {
        data = <<EOH
{{range nomadService "myservice"}}
spring.redis.host: {{ .Address }}
spring.redis.port: {{ .Port }}
{{end}}
EOH
        destination = "local/config/application.yml"
      }

Nomad Client logs

2023-04-25T16:10:02.354Z [WARN]  agent: (view) nomad.service(myservice): Get "http://127.0.0.1:4646/v1/service/myservice?namespace=default&stale=&wait=60000ms": closed (retry attempt 5 after "4s")
2023-04-25T16:10:06.355Z [WARN]  agent: (view) nomad.service(myservice): Get "http://127.0.0.1:4646/v1/service/myservice?namespace=default&stale=&wait=60000ms": closed (retry attempt 6 after "8s")
2023-04-25T16:10:14.356Z [WARN]  agent: (view) nomad.service(myservice): Get "http://127.0.0.1:4646/v1/service/myservice?namespace=default&stale=&wait=60000ms": closed (retry attempt 7 after "16s")
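
Hitting the same endpoint the template runner is polling can help show whether the local agent API is wedged; a quick manual check using the URL from the log above:

```sh
# Query the local agent directly for the service registration
curl -s "http://127.0.0.1:4646/v1/service/myservice?namespace=default"
```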
vincenthuynh commented 1 year ago

The workaround is to restart the Nomad service/agent on the client node.
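
On a systemd-managed client that comes down to something like this (a sketch; it assumes the agent runs under a unit named nomad):

```sh
# Restart the Nomad client agent so it re-registers its services
sudo systemctl restart nomad
```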

shoenig commented 1 year ago

Hi @vincenthuynh, so far I haven't been able to reproduce what you're seeing - in my case the template is always successfully rendered once the upstream task is started and its service is registered. Before I dig in further, could you post a complete job file you're using that experiences the issue? I want to make sure we're not missing something (e.g. using group vs. task services, etc.)
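
For reference, the group vs. task distinction looks roughly like this (a minimal sketch with placeholder names, not the job from this issue; both placements can register a provider = "nomad" service):

```hcl
job "example" {
  group "example" {
    network {
      port "http" {}
    }

    # Group-level service: registered for the whole group against the
    # group network's port label.
    service {
      provider = "nomad"
      name     = "myservice-group"
      port     = "http"
    }

    task "app" {
      driver = "raw_exec"

      config {
        command = "python3"
        args    = ["-m", "http.server", "8080"]
      }

      # Task-level service: registered by this task specifically; it can
      # still reference the group network's port label.
      service {
        provider = "nomad"
        name     = "myservice-task"
        port     = "http"
      }
    }
  }
}
```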

the test job file i've been using

bug.hcl

```hcl
job "bug" {
  group "group" {
    network {
      port "http" {
        to = 8080
      }
    }

    task "python" {
      driver = "raw_exec"

      config {
        command = "python3"
        args    = ["-m", "http.server", "8080"]
      }

      service {
        provider = "nomad"
        name     = "python"
        port     = "http"
      }

      resources {
        cpu    = 10
        memory = 32
      }
    }

    task "client" {
      driver = "raw_exec"

      template {
        data = <
```
vincenthuynh commented 1 year ago

Hi @shoenig,

We've noticed that it takes a few days (2-3 days) before it starts happening.

Here's another reproduction:

  • An old allocation was stopped and a new one was created, and it happened to be on the same node.
  • It's unable to find the service.
  • Applying the workaround: Simply restarting the nomad service on the client allows the task to discover the service again and start successfully.

Here's our job file:

myservice.hcl

```hcl
job "myservice" {
  group "myservice" {
    network {
      mode = "bridge"
    }

    service {
      name = "myservice"
      port = "8080"
      tags = ["env=${var.env}", "version=${var.version}"]

      connect {
        sidecar_service {}
      }
    }

    task "myservice" {
      driver = "docker"
      leader = true

      config {
        image    = "gcr.io/myservice"
        work_dir = "/local"
      }

      vault {
        policies    = ["myservice"]
        change_mode = "noop"
      }

      template {
        data = <
```

Hope that helps. Thanks!

gulducat commented 1 year ago

I encountered a similar issue caused by having NOMAD_ADDR set in the environment that the nomad agent was run in. That variable apparently got passed through to the Nomad API client that consul-template uses, and caused it to fail its API calls (in my case, for HTTP vs. HTTPS reasons) for the nomadService lookup.

My errors happened very consistently, so I think it's different from this case, but I wanted to mention it here for anyone else who finds this issue like I did. My solution was to ensure NOMAD_ADDR is not set in my nomad agent's environment.
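
One way to rule that out on a systemd-managed client is a drop-in that clears the variable for the agent process; a sketch, with an illustrative drop-in path and unit name:

```ini
# /etc/systemd/system/nomad.service.d/10-clear-nomad-addr.conf
# Keep the agent (and its embedded template runner) from inheriting
# NOMAD_ADDR from the surrounding environment.
[Service]
UnsetEnvironment=NOMAD_ADDR
```

followed by a systemctl daemon-reload and an agent restart.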

IamTheFij commented 1 year ago

This is happening occasionally to me as well (Nomad 1.5.3). It doesn't seem to be consistent as to which service or which host the service disappears from.

To add another odd detail rather than just bumping: the service shows up in the UI; however, it does not show up on any of the nodes via the CLI.
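
The CLI side of that check is something like this (a sketch; myservice stands in for whichever service disappeared):

```sh
# List Nomad-native services across all namespaces
nomad service list -namespace="*"

# Show the registrations (allocation, node, address, port) behind one service
nomad service info myservice
```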


Restarting the allocation seems to resolve the issue and force Nomad to re-register the service.
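
In CLI terms that is roughly the following (the allocation ID is a placeholder):

```sh
# Restart the affected allocation so its service gets registered again
nomad alloc restart 8a3b2c1d
```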

Unfortunately, this time it was my log aggregator that disappeared, so I don't have an easy way to pull logs from around the time of the issue. I'll try to grab them the next time it happens to a different service.

tfritzenwallner-private commented 9 months ago

This issue still consistently happens for us every 2-3 days. I can observe exactly the same as @IamTheFij; however, we run Nomad 1.6.3.

> This is happening occasionally to me as well (Nomad 1.5.3). It doesn't seem to be consistent as to which service or which host the service disappears from.
>
> To add another odd detail rather than just bumping: the service shows up in the UI; however, it does not show up on any of the nodes via the CLI.

mikedvinci90 commented 4 months ago

I have observed the same issue on Nomad 1.7.3 through 1.7.7.

benbourner commented 2 months ago

I observe the same problem in v1.8.1.

dmclf commented 2 months ago

Seeing the same problem: services are clearly visible in the Nomad UI, but cannot be used by templating.

Nomad 1.7.7 (multi-region, multi-dc and ACL enabled)

(Consul-based service templating works fine and reliably, as opposed to Nomad-based services.)
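
For context, the only difference on the template side is the lookup function; a minimal sketch with a placeholder service name:

```
{{/* Consul catalog lookup - the variant that works reliably here */}}
{{ range service "myservice" }}
host: {{ .Address }}:{{ .Port }}
{{ end }}

{{/* Nomad-native service lookup - the variant that intermittently comes back empty */}}
{{ range nomadService "myservice" }}
host: {{ .Address }}:{{ .Port }}
{{ end }}
```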

benbourner commented 2 months ago

Yet another example in Nomad 1.8.1; it's just happening randomly among my services. Because I have Traefik parsing the Nomad services, they just disappear from Traefik and are thus inaccessible. After running rock-solid for years, the Nomad deployments are now just unreliable... :(

Services are up, healthy, and reachable on the given ports...

But the service allocations have again disappeared, so Traefik no longer sees them and I can't access them via their proper URLs...

dannyhpy commented 2 months ago

I've seen this occurring frequently under poor network conditions where

  1. the client misses a heartbeat to the server
  2. the server unregisters the services managed by this client in response
  3. the client responds to the next heartbeat
  4. but the server does not seem to register the services back

I don't know if this is intended and if it is the same issue people are having here.

As a workaround, I was restarting the Nomad agent on the client every 20 mins. (I didn't need HA)
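
That stopgap can be as blunt as a cron entry; a sketch, assuming a systemd unit named nomad and an /etc/cron.d style crontab:

```
# /etc/cron.d/nomad-restart (illustrative path)
# Restart the Nomad client agent every 20 minutes so its services get re-registered
*/20 * * * * root /usr/bin/systemctl restart nomad
```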