Open vincenthuynh opened 1 year ago
The workaround is to restart the Nomad service/agent on the client node.
Hi @vincenthuynh, so far I haven't been able to reproduce what you're seeing. In my tests the template is always successfully rendered once the upstream task is started and its service is registered. Before I dig in further, could you post a complete job file you're using that experiences the issue? I want to make sure we're not missing something (e.g. using group vs. task services, etc.)
the test job file i've been using
Hi @shoenig,
We've noticed that it takes a few days (2-3 days) before it starts happening.
Here's another reproduction:
Here's our job file:
Hope that helps. Thanks!
I encountered a similar issue caused by having NOMAD_ADDR set in the environment that nomad agent was run in. That var apparently went through to the Nomad API client that consul-template uses, and caused it to fail its API calls (in my case, for HTTP vs. HTTPS reasons) for the nomadService lookup.
My errors happened very consistently, so different I think from this case, but wanted to mention here for anyone else who finds this issue like I did. My solution was to ensure NOMAD_ADDR is not set in my nomad agent environment.
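For anyone who wants to check for this before starting the agent, here is a minimal sketch in Python. It only inspects the environment; the function name `classify_nomad_addr` and the `agent_uses_tls` flag are made up for illustration and are not part of Nomad or consul-template.

```python
from urllib.parse import urlsplit


def classify_nomad_addr(env, agent_uses_tls):
    """Classify NOMAD_ADDR in the environment the agent would inherit.

    Returns "unset", "mismatch" (scheme disagrees with the agent's TLS
    setting), or "set" (present and matching, but still inherited).
    Hypothetical helper, not a Nomad API.
    """
    addr = env.get("NOMAD_ADDR")
    if not addr:
        return "unset"
    scheme = urlsplit(addr).scheme.lower()
    expected = "https" if agent_uses_tls else "http"
    if scheme != expected:
        return "mismatch"
    return "set"


# Example: an HTTP address inherited by an agent that serves HTTPS.
print(classify_nomad_addr({"NOMAD_ADDR": "http://127.0.0.1:4646"}, True))
```

In practice you would pass `os.environ` from the same shell or unit that launches the agent. Anything other than "unset" is worth a look, since the fix described above was to not set the variable at all.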
This is happening occasionally to me as well (Nomad 1.5.3). It doesn't seem to be consistent as to which service or which host the service disappears from.
To add another odd detail rather than just bumping, the service shows up in the UI, however it does not show on any of the nodes via the CLI.
Restarting the allocation seems to resolve the issue and force Nomad to re-register the service.
Unfortunately, this time it was my log aggregator that disappeared, so I don't have an easy way to pull logs from around the time of the issue. I'll try to grab them the next time it happens to a different service.
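One way to catch a disappearing registration early, instead of finding out when something downstream breaks, is to compare the services you expect against what the agent's service listing returns. The sketch below is in Python and rests on assumptions: the listing is expected in the shape of the Nomad HTTP API's `/v1/services` response (a list of namespace entries, each with a `Services` list of objects carrying `ServiceName`), the address is a placeholder, ACL tokens are omitted, and `log-aggregator` is a made-up service name.

```python
import json
from urllib.request import urlopen


def registered_names(listing):
    """Collect service names from a /v1/services style listing."""
    names = set()
    for namespace_entry in listing:
        for svc in namespace_entry.get("Services") or []:
            names.add(svc["ServiceName"])
    return names


def missing_services(expected, listing):
    """Return the expected service names that are absent from the listing."""
    return sorted(set(expected) - registered_names(listing))


def fetch_listing(base_url="http://127.0.0.1:4646"):
    """Fetch the listing from an agent (assumed address, no ACL token).

    Not called in this sketch; it needs a reachable Nomad agent.
    """
    with urlopen(f"{base_url}/v1/services", timeout=5) as resp:
        return json.load(resp)


# Offline example with a hand-written listing in the assumed shape.
sample = [
    {"Namespace": "default",
     "Services": [{"ServiceName": "myservice", "Tags": []}]},
]
print(missing_services(["myservice", "log-aggregator"], sample))
```

Run periodically against `fetch_listing()`, a non-empty result could trigger an alert or a targeted allocation restart rather than a blanket agent restart.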
This issue still consistently happens for us every 2-3 days. I can observe exactly the same behavior as @IamTheFij, although we run Nomad 1.6.3.
I have observed the same issue on Nomad 1.7.3 through 1.7.7.
I observe the same problem in v1.8.1.
Seeing the same problem: services are clearly visible in the Nomad UI, but cannot be used by templating.
Nomad 1.7.7 (multi-region, multi-dc and ACL enabled)
(Consul-based service templating works fine and reliable, as opposed to Nomad-based services)
Yet another example in Nomad 1.8.1: it's just happening randomly among my services. Because I have traefik parsing the Nomad services, they just disappear from traefik and are thus inaccessible. After running rock-solid for years, the Nomad deployments are now just unreliable... :(
Services are up, healthy and reachable on the given ports...
But the service allocations have again disappeared so traefik no longer sees them, so I can't access them via proper URLs...
I've seen this occurring frequently under poor network conditions where
I don't know if this is intended and if it is the same issue people are having here.
As a workaround, I was restarting the Nomad agent on the client every 20 mins. (I didn't need HA)
Nomad version
Nomad v1.4.7
Operating system and Environment details
Debian 4.19.260-1 (2022-09-29) x86_64 GNU/Linux
Issue
Allocation is unable to find Nomad service when it exists. It seems to start happening on a client after an uptime of 2-3 days.
Reproduction steps
myservice using the Nomad provider
NomadService function to reference the service that was registered in Task 1
Able to list service:
Expected Result
Able to discover a service consistently
Actual Result
Task log:
Job file (if appropriate)
Task 1:
Task 2:
Nomad Client logs