hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

System job not restarting after client failure. #15069

Open blmhemu opened 1 year ago

blmhemu commented 1 year ago

Nomad version

1.4.2

Operating system and Environment details

Ubuntu arm64

Issue

If a client goes down, the system job allocations on that client are not restarted when it comes back up unless an evaluation is triggered manually.

[Screenshot: 2022-10-28 at 1:51 PM]

The job status on the client also shows 2 failed. It should be 1 failed, 1 running because, as you can see below, one allocation is running.

Reproduction steps

Run a system job. Take the client (or the whole cluster?) down. Bring the nodes back up. Check whether the system job has all its allocations.

Expected Result

All allocations present.

Actual Result

Not all allocations present.

Job file (if appropriate)

Same as https://github.com/hashicorp/nomad/issues/14932

Nomad Server logs (if appropriate)

The alloc was killed due to:

Template failed: nomad.var.get(nomad/jobs/caddy/caddy/caddy@default.global): Unexpected response code: 500 (rpc error: failed to get conn: rpc error: lead thread didn't get connection)

Nomad Client logs (if appropriate)

lgfa29 commented 1 year ago

Hi @blmhemu 👋

Thanks for the report. Do you happen to have more logs from around the time the issue happened, so we can get a better picture of what was going on with the connection between the client and servers?

Thanks!

blmhemu commented 1 year ago

Hey! I did not find any relevant logs from that time. But if I change the network mode to normal, this issue does not occur. I think using the bridge CNI plugin is causing this issue. Also note that there has been a client restart.

lgfa29 commented 1 year ago

Do you have any logs available? It's kind of hard to investigate without more information 😅

blmhemu commented 1 year ago

2022-10-18T15:04:08Z  Setup Failure  failed to setup alloc: pre-run hook "network" failed: failed to configure networking for alloc: failed to configure network: plugin type="bridge" failed (add): failed to allocate for range 0: 172.26.65.37 has been allocated to c31f3174-7c95-6c1e-d782-ea00579f84c3, duplicate allocation is not allowed

This is one log I found.

ostkrok commented 7 months ago

Hi,

Just wanted to report that we are also seeing this exact problem where some allocations of system jobs will not be restarted on a node in certain cases (often related to the node having been disconnected from the cluster). I'll try to dig up some logs and attach here later today.

For other people having issues with this, here is what we usually do when we discover this:

nomad status | grep system | cut -f 1 -d " " | xargs -L1 nomad job eval

I.e., force Nomad to re-evaluate all system jobs in our cluster. It's not pretty, but it fixes the missing allocations without having to restart the jobs.
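A slightly more targeted sketch of that workaround, matching on the Type column instead of grepping for the word "system" (so job names that merely contain "system" are not caught); `reeval_system_jobs` is a hypothetical helper name, and this assumes `nomad` is on the PATH and pointed at the right cluster:

```shell
# Workaround sketch, not an official fix: force an evaluation for
# every system job so missing allocations get rescheduled.
reeval_system_jobs() {
  # Column 2 of `nomad status` output is the job Type; filtering on
  # it avoids accidentally matching service jobs named "*system*".
  nomad status | awk '$2 == "system" {print $1}' | while read -r job; do
    # Trigger a fresh evaluation for each system job.
    nomad job eval "$job"
  done
}
```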

p1u3o commented 7 months ago

I think I've started encountering the same issue. I was able to mitigate it by putting ExecStartPre=/bin/sleep 90 in nomad.service, although the above command also works.

This is a wild guess, but on a quick restart (e.g. in a VM), if the client comes back before the heartbeat_grace period elapses, the server does not seem to consider the client down and attempts to resume the allocations.

But the network namespace is gone, and the client doesn't attempt to re-create it. I can see this because, if no new allocations are scheduled, there is no "nomad" bridge interface until the client creates a new allocation.

The sleep forces the client to be considered down.
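For reference, the delayed-start mitigation above can be applied as a systemd drop-in instead of editing the unit file directly; the 90-second value is the one quoted above, not a recommendation, and the drop-in path assumes Nomad runs as nomad.service (e.g. /etc/systemd/system/nomad.service.d/override.conf, followed by `systemctl daemon-reload`):

```ini
# Drop-in for nomad.service: delay startup past the server's heartbeat
# window so the client is marked down and its system-job allocations
# are rescheduled rather than (wrongly) resumed.
[Service]
ExecStartPre=/bin/sleep 90
```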

mwild1 commented 6 months ago

This feels like a duplicate of, or closely related to, https://github.com/hashicorp/nomad/issues/12023