hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Handling of committed inconsistent/corrupt state #8097

Open notnoop opened 4 years ago

notnoop commented 4 years ago

Nomad's FSM is sometimes strict in handling log entries: it insists that certain invariants always hold and fails early when it notices inconsistencies or invalid state.

While well-intentioned, the state can still become corrupt due to stray bugs, and the strict handling then makes recovery hard.
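To illustrate the pattern being described, here is a minimal sketch (not Nomad's actual FSM code; the store type and job ID are hypothetical) of an apply handler that fails early when a record it expects is missing. Once the state is inconsistent, every replay of the entry hits the same error, which is the behavior seen in the logs below.

```go
package main

import (
	"errors"
	"fmt"
)

var errJobNotFound = errors.New("job not found")

// store is a stand-in for the state store, keyed by job ID.
type store map[string]struct{}

// applyDeregisterJob enforces the invariant strictly: if the job is absent,
// the log entry is rejected with an error instead of being treated as a no-op.
func applyDeregisterJob(s store, jobID string) error {
	if _, ok := s[jobID]; !ok {
		return errJobNotFound // fail early on the inconsistency
	}
	delete(s, jobID)
	return nil
}

func main() {
	s := store{}
	// The job was already removed (or never written due to an earlier bug),
	// so applying this entry errors out every time it is replayed.
	if err := applyDeregisterJob(s, "example/periodic-1590018900"); err != nil {
		fmt.Println("ERROR nomad.fsm: deregistering job failed:", err)
	}
}
```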

We studied a cluster running 0.8 that was upgraded to 0.10. The cluster ended up with some corrupt state, possibly due to https://github.com/hashicorp/nomad/issues/4299, and job summaries were out of sync.

These had cascading effects in a few places; this was also reported in https://github.com/hashicorp/nomad/issues/5939:

2020-05-21T15:21:59.197Z [ERROR] nomad.fsm: deregistering job failed: job="<ns: "default", id: "REDACTED/periodic-1590018900">" error="job not found"
2020-05-21T15:21:59.198Z [ERROR] nomad.fsm: DeleteJob failed: error="job not found"
2020-05-21T15:21:59.062Z [ERROR] nomad.fsm: DeleteDeployment failed: error="deployment not found"

In both of these cases, strict enforcement of invariants exacerbated the situation and made cluster recovery harder. We could consider automated recovery processes (e.g. if a job summary is invalid, recompute it; deletion should be idempotent, so deleting an already-deleted job shouldn't result in an error), as sketched below.
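A hedged sketch of that recovery idea, not an actual patch: the delete path tolerates already-missing records, and an invalid summary is rebuilt from authoritative state instead of causing the operation to fail. All names here (store, jobSummary, reconcileSummary) are hypothetical.

```go
package main

import "fmt"

type jobSummary struct {
	Running, Complete int
}

type store struct {
	jobs      map[string]struct{}
	summaries map[string]jobSummary
}

// applyDeregisterJob treats "already deleted" as success, so replaying the
// same deletion (or deleting a job lost to an earlier bug) cannot wedge the FSM.
func applyDeregisterJob(s *store, jobID string) error {
	delete(s.jobs, jobID)      // no-op if the job is absent
	delete(s.summaries, jobID) // keep the summary table in sync
	return nil
}

// reconcileSummary recomputes a job summary when the stored one is missing or
// inconsistent with the allocations, rather than rejecting the operation.
func reconcileSummary(s *store, jobID string, fromAllocs jobSummary) {
	if got, ok := s.summaries[jobID]; !ok || got != fromAllocs {
		s.summaries[jobID] = fromAllocs
	}
}

func main() {
	s := &store{jobs: map[string]struct{}{}, summaries: map[string]jobSummary{}}
	fmt.Println(applyDeregisterJob(s, "already-gone")) // <nil>, not "job not found"
	reconcileSummary(s, "web", jobSummary{Running: 3})
	fmt.Println(s.summaries["web"])
}
```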

In the upgrade scenario above, it's unclear to me how the invalid state came to be. My guess is that it was due to bugs in 0.8 (like the ones linked above) but the upgrade to 0.10 exacerbated the situation.

We should scan the FSM/planner checks and ensure that we can recover once invalid state has already been committed to the cluster.

evandam commented 4 years ago

Hey @notnoop, I'm seeing similar errors after upgrading from Nomad 0.11.3 to 0.12.0.

Shortly after upgrading, we started seeing sporadic "error updating job summary: unable to find task group in the job summary: job" errors when planning periodic jobs, but with no meaningful logs from Nomad at the time.

The jobs were already registered and no changes were made to them before or after the cluster upgrade.

Eventually retrying nomad job plan ... works, but the error still seems to happen periodically.

IrinaBron commented 4 years ago

Hey @notnoop, I'm running into the same issue as @evandam.

evandam commented 4 years ago

Dropping some additional information/context around when I was seeing this issue.

We started seeing the "error updating job summary" errors with job plans after restarting the Nomad service in our environment, before we actually upgraded from Nomad 0.11.3 to 0.12.0.

In addition, we saw a "ghost allocation" that did not get stopped by any new deployments since the restart and had to be stopped by hand (the version 1385 allocation in the status output below).

nomad job status prod-platform-core

```
ID            = prod-platform-core
Name          = prod-platform-core
Submit Date   = 2020-07-15T07:47:30-07:00
Type          = service
Priority      = 50
Datacenters   = prod-usw2-prod1
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
app         0       5         13       0       1061      16

Latest Deployment
ID          = 807dcd4f
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group  Auto Revert  Promoted  Desired  Canaries  Placed  Healthy  Unhealthy  Progress Deadline
app         true         true      12       5         12      12       0          2020-07-15T15:00:45Z

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created     Modified
4dfef693  e09a4e06  app         1418     run      running   35m7s ago   33m56s ago
e6af8836  096b7d22  app         1418     run      running   35m10s ago  34m8s ago
44e53e09  e09a4e06  app         1418     run      running   36m11s ago  35m6s ago
d697c3ad  80f9f3ec  app         1418     run      running   36m11s ago  35m8s ago
a6834b5a  37ba0b54  app         1418     run      running   36m11s ago  35m12s ago
2082765e  37ba0b54  app         1418     run      running   36m11s ago  35m3s ago
06ce06c0  80f9f3ec  app         1418     run      running   36m11s ago  35m7s ago
236000f6  f339639f  app         1418     run      running   37m11s ago  36m15s ago
c8b46c76  e09a4e06  app         1418     run      running   37m11s ago  36m11s ago
66baa990  37ba0b54  app         1418     run      running   37m11s ago  36m16s ago
8080c584  096b7d22  app         1418     run      running   37m11s ago  36m16s ago
eedb181f  80f9f3ec  app         1418     run      running   37m11s ago  36m12s ago
37546e28  e09a4e06  app         1417     stop     complete  14h16m ago  34m56s ago
e6cbe54b  096b7d22  app         1417     stop     complete  14h16m ago  35m4s ago
a7208d5d  37ba0b54  app         1417     stop     complete  14h17m ago  36m ago
3ce508d3  f339639f  app         1417     stop     complete  14h17m ago  36m5s ago
97ea4707  80f9f3ec  app         1417     stop     complete  14h17m ago  36m5s ago
4a2c4e9f  e09a4e06  app         1417     stop     complete  14h17m ago  36m5s ago
5c920064  80f9f3ec  app         1417     stop     complete  14h17m ago  36m2s ago
91729b2d  80f9f3ec  app         1417     stop     complete  14h18m ago  36m5s ago
84faf102  37ba0b54  app         1417     stop     complete  14h18m ago  36m5s ago
0d5acd27  e09a4e06  app         1417     stop     complete  14h18m ago  36m ago
58eb0325  37ba0b54  app         1417     stop     complete  14h18m ago  36m5s ago
686828bd  096b7d22  app         1417     stop     complete  14h18m ago  36m5s ago
e7c0e967  f339639f  app         1385     run      running   5d16h ago   5d16h ago
```

Might be better off as a separate issue, but worth mentioning!

notnoop commented 4 years ago

Thank you very much for reporting more data. I'll need to dig into this further and will follow up with some clarifying questions. I'm very surprised that the error occurred for a non-parameterized/periodic job!