notnoop opened this issue 4 years ago
Hey @notnoop, I'm seeing similar errors after upgrading from Nomad 0.11.3 to 0.12.0.
Shortly after upgrading, we started seeing sporadic `error updating job summary: unable to find task group in the job summary: job` errors
when planning periodic jobs, but with no meaningful logs on the Nomad servers at the time.
The jobs were already registered, and no changes were made to them before or after the cluster upgrade.
Retrying `nomad job plan ...`
eventually works, but the error still happens periodically.
Hey @notnoop, I'm running into the same issue as @evandam.
Dropping some additional information/context around when I was seeing this issue.
We started seeing the `error updating job summary`
errors on job plans after restarting the Nomad service in our environment, before we actually upgraded from Nomad 0.11.3 to 0.12.0.
In addition, we saw a "ghost allocation" that did not get stopped by any new deployments after the restart and had to be stopped by hand.
`nomad job status prod-platform-core`
Might be better off as a separate issue, but worth mentioning!
Thank you very much for reporting more data. I'll need to dig into this further and will follow up with some clarifying questions. I'm very surprised that the error occurred for a non-parameterized/periodic job!
Nomad's FSM handling is sometimes strict when applying log entries: it insists that certain invariants always hold and fails early if it notices inconsistencies or invalid state.
While well-intentioned, the state does occasionally become corrupt due to unrelated bugs, and that strictness makes recovery hard.
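To make that concrete, here is a minimal Go sketch of the fail-early pattern. This is not actual Nomad source; the `JobSummary` type and `applyAllocUpdate` function are hypothetical simplifications, showing how an out-of-sync summary turns into a hard error like the one reported above:

```go
// Hypothetical simplification of a strict FSM apply: if a task group
// referenced by an update is missing from the stored summary, the apply
// fails instead of repairing the summary.
package main

import (
	"errors"
	"fmt"
)

// JobSummary mirrors the idea of a per-job, per-task-group summary.
type JobSummary struct {
	JobID   string
	Summary map[string]int // task group name -> running alloc count
}

// applyAllocUpdate is strict: it refuses to proceed if the task group is
// not already present in the summary, which is roughly how the
// "unable to find task group in the job summary" error surfaces.
func applyAllocUpdate(s *JobSummary, taskGroup string) error {
	if _, ok := s.Summary[taskGroup]; !ok {
		return errors.New("error updating job summary: unable to find task group in the job summary: " + s.JobID)
	}
	s.Summary[taskGroup]++
	return nil
}

func main() {
	// Simulate a summary that drifted out of sync (e.g. after an upgrade):
	// the job has a "web" task group, but the summary lost it.
	s := &JobSummary{JobID: "prod-platform-core", Summary: map[string]int{}}
	if err := applyAllocUpdate(s, "web"); err != nil {
		fmt.Println(err) // the strict check turns drift into a hard failure
	}
}
```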
We studied a cluster running 0.8 that was upgraded to 0.10. The cluster ended up with some corrupt state, possibly due to https://github.com/hashicorp/nomad/issues/4299, with job summaries being out of sync.
These had cascading effects in a few places:
This was reported as well in https://github.com/hashicorp/nomad/issues/5939.
In both of these cases, strict enforcement of invariants exacerbated the situation and made cluster recovery harder. We can consider adding automated recovery (e.g. if a job summary is invalid, recompute it; deletion should be idempotent, and deleting an already deleted job shouldn't result in an error).
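As a hedged illustration of that self-healing idea, here is a small Go sketch, again using hypothetical names rather than real Nomad functions, where the apply repairs a missing summary entry instead of erroring and deletes are idempotent:

```go
// Hypothetical sketch of the self-healing alternative: recompute the summary
// entry when the invariant is violated, and treat deleting an already
// deleted job as a harmless no-op.
package main

import "fmt"

type JobSummary struct {
	JobID   string
	Summary map[string]int
}

// applyAllocUpdate repairs a missing task group entry rather than failing.
func applyAllocUpdate(s *JobSummary, taskGroup string) {
	if s.Summary == nil {
		s.Summary = map[string]int{}
	}
	if _, ok := s.Summary[taskGroup]; !ok {
		// Invariant violated: a real implementation would recompute the
		// summary from current state here instead of aborting the apply.
		s.Summary[taskGroup] = 0
	}
	s.Summary[taskGroup]++
}

// deleteJob is idempotent: deleting a job that is already gone is not an error.
func deleteJob(store map[string]*JobSummary, jobID string) {
	delete(store, jobID) // deleting a missing map key is already a no-op in Go
}

func main() {
	store := map[string]*JobSummary{
		"prod-platform-core": {JobID: "prod-platform-core", Summary: map[string]int{}},
	}

	applyAllocUpdate(store["prod-platform-core"], "web") // repaired, not failed
	fmt.Println(store["prod-platform-core"].Summary)     // map[web:1]

	deleteJob(store, "prod-platform-core")
	deleteJob(store, "prod-platform-core") // second delete is a harmless no-op
	fmt.Println(len(store))                // 0
}
```

The trade-off is that silent repair can mask the underlying bug, so any such recovery path should log loudly when it fires.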
In the upgrade scenario above, it's unclear to me how the invalid state came to be. My guess is that it was due to bugs in 0.8 (like the ones linked above), but the upgrade to 0.10 exacerbated the situation.
We should scan the FSM/planner checks and ensure that we can recover once invalid state has already been committed to the cluster.