Dgotlieb opened 1 year ago
Thanks for reporting, @Dgotlieb. There is no hard limit on the number of groups that can exist in a job spec; however, as you've found, you can overwhelm the system's ability to process the evaluations that get created in a pathological case like all of those tasks entering a crash loop. There has been some recent work around load shedding of evaluations; we'll want to see whether any of that applies to this case or whether we need to do something extra.
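To get a rough sense of the evaluation volume involved, here is a back-of-envelope sketch. The multipliers are illustrative assumptions, not Nomad's actual accounting; the real number of evaluations depends on the scheduler and the job's restart/reschedule policies.

```python
# Back-of-envelope estimate of evaluation churn when every group crash-loops.
# Assumption (illustrative only): each failing allocation produces at least
# one new evaluation per restart cycle.

groups = 780          # task groups in the job (from the report)
tasks_per_group = 2   # tasks per group (from the report)
restart_attempts = 2  # hypothetical restarts per interval before reschedule

allocs = groups                              # one allocation per group (count = 1)
tasks = groups * tasks_per_group             # total crashing tasks
evals_per_cycle = allocs * restart_attempts  # lower-bound evals per restart interval

print(tasks)            # 1560 crashing tasks
print(evals_per_cycle)  # 1560 evaluations per cycle under these assumptions
```

Even under these conservative assumptions, the servers face a sustained stream of hundreds to thousands of evaluations per restart interval, which is the kind of load the evaluation load-shedding work targets.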
OK. I will also add that even after the job had stopped, I still saw errors in the cluster logs, servers disappearing from the UI/CLI, failures in Consul health checks for ports 4646/4647/4648, and other issues that could take hours to resolve on their own.
@shoenig just a few more points:
Thanks
I would also be interested in the answers to the above questions.
Nomad version
Nomad v1.4.3 (f464aca721d222ae9c1f3df643b3c3aaa20e2da7)
Operating system and Environment details
Infra resources
10 Clients (3 are also servers) with the below spec:
Issue
I have a job with 780 groups and 2 tasks in each group. When all the tasks enter a restart loop (all crashing together), Nomad doesn't handle it well.
Reproduction steps
Run the job file below and force all of its tasks into a restart loop.
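The attached job file is not reproduced here, but a minimal sketch of its shape might look like the fragment below. All names, images, and restart parameters are hypothetical; the real file repeats a near-identical group block 780 times, and a command that always exits non-zero is one simple way to force the restart loop.

```hcl
# Hypothetical fragment: one of 780 near-identical groups.
job "crashloop-repro" {
  datacenters = ["dc1"]

  group "group-001" {
    count = 1

    restart {
      attempts = 2
      interval = "30s"
      delay    = "5s"
      mode     = "delay"   # keep retrying, producing continuous churn
    }

    task "app" {
      driver = "docker"
      config {
        image   = "busybox:1.36"
        command = "/bin/false"   # exits non-zero immediately -> restart loop
      }
    }

    task "sidecar" {
      driver = "docker"
      config {
        image   = "busybox:1.36"
        command = "/bin/false"
      }
    }
  }

  # ... group "group-002" through "group-780" repeated with the same shape
}
```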
Expected Result
Nomad should handle the restart loop and not get stuck.
Actual Result
UI - all tabs are stuck and often show a server error message
CLI - no operations work; they fail with the errors below:
Job file
Server.hcl
Client.hcl
Suspect
I suspect the size of the final job HCL, containing 780 groups, is the issue. I'm not sure how the HCL size relates to the Raft indices, but many of the errors are Raft-related.
Worth Mentioning
I removed the Connect blocks from the groups to reduce the number of containers, health checks, and Consul-related operations, but I wonder if a restart loop of multiple containers is something that was tested and that the scheduler should be able to handle.
$ nomad system gc, $ nomad system reconcile summaries, and $ nomad stop <job_id> -purge all throw the mentioned errors, with no ability to repair the state (even manually).
Questions