hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

GC limits > 3 days are in effect infinite b/c of FSM timetable limit #16359

Open stswidwinski opened 1 year ago

stswidwinski commented 1 year ago

Nomad version

1.5.0 and anything prior.

Operating system and Environment details

Unix.

Issue

When any garbage collection threshold is set to a value larger than 3 days (72 hours), the Nomad scheduler never garbage collects the affected objects. Data then accumulates without bound (ever-growing memory and disk usage), as do related resources such as CSI volumes. The affected GC settings include at least:

  1. https://developer.hashicorp.com/nomad/docs/configuration/server#eval_gc_threshold
  2. https://developer.hashicorp.com/nomad/docs/configuration/server#batch_eval_gc_threshold
  3. https://developer.hashicorp.com/nomad/docs/configuration/server#deployment_gc_threshold
  4. https://developer.hashicorp.com/nomad/docs/configuration/server#job_gc_threshold
  5. https://developer.hashicorp.com/nomad/docs/configuration/server#acl_token_gc_threshold
  6. https://developer.hashicorp.com/nomad/docs/configuration/server#csi_plugin_gc_threshold
  7. https://developer.hashicorp.com/nomad/docs/configuration/server#csi_volume_claim_gc_interval

The expected behavior is that garbage collection thresholds can be set well above 3 days, to allow history to build up and to make debugging easier.

The details of the bug.

At the time of garbage collection, Nomad will derive an approximate raft index which is used as a watermark for garbage collection. The mapping of time to such an index is handled uniformly via:

https://github.com/hashicorp/nomad/blob/v1.5.0/nomad/core_sched.go#L1133-L1143

This relies on the fsm and the TimeTable that is initialized within it. To be precise, the table is initialized here:

https://github.com/hashicorp/nomad/blob/v1.5.0/nomad/fsm.go#L170

With a hard-coded maximal time table limit:

https://github.com/hashicorp/nomad/blob/v1.5.0/nomad/fsm.go#L27-L29

If the requested time is older than anything the table retains under that limit, the index lookup defaults to zero:

https://github.com/hashicorp/nomad/blob/v1.5.0/nomad/timetable.go#L93-L106
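To make the failure mode concrete, here is a minimal sketch of a bounded time table. It is not Nomad's implementation: the names timeTable, witness, nearestIndex and lookupAfterHours are hypothetical, and the pruning is simplified. It only assumes what is described above, namely that entries older than a fixed limit are discarded and that a lookup with no match returns 0.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// entry pairs a Raft index with the wall-clock time at which it was observed.
type entry struct {
	index uint64
	when  time.Time
}

// timeTable is a hypothetical, simplified stand-in for the FSM time table.
// Entries are kept newest-first; anything older than limit is dropped.
type timeTable struct {
	limit   time.Duration
	entries []entry
}

// witness records that index was applied at when, then prunes expired entries.
func (t *timeTable) witness(index uint64, when time.Time) {
	t.entries = append([]entry{{index, when}}, t.entries...)
	cutoff := when.Add(-t.limit)
	n := sort.Search(len(t.entries), func(i int) bool {
		return t.entries[i].when.Before(cutoff)
	})
	t.entries = t.entries[:n]
}

// nearestIndex returns the newest index observed at or before when,
// or 0 if when is older than everything the table still remembers.
func (t *timeTable) nearestIndex(when time.Time) uint64 {
	n := sort.Search(len(t.entries), func(i int) bool {
		return !t.entries[i].when.After(when)
	})
	if n < len(t.entries) {
		return t.entries[n].index
	}
	return 0
}

// lookupAfterHours fills a table with one entry per hour of history
// (the index grows by 1 per hour) and resolves the index for a GC
// threshold of ageHours.
func lookupAfterHours(limitHours, historyHours, ageHours int) uint64 {
	now := time.Date(2023, 3, 7, 0, 0, 0, 0, time.UTC)
	tt := &timeTable{limit: time.Duration(limitHours) * time.Hour}
	for h := historyHours; h >= 0; h-- {
		tt.witness(uint64(historyHours-h+1), now.Add(-time.Duration(h)*time.Hour))
	}
	return tt.nearestIndex(now.Add(-time.Duration(ageHours) * time.Hour))
}

func main() {
	// A 48h threshold fits inside the 72h table and resolves to a real index.
	fmt.Println(lookupAfterHours(72, 200, 48))
	// A 96h threshold falls off the end of the table and resolves to 0.
	fmt.Println(lookupAfterHours(72, 200, 96))
}
```

In this sketch, any threshold beyond the table limit resolves to 0 no matter how much history the cluster actually has.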

Hence thresholdIndex = 0, so every check of the form X.modifyIndex > thresholdIndex evaluates to true and nothing is garbage collected. For instance, for evals:

https://github.com/hashicorp/nomad/blob/v1.5.0/nomad/core_sched.go#L282-L288
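The effect of a zero threshold on that check can be sketched as follows. The eval type and the collectible function are hypothetical and carry only what the comparison needs; the real check lives in the core_sched.go lines linked above. The sketch relies on Raft indexes starting at 1, so every stored object has a modify index greater than 0.

```go
package main

import "fmt"

// eval is a hypothetical stand-in carrying only the field the check needs.
type eval struct {
	id          string
	modifyIndex uint64
}

// collectible returns the IDs of evals that were last modified at or
// before thresholdIndex. Anything modified after the threshold is
// treated as too new and skipped.
func collectible(evals []eval, thresholdIndex uint64) []string {
	var ids []string
	for _, e := range evals {
		if e.modifyIndex > thresholdIndex {
			continue // too new to collect
		}
		ids = append(ids, e.id)
	}
	return ids
}

func main() {
	evals := []eval{
		{id: "old", modifyIndex: 100},
		{id: "older", modifyIndex: 40},
		{id: "recent", modifyIndex: 900},
	}

	// With a meaningful threshold, the two old evals are eligible.
	fmt.Println(collectible(evals, 250))

	// With thresholdIndex = 0, every eval is "too new" and none is eligible.
	fmt.Println(collectible(evals, 0))
}
```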

Repro.

The simplest way to reproduce this behavior is to modify the code so that the maximal time table limit of the fsm is something small, then observe that no GC occurs for evaluations which should be GCed. A unit test of the FSM or of garbage collection may also be used to confirm the behavior.

tgross commented 1 year ago

Hi @stswidwinski! That's certainly a nasty bug. I'm pretty sure the reason we limit the time table to 72h is to avoid infinite growth of that table, but yeah, that definitely assumes that we're not setting thresholds greater than that. It'd probably be reasonable to have the configuration find the oldest GC threshold and double it in the FSM configuration, but we'd want to document a warning that this will potentially allow a good bit of memory growth.
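A rough sketch of that idea, for illustration only: derive the time table limit from the largest configured GC threshold and double it. The function name timeTableLimit and the choice to keep 72h as a floor are assumptions made here, not anything Nomad is confirmed to implement.

```go
package main

import (
	"fmt"
	"time"
)

const day = 24 * time.Hour

// defaultLimit mirrors the 72h limit discussed in this issue.
const defaultLimit = 3 * day

// timeTableLimit is a hypothetical helper: it returns twice the largest
// GC threshold, but never less than defaultLimit (the floor is an
// assumption of this sketch).
func timeTableLimit(thresholds ...time.Duration) time.Duration {
	limit := defaultLimit
	for _, th := range thresholds {
		if doubled := 2 * th; doubled > limit {
			limit = doubled
		}
	}
	return limit
}

func main() {
	// Thresholds of 1h, 24h and 7 days: the table must cover 14 days.
	fmt.Println(timeTableLimit(time.Hour, day, 7*day))

	// All thresholds small: the 72h floor applies.
	fmt.Println(timeTableLimit(time.Hour, 4*time.Hour))
}
```

The trade-off is the one noted above: a longer limit means more time table entries are retained, so memory use grows with the largest threshold.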

tgross commented 1 year ago

Related: https://github.com/hashicorp/nomad/issues/17233