GC limits > 3 days are in effect infinite b/c of FSM timetable limit

stswidwinski commented 1 year ago

Nomad version

1.5.0 and anything prior.

Operating system and Environment details

Unix.

Issue

When a garbage collection limit is set to a value larger than 3 days, the Nomad scheduler never garbage collects the affected objects. This leads to unbounded accumulation of data (an effectively infinite memory and disk leak) and of related resources (such as CSI volumes). The GC limits included are at least:

The expected behavior is that it is possible to set garbage collection limits at a much larger maximal value than 3 days to allow for history build up and easier debugging.

The details of the bug.

At the time of garbage collection, Nomad will derive an approximate raft index which is used as a watermark for garbage collection. The mapping of time to such an index is handled uniformly via:

https://github.com/hashicorp/nomad/blob/v1.5.0/nomad/core_sched.go#L1133-L1143

This relies on fsm and the TimeTable which is initialized within. To be precise, the initialization of this table occurs here:

https://github.com/hashicorp/nomad/blob/v1.5.0/nomad/fsm.go#L170

With a hard-coded maximal time table limit:

https://github.com/hashicorp/nomad/blob/v1.5.0/nomad/fsm.go#L27-L29

If the requested time is older than anything the table still retains (that is, the limit is breached), the lookup defaults to an index of zero:

https://github.com/hashicorp/nomad/blob/v1.5.0/nomad/timetable.go#L93-L106
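To make the failure mode concrete, here is a minimal sketch of a bounded time table. It is a simplified stand-in, not Nomad's actual code: the names `entry`, `timeTable`, `witness`, `nearestIndex`, and `thresholdIndexFor` are hypothetical, and it assumes one Raft index is witnessed per hour purely to keep the arithmetic easy to follow.

```go
package main

import (
	"sort"
	"time"
)

// entry pairs a Raft index with the wall-clock time it was witnessed.
// Hypothetical type, for illustration only.
type entry struct {
	index uint64
	when  time.Time
}

// timeTable is a simplified stand-in for a bounded time table: entries are
// kept newest-first and anything older than limit is dropped.
type timeTable struct {
	limit   time.Duration
	entries []entry // newest first
}

// witness records index at time when and prunes entries older than limit.
func (t *timeTable) witness(index uint64, when time.Time) {
	t.entries = append([]entry{{index, when}}, t.entries...)
	cutoff := when.Add(-t.limit)
	n := sort.Search(len(t.entries), func(i int) bool {
		return t.entries[i].when.Before(cutoff)
	})
	t.entries = t.entries[:n]
}

// nearestIndex returns the newest index witnessed at or before when,
// or 0 if the table retains no entry that old.
func (t *timeTable) nearestIndex(when time.Time) uint64 {
	n := sort.Search(len(t.entries), func(i int) bool {
		return !t.entries[i].when.After(when)
	})
	if n < len(t.entries) {
		return t.entries[n].index
	}
	return 0
}

// thresholdIndexFor witnesses one index per hour for simulatedHours hours
// (an assumption made for the demo), then resolves the index nearest to
// "now minus the GC threshold".
func thresholdIndexFor(limitHours, gcHours, simulatedHours int) uint64 {
	tt := &timeTable{limit: time.Duration(limitHours) * time.Hour}
	start := time.Date(2023, 3, 1, 0, 0, 0, 0, time.UTC)
	now := start
	for i := 1; i <= simulatedHours; i++ {
		now = start.Add(time.Duration(i) * time.Hour)
		tt.witness(uint64(i), now)
	}
	return tt.nearestIndex(now.Add(-time.Duration(gcHours) * time.Hour))
}
```

With a 72-hour limit and ten simulated days, `thresholdIndexFor(72, 24, 240)` resolves to index 216, whereas `thresholdIndexFor(72, 96, 240)` resolves to 0 because the table no longer holds any entry 96 hours old.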

Hence, thresholdIndex = 0, so any check of the form X.modifyIndex > thresholdIndex evaluates to true (modify indexes are always greater than zero) and nothing is garbage collected. For instance, for evals:

https://github.com/hashicorp/nomad/blob/v1.5.0/nomad/core_sched.go#L282-L288
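The effect of a zero threshold on such a check can be sketched as follows. This is an illustration under stated assumptions, not the linked Nomad code: the `eval` struct and the `collectible` function are hypothetical and keep only the two properties relevant here (terminal status and modify index).

```go
package main

// eval is a hypothetical, pared-down evaluation record.
type eval struct {
	id          string
	modifyIndex uint64
	terminal    bool
}

// collectible returns the IDs of evals that would pass a threshold check of
// the form described above: an eval is skipped if it is not terminal or if
// its modify index is newer than thresholdIndex.
func collectible(evals []eval, thresholdIndex uint64) []string {
	var ids []string
	for _, e := range evals {
		if !e.terminal || e.modifyIndex > thresholdIndex {
			continue // too new (or still running): keep it
		}
		ids = append(ids, e.id)
	}
	return ids
}
```

Given terminal evals with modify indexes 100 and 200, a threshold of 150 selects only the first one, while a threshold of 0 selects none, no matter how old the evals are.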

Repro.

The simplest way to reproduce this behavior is to modify the code so that the maximal time table limit of the FSM is something small, and then observe that no GC occurs for evaluations which should be GCed. A unit test of the FSM or of garbage collection may also be used to confirm the behavior.

tgross commented 1 year ago

Hi @stswidwinski! That's certainly a nasty bug. I'm pretty sure the reason we limit the time table to 72h is to avoid having infinite growth of that table, but yeah that definitely assumes that we're not setting thresholds greater than that. It'd probably be reasonable to have the configuration find the largest GC threshold and double it in the FSM configuration, but we'd want to document a warning that this will potentially allow a good bit of memory growth.
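One way to read that suggestion, as a rough sketch only: derive the time table limit from the configured GC thresholds instead of hard-coding it. The names `defaultTimeTableLimit`, `timeTableLimit`, and `maxEntries` are hypothetical, and this is not necessarily how Nomad implements or would implement the fix.

```go
package main

import "time"

// defaultTimeTableLimit mirrors the 72h limit discussed above.
const defaultTimeTableLimit = 72 * time.Hour

// timeTableLimit returns twice the largest GC threshold, but never less than
// the default limit. Hypothetical helper, for illustration only.
func timeTableLimit(gcThresholds ...time.Duration) time.Duration {
	limit := defaultTimeTableLimit
	for _, th := range gcThresholds {
		if 2*th > limit {
			limit = 2 * th
		}
	}
	return limit
}

// maxEntries estimates how many entries a table with the given limit holds
// if it records at most one entry per granularity interval.
func maxEntries(limit, granularity time.Duration) int {
	if granularity <= 0 {
		return 0
	}
	return int(limit / granularity)
}
```

The memory concern follows from `maxEntries`: the table grows linearly with the limit. Assuming, for the sake of example, a 5-minute granularity, a 72h limit caps the table at 864 entries, while a 30-day GC threshold doubled to 60 days would allow 17,280.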
