stackabletech / commons-operator

Operator for common objects of the Stackable Data Platform

Pod expiration drifts when system is suspended #302

Open nightkr opened 1 week ago

nightkr commented 1 week ago

Affected Stackable version

dev (24.11 prerelease)

Current and expected behavior

@xeniape ran into an issue (Stackable employees: see Slack) where pods were left running with expired certificates after a while, rather than being evicted by commons-op as expected. Restarting commons-op evicted the pods.

Our current working hypothesis is that commons-op's re-reconciliation timer did not advance while the computer was suspended, so the eviction was delayed by however long the suspension lasted.

Possible solution

One of:

  1. Change the timer to use wall-clock time instead of a monotonic clock (on Linux, CLOCK_MONOTONIC does not advance while the system is suspended)
  2. Cap the re-reconciliation interval, causing spurious reconciles but at least bounding the delay
  3. Make the timer automatically expire when resuming from suspend
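
To make options 1 and 2 concrete, here is a minimal sketch using only the Rust standard library. It is not commons-op's actual code: `next_requeue` and `is_expired` are hypothetical helpers, and the cap value is left to the caller. The idea is to compare against a wall-clock deadline on every reconcile and never ask for a wait longer than the cap, so a timer that stalls during suspend is re-checked soon after resume.

```rust
use std::time::{Duration, SystemTime};

/// Hypothetical helper: how long to wait before the next reconcile.
///
/// `expires_at` is a wall-clock deadline (for example, certificate expiry),
/// `now` is the current wall-clock time, and `max_requeue` caps the wait.
fn next_requeue(expires_at: SystemTime, now: SystemTime, max_requeue: Duration) -> Duration {
    // `duration_since` returns Err when `now` is already past `expires_at`;
    // in that case there is nothing left to wait for.
    let remaining = expires_at
        .duration_since(now)
        .unwrap_or(Duration::ZERO);
    // Option 2: never wait longer than the cap, even if expiry is far away.
    remaining.min(max_requeue)
}

/// Hypothetical helper: true once the wall-clock deadline has passed.
/// Option 1: the decision is based on wall-clock time, not on whether a
/// monotonic timer happened to fire.
fn is_expired(expires_at: SystemTime, now: SystemTime) -> bool {
    now >= expires_at
}
```

In a kube-rs controller, the returned duration would presumably be what gets passed to `Action::requeue`, with `SystemTime::now()` supplied as `now` on each reconcile. The trade-off is the one named in option 2: a shorter cap means more spurious reconciles, a longer cap means a longer worst-case delay after resume.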

Either way, we should probably also communicate upstream with kube-rs and either fix it there or highlight the issue somehow.

Additional context

No response

Environment

No response

Would you like to work on fixing this bug?

None