Closed fulmicoton closed 6 days ago
(@trinity-1686a maybe there is no problem... If so, please just comment here and close the ticket)
There is definitely a problem here. Last I checked, the retention policy is executed on a strict cron-like schedule. If many indexes share the same schedule frequency, they all run at once (technically, one after the other in quick succession, as fast as possible). Right now, based on airmail logs, it seems we run roughly 20k retention policies all at once.
We also seem to execute all GC calls at once, but scoped by index, which causes many consecutive calls, and much more often (every 10 or so minutes). That is something that can also be improved upon.
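One possible way to avoid the burst, sketched here as an illustration only: give each index a stable offset inside a spreading window, so that indexes sharing a schedule do not all fire on the same tick. This is not Quickwit's implementation; `jitter_offset`, the 10-minute window, and the index IDs are hypothetical names and values chosen for the example.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::time::Duration;

/// Hypothetical helper (not a Quickwit API): maps an index ID to a delay
/// in `[0, window)`. The same ID always gets the same delay within one
/// build of the program, so each index keeps a fixed slot in the window.
fn jitter_offset(index_id: &str, window: Duration) -> Duration {
    let window_ms = window.as_millis() as u64;
    if window_ms == 0 {
        // No window to spread over: run right on the tick.
        return Duration::ZERO;
    }
    // `DefaultHasher::new()` uses fixed keys, so the result is repeatable,
    // but the algorithm is not guaranteed stable across Rust releases.
    let mut hasher = DefaultHasher::new();
    index_id.hash(&mut hasher);
    Duration::from_millis(hasher.finish() % window_ms)
}

fn main() {
    // Assumed spreading window of 10 minutes.
    let window = Duration::from_secs(600);
    for index_id in ["logs-a", "logs-b", "logs-c"] {
        let delay = jitter_offset(index_id, window);
        println!("{index_id}: run retention {}s after the tick", delay.as_secs());
    }
}
```

With something like this, the scheduler would sleep for `jitter_offset(index_id, window)` after the cron tick before applying the retention policy for that index. A hash-based offset keeps the delay predictable per index; a random delay would spread the load just as well if predictability does not matter.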
In https://github.com/quickwit-oss/quickwit/pull/5346 we have spotted that our implementation of delete index was too aggressive.
For airmail, their internal job deleting a large number of indexes ended up hammering the metastore, hence disrupting indexing.
We want to make sure that we don't have a similar pattern in the janitor, in particular when running the retention policy.