quickwit-oss / quickwit

Cloud-native search engine for observability. An open-source alternative to Datadog, Elasticsearch, Loki, and Tempo.
https://quickwit.io

Make sure the janitor does not spam the metastore too much #5353

Closed: fulmicoton closed this issue 6 days ago

fulmicoton commented 2 months ago

In https://github.com/quickwit-oss/quickwit/pull/5346 we have spotted that our implementation of delete index was too aggressive.

For airmail, their internal job deleting a large number of indexes ended up hammering the metastore, hence disrupting indexing.

We want to make sure that we don't have a similar pattern in the janitor, in particular when running the retention policy.

fulmicoton commented 2 months ago

(@trinity-1686a maybe there is no problem... If so, please just comment here and close the ticket)

trinity-1686a commented 2 months ago

There is definitely a problem here. Last I checked, the retention policy is executed on a strict cron-like schedule. If many indexes share the same schedule, they all run at once (technically, one after the other in quick succession, as fast as possible). Right now, based on airmail logs, it seems we run roughly 20k retention policies all at once.
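One common way to avoid this kind of burst is to give each index a stable offset inside its schedule period, so that indexes sharing a schedule do not all fire at the same instant. The following is only a minimal sketch of that idea, not Quickwit's actual code: `retention_offset` is a hypothetical helper, and deriving the offset from a hash of the index ID is an assumption made for illustration.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::time::Duration;

/// Hypothetical helper (not part of Quickwit): returns a deterministic
/// offset in `[0, period)` derived from the index ID. Adding this offset
/// to the scheduled time spreads indexes with the same schedule over the
/// whole period instead of running them back to back.
fn retention_offset(index_id: &str, period: Duration) -> Duration {
    // `DefaultHasher::new()` uses fixed keys, so the same ID always maps
    // to the same offset within a given Rust release. The algorithm is
    // not guaranteed to be stable across releases.
    let mut hasher = DefaultHasher::new();
    index_id.hash(&mut hasher);
    // Guard against a zero-length period to avoid a modulo by zero.
    let period_secs = period.as_secs().max(1);
    Duration::from_secs(hasher.finish() % period_secs)
}
```

With an hourly schedule, `retention_offset("some-index", Duration::from_secs(3600))` yields a fixed delay under one hour for that index, so 20k indexes would be spread across the hour rather than evaluated in one burst.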

trinity-1686a commented 2 months ago

We also seem to execute all GC calls at once, but scoped by index, which causes many consecutive calls. This happens much more often (every 10 or so minutes). That's something that can also be improved upon.
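One possible direction, sketched here under stated assumptions, is to group index IDs into batches and pause between batches, so that the metastore sees fewer and better-spaced requests. Everything in this sketch is hypothetical: `run_gc_in_batches`, its parameters, and the `gc_call` closure stand in for whatever the janitor really does, and it assumes a single GC request can cover several indexes, which this thread does not confirm.

```rust
use std::thread;
use std::time::Duration;

/// Hypothetical sketch (not Quickwit's API): runs `gc_call` once per batch
/// of index IDs instead of once per index, sleeping `pause` between
/// batches. Returns the number of calls issued.
fn run_gc_in_batches<F>(
    index_ids: &[String],
    batch_size: usize,
    pause: Duration,
    mut gc_call: F,
) -> usize
where
    F: FnMut(&[String]),
{
    let mut num_calls = 0;
    // `chunks` panics on a chunk size of 0, so clamp it to at least 1.
    for batch in index_ids.chunks(batch_size.max(1)) {
        if num_calls > 0 {
            // Leave the metastore some room between two batches.
            thread::sleep(pause);
        }
        gc_call(batch);
        num_calls += 1;
    }
    num_calls
}
```

With 20k indexes and a batch size of 100, this issues 200 calls per GC pass instead of 20k. The right batch size and pause would have to be chosen from actual metastore load.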