Open snazy opened 4 months ago
For the record, I've been playing with a different approach using the Kubernetes Operator for Nessie: a new CRD called `NessieGc` that is reconciled into a `CronJob` (if recurring) or a `Job` (if one-shot).
Creating a `NessieGc` CRD manually creates a standalone GC job, either recurring or one-shot.
But more importantly, the main `Nessie` CRD has two new fields: `gc.enabled` and `gc.schedule`. If enabled, GC is then automatically started following the cron schedule, using the properties already defined in the `Nessie` CRD to configure the GC invocation. In this scenario, a `NessieGc` CRD is generated by the reconciler, and is a dependent resource whose lifecycle is tied to the parent `Nessie` CRD lifecycle.
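To make the reconciliation flow above concrete, here is a minimal sketch in plain Python of the two steps: deriving a dependent `NessieGc` resource from a `Nessie` resource, and mapping a `NessieGc` to either a `CronJob` or a `Job`. It is not the operator's actual code. Only `gc.enabled`, `gc.schedule` and the kinds `Nessie`, `NessieGc`, `CronJob` and `Job` come from the description above; the API group `example.org/v1alpha1`, the `-gc` name suffix, the `spec.schedule` field on `NessieGc`, and the container image are assumptions made for illustration.

```python
from typing import Optional


def derive_nessie_gc(nessie: dict) -> Optional[dict]:
    """Build the dependent NessieGc resource, or return None if GC is disabled."""
    gc = nessie.get("spec", {}).get("gc", {})
    if not gc.get("enabled", False):
        return None
    meta = nessie["metadata"]
    return {
        "apiVersion": nessie["apiVersion"],
        "kind": "NessieGc",
        "metadata": {
            "name": f"{meta['name']}-gc",  # hypothetical naming scheme
            # The owner reference ties the lifecycle to the parent resource:
            # Kubernetes garbage-collects the dependent when the owner goes away.
            "ownerReferences": [{
                "apiVersion": nessie["apiVersion"],
                "kind": nessie["kind"],
                "name": meta["name"],
                "uid": meta["uid"],
                "controller": True,
            }],
        },
        "spec": {"schedule": gc.get("schedule")},  # hypothetical field
    }


def reconcile_nessie_gc(nessie_gc: dict, image: str = "example/nessie-gc:latest") -> dict:
    """Map a NessieGc to a CronJob (recurring) or a Job (one-shot)."""
    name = nessie_gc["metadata"]["name"]
    schedule = nessie_gc.get("spec", {}).get("schedule")
    job_spec = {
        "template": {
            "spec": {
                "restartPolicy": "Never",
                "containers": [{"name": "gc", "image": image}],  # placeholder image
            }
        }
    }
    if schedule:
        # Recurring GC: wrap the job spec in a CronJob with the cron schedule.
        return {
            "apiVersion": "batch/v1",
            "kind": "CronJob",
            "metadata": {"name": name},
            "spec": {"schedule": schedule, "jobTemplate": {"spec": job_spec}},
        }
    # No schedule: a single one-shot Job.
    return {"apiVersion": "batch/v1", "kind": "Job", "metadata": {"name": name}, "spec": job_spec}
```

In this sketch, a `Nessie` resource with `gc.enabled: true` and `gc.schedule: "0 3 * * *"` would yield a `NessieGc` named `<name>-gc`, which in turn reconciles to a `CronJob` with that schedule. A manually created `NessieGc` without a schedule reconciles to a plain `Job`.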
Hi @snazy, after moving GC into the Nessie Catalog, we should support SQL syntax for GC. For example: VACUUM
Having to configure all the Iceberg and potentially Hadoop configuration options for Nessie GC is not particularly convenient. Nessie Catalog has all the object storage configurations and has access to the credentials.
Nessie GC is not particularly memory hungry; it is rather "just" a time-consuming process that requires a lot of object storage I/O.
Moving Nessie GC into Nessie Catalog feels like a natural follow-up, which eliminates a lot of configuration headaches.
It needs to be explored whether this change is a feasible option in multi-tenant scenarios.