projectnessie / nessie

Nessie: Transactional Catalog for Data Lakes with Git-like semantics
https://projectnessie.org
Apache License 2.0

[Catalog] Move GC functionality into Nessie Catalog #8733

Open snazy opened 4 months ago

snazy commented 4 months ago

Having to configure all the Iceberg and potentially Hadoop configuration options for Nessie GC is not particularly convenient. Nessie Catalog has all the object storage configurations and has access to the credentials.

Nessie GC is not extremely memory hungry; it is rather "just" a time-consuming process that requires a lot of object storage I/O.

Moving Nessie GC into Nessie Catalog feels like a natural follow-up, which eliminates a lot of configuration headaches.

It needs to be explored whether this change is a feasible option in multi-tenant scenarios.

adutra commented 3 months ago

For the record, I've been playing with a different approach using the Kubernetes Operator for Nessie: a new CRD called NessieGc that is reconciled into a CronJob (if recurring) or a Job (if one-shot).

Creating a NessieGc CRD manually creates a standalone GC job, either recurring or one-shot.
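
To illustrate the CronJob-versus-Job decision, here is a minimal Python sketch. It is not the operator's actual code. The spec fields `schedule`, `image` and `args`, the function name `reconcile_nessie_gc`, and the placeholder image name are all assumptions made for illustration; only the `batch/v1` Job and CronJob manifest shapes come from Kubernetes itself.

```python
from typing import Any, Dict


def reconcile_nessie_gc(name: str, spec: Dict[str, Any]) -> Dict[str, Any]:
    """Map a hypothetical NessieGc spec to a batch/v1 Job or CronJob manifest."""
    # Pod template shared by both outcomes; image and args are placeholders.
    pod_template = {
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "nessie-gc",
                    "image": spec.get("image", "nessie-gc:placeholder"),
                    "args": list(spec.get("args", [])),
                }
            ],
        }
    }
    job_spec = {"template": pod_template}

    schedule = spec.get("schedule")  # assumed field: cron expression or absent
    if schedule:
        # Recurring GC: wrap the job spec in a CronJob.
        return {
            "apiVersion": "batch/v1",
            "kind": "CronJob",
            "metadata": {"name": name},
            "spec": {"schedule": schedule, "jobTemplate": {"spec": job_spec}},
        }
    # One-shot GC: emit a plain Job.
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": job_spec,
    }


recurring = reconcile_nessie_gc("gc-nightly", {"schedule": "0 3 * * *"})
one_shot = reconcile_nessie_gc("gc-once", {})
print(recurring["kind"], one_shot["kind"])
```

With a `schedule` present the sketch yields a CronJob; without one it yields a Job.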

But more importantly, the main Nessie CRD has two new fields: gc.enabled and gc.schedule. If enabled, GC is then automatically started following the cron schedule, using the properties already defined in the Nessie CRD to configure the GC invocation. In this scenario, a NessieGc CRD is generated by the reconciler, and is a dependent resource whose lifecycle is tied to the parent Nessie CRD lifecycle.
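
The dependent-resource relationship could look roughly like the following Python sketch. Again, this is only an assumption-laden illustration, not the reconciler's real implementation: the function name `derive_nessie_gc`, the `-gc` name suffix, and the exact layout of the parent resource are hypothetical. The `gc.enabled` and `gc.schedule` fields are the ones mentioned above, and `ownerReferences` is the standard Kubernetes mechanism for tying a child's lifecycle to its parent.

```python
from typing import Any, Dict, Optional


def derive_nessie_gc(nessie: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Build a dependent NessieGc resource from a parent Nessie resource.

    Returns None when gc.enabled is false or missing.
    """
    gc = nessie.get("spec", {}).get("gc", {})
    if not gc.get("enabled", False):
        return None

    meta = nessie["metadata"]
    return {
        "apiVersion": nessie["apiVersion"],
        "kind": "NessieGc",
        "metadata": {
            "name": f"{meta['name']}-gc",  # hypothetical naming scheme
            "namespace": meta.get("namespace", "default"),
            # The owner reference lets Kubernetes garbage-collect the child
            # when the parent Nessie resource is deleted.
            "ownerReferences": [
                {
                    "apiVersion": nessie["apiVersion"],
                    "kind": nessie["kind"],
                    "name": meta["name"],
                    "uid": meta["uid"],
                    "controller": True,
                    "blockOwnerDeletion": True,
                }
            ],
        },
        # Only the schedule is copied here; a real reconciler would also
        # carry over storage and credential settings from the parent.
        "spec": {"schedule": gc.get("schedule")},
    }


parent = {
    "apiVersion": "example.invalid/v1alpha1",  # placeholder group/version
    "kind": "Nessie",
    "metadata": {"name": "lake", "namespace": "data", "uid": "1234"},
    "spec": {"gc": {"enabled": True, "schedule": "0 3 * * *"}},
}
child = derive_nessie_gc(parent)
print(child["metadata"]["name"], child["spec"]["schedule"])
```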

nqvuong1998 commented 2 months ago

Hi @snazy, after moving GC into the Nessie Catalog, we should support SQL syntax for GC, for example: VACUUM