Closed schobe closed 3 months ago
This is not a bug but intentional behavior. The versioning in Iceberg and Nessie are different. While Iceberg has only single table are consistency, it can by its design not guarantee consistency across multiple tables and views and namespaces. Nessie can guarantee consistency across multiple entities - that's actually the central design principle. Nessie uses the "Git model" and allows "branch and merge".
The disconnect is that Iceberg's history is not necessarily the history of the table in Nessie's commit log. So with Iceberg REST we intentionally opted to expose only the snapshot that belongs to the particular Nessie commit.
You can however do time-travel with Nessie - either by specifying the commit ID or a timestamp. An example can be found here.
@snazy: What if the user wants to move the table from Nessie catalog to another catalog? In that case when they register the latest table metadata, they can't do time travel because snapshot history is not maintained?
@snazy One more question, does this mean we can not do other data operations like expiring snapshot and orphan file clean up since the only current snapshot is maintained? What is another way for DataOps on Iceberg with Nessie approach?
What if the user wants to move the table from Nessie catalog to another catalog? In that case when they register the latest table metadata, they can't do time travel because snapshot history is not maintained?
I suppose Nessie could provide a tool to write the Iceberg snapshot lineage to metadata json starting with a particular commit for the purpose of migrating the table to another catalog.
This lineage will have to be limited to direct parents because the Iceberg format allows only a simple list of snapshots.
@schobe : Re: snapshot expiry and orphan files cleanup, please considering using the Nessie GC tool for that: https://projectnessie.org/nessie-latest/gc/
Nessie internally determines a particular snapshot's "liveness" bases on the full commit graph (which can be more complex than what the Iceberg format currently supports).
What happened
Hi,
I do not see a way to query the data from old snapshot of iceberg tables. If there is a change on Iceberg table, the snapshot id is changed and pointed to new snapshot id in the history. db.table.snapshots only shows the current snapshot id and parent id is null. When I tried to query from the older snapshot using select * from db.table version as of oldsnapshot_id, it says that could not find snapshot id.
Does Nessie provide a way to query from older snapshots if the snapshots versioning is maintained via RPS? Is there any documentation around this if it is not a bug?
Thanks & Regards
How to reproduce it
Nessie server type (docker/uber-jar/built from source) and version
0.90.4
Client type (Ex: UI/Spark/pynessie ...) and version
No response
Additional information
No response