Open snazy opened 3 years ago
Since we are talking about intelligent merge operations / content manipulation, would it make sense to support multiple parents in Nessie commits (like in git)?
With a content-aware merge, I guess the contents on the base branch may have non-trivial differences from both old base contents and the contents being merged. Therefore, it might be valuable to preserve the lineage of changes (unless the merge is fast-forward).
A few observations:
Unclear, whether we have to cherry-pick all Iceberg snapshots since the common ancestor or whether it's sufficient to just cherry-pick the most recent Iceberg snapshot (and the recent schema). Technically, the most recent Iceberg snapshot (and current schema) should be sufficient. But without the intermediate snapshots the change history provided by the Iceberg snapshots would be lost or become incomplete.
`SnapshotManager.cherrypick()` handles this via the delta of added/deleted data files, but ignores existing data files, assuming they are unchanged. For the merge case, we'll need an aggregated view of added/deleted data files from all the snapshots since the point of fork. The two approaches are to cherry-pick the snapshots one by one, or to aggregate them and merge in a single snapshot. In DeltaLake too, each commit log file maintains only the delta.
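To make the "aggregate and merge in a single snapshot" option a bit more concrete, here is a minimal sketch of folding per-snapshot deltas into one net delta. It is plain Java and does not use the Iceberg API: `SnapshotDeltas`, `Delta` and `aggregate` are hypothetical names, and a snapshot is reduced to the sets of data file paths it added and deleted, which is an assumption made for illustration.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Simplified model: a snapshot is reduced to the data file paths it added and deleted.
// These types are hypothetical and are not part of the Iceberg API.
final class SnapshotDeltas {

    record Delta(Set<String> added, Set<String> deleted) {}

    // Folds the per-snapshot deltas (oldest first) into one net delta.
    static Delta aggregate(List<Delta> sinceFork) {
        Set<String> netAdded = new LinkedHashSet<>();
        Set<String> netDeleted = new LinkedHashSet<>();
        for (Delta d : sinceFork) {
            netAdded.addAll(d.added());
            for (String path : d.deleted()) {
                // A file added after the fork and deleted again later cancels out.
                // Otherwise the file predates the fork and must be deleted on the target.
                if (!netAdded.remove(path)) {
                    netDeleted.add(path);
                }
            }
        }
        return new Delta(netAdded, netDeleted);
    }

    public static void main(String[] args) {
        Delta s1 = new Delta(Set.of("a.parquet"), Set.of());
        Delta s2 = new Delta(Set.of("b.parquet"), Set.of("a.parquet")); // e.g. a rewrite
        Delta s3 = new Delta(Set.of("c.parquet"), Set.of("old.parquet"));
        Delta net = aggregate(List.of(s1, s2, s3));
        System.out.println("added=" + net.added() + " deleted=" + net.deleted());
    }
}
```

The trade-off is visible in the sketch: the net delta is enough to produce the final table state, but the intermediate steps (here, that `a.parquet` ever existed) disappear from the history, which is what cherry-picking one by one would preserve.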
Some thoughts in favour of cherry-picking each commit -
Cons of cherry-picking each commit -
Merge operations in Nessie "only" copy one or more commits from one reference onto another, since the common ancestor. Nessie itself does not interpret the meaning of the contents in the commits. While the Nessie merge operation is technically correct and works as designed, it prevents multiple, "nested" merge operations.
Example:
(Note: the above behavior is true for all Nessie versions)
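As a rough illustration of "copy the commits since the common ancestor", the following sketch models commits with exactly one parent and computes which commits a merge would copy. This is not Nessie code: `MergeSketch`, `Commit`, `commonAncestor` and `commitsToCopy` are hypothetical names, and the single-parent commit model is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class MergeSketch {

    // A commit with exactly one parent (null for the root). Hypothetical type, not Nessie's.
    record Commit(String id, Commit parent) {}

    // Collects the target's history, then returns the first source commit found in it.
    static Commit commonAncestor(Commit source, Commit target) {
        Set<String> seen = new HashSet<>();
        for (Commit c = target; c != null; c = c.parent()) {
            seen.add(c.id());
        }
        for (Commit c = source; c != null; c = c.parent()) {
            if (seen.contains(c.id())) {
                return c;
            }
        }
        return null; // no shared history
    }

    // Commits on source after the common ancestor, oldest first.
    // In this model, these are the commits a merge copies onto target.
    static List<String> commitsToCopy(Commit source, Commit target) {
        Commit base = commonAncestor(source, target);
        List<String> ids = new ArrayList<>();
        for (Commit c = source; c != null && c != base; c = c.parent()) {
            ids.add(c.id());
        }
        Collections.reverse(ids);
        return ids;
    }

    public static void main(String[] args) {
        Commit root = new Commit("c0", null);
        Commit main1 = new Commit("m1", root);
        Commit dev2 = new Commit("d2", new Commit("d1", root));
        System.out.println(commitsToCopy(dev2, main1));
    }
}
```

In this simplified model the copied commits carry no link back to the source branch, so a later merge from the same source would compute the same common ancestor again. That is the kind of lineage information that multiple parents per commit would record.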
I think we have to have a "Nessie-aware merge operation" in Iceberg itself, that properly
From some investigation, Iceberg already contains the code for the building blocks:

- `SchemaUpdate.unionByNameWith()` can merge two `Schema` objects
- `SnapshotManager.cherrypick(long snapshotId)` to cherry-pick one snapshot
- `Schema.sameSchema()` can compare two schema objects (semantically equivalent)

What's missing:
Table
PendingUpdate.commit()
leading to a single Nessie commit)

Unclear, whether we have to cherry-pick all Iceberg snapshots since the common ancestor or whether it's sufficient to just cherry-pick the most recent Iceberg snapshot (and the recent schema). Technically, the most recent Iceberg snapshot (and current schema) should be sufficient. But without the intermediate snapshots the change history provided by the Iceberg snapshots would be lost or become incomplete.
Not sure if `SnapshotManager.cherrypick(long snapshotId)` already tackles it: the "snapshot log" in `TableMetadata` must stay consistent.

I also think that the functionality to do the above is not purely related to Nessie - it does not even have to touch Nessie code in Iceberg. It is, strictly speaking, "just" Iceberg functionality that produces a new `TableMetadata`, which then gets committed via the `NessieTableOperations`.

We can probably implement it as an Iceberg procedure next to `CherrypickSnapshotProcedure` for Spark 3.x.

The same mechanism should also be done for Deltalake, but better as a separate issue / PR.
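For readers unfamiliar with the schema building block mentioned above, this is a minimal sketch of what a "union by name" of two schemas means. It is a plain-Java model and not Iceberg's implementation: `SchemaUnionSketch` and `unionByName` are hypothetical names, a schema is reduced to an ordered map of column name to type name, and conflicting types are simply rejected here, which is an assumption of the sketch rather than a statement about how Iceberg resolves them.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class SchemaUnionSketch {

    // Schema modeled as ordered column name -> type name (hypothetical simplification).
    // Columns of `base` keep their position; columns only in `other` are appended.
    static Map<String, String> unionByName(Map<String, String> base, Map<String, String> other) {
        Map<String, String> result = new LinkedHashMap<>(base);
        for (Map.Entry<String, String> col : other.entrySet()) {
            String existing = result.putIfAbsent(col.getKey(), col.getValue());
            // Assumption of this sketch: a type mismatch is treated as a conflict.
            if (existing != null && !existing.equals(col.getValue())) {
                throw new IllegalArgumentException(
                    "Conflicting types for column " + col.getKey()
                        + ": " + existing + " vs " + col.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> target = new LinkedHashMap<>();
        target.put("id", "long");
        target.put("name", "string");
        Map<String, String> source = new LinkedHashMap<>();
        source.put("name", "string");
        source.put("ts", "timestamp");
        System.out.println(unionByName(target, source));
    }
}
```

A content-aware merge would apply such a union to the schemas of the two branches before cherry-picking the data changes, so that the resulting table metadata can be committed in one step.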