projectnessie / nessie

Nessie: Transactional Catalog for Data Lakes with Git-like semantics
https://projectnessie.org
Apache License 2.0

Content aware merge operations #2513

Open snazy opened 3 years ago

snazy commented 3 years ago

Merge operations in Nessie "only" copy the commits made since the common ancestor from one reference onto another. Nessie itself does not interpret the contents of those commits. While the Nessie merge operation is technically correct and works as designed, it prevents multiple, "nested" merge operations.

Example:

CREATE TABLE foo...;
-- User 1
CREATE BRANCH branch_a;
INSERT INTO foo VALUES ('abc'); -- on branch_a
-- User 2
CREATE BRANCH branch_b;
INSERT INTO foo VALUES ('def'); -- on branch_b
-- User 1
MERGE branch_a INTO main;
SELECT * FROM foo ... ; -- returns 'abc'
-- User 2
MERGE branch_b INTO main; -- CONFLICT

(Note: the above behavior is true for all Nessie versions)
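The conflict above can be reproduced with a toy model. This is a minimal sketch, not Nessie code: it assumes a commit is a mapping from a content key (the table name) to an opaque metadata pointer, and that a merge is rejected when both sides changed the same key since the common ancestor. The names `merge`, `MergeConflict`, and the metadata file names are hypothetical.

```python
class MergeConflict(Exception):
    """Raised when both sides changed the same content key."""


def changed_keys(commits):
    """Return every content key touched by any commit in the list."""
    return {key for commit in commits for key in commit}


def merge(target_since_ancestor, source_since_ancestor):
    """Content-agnostic merge: copy the source commits onto the target.

    Each commit is a dict {content_key: opaque_metadata_pointer}. The merge
    never looks inside the pointers; it only checks whether both sides
    touched the same key since the common ancestor (an assumed rule).
    """
    overlap = changed_keys(target_since_ancestor) & changed_keys(source_since_ancestor)
    if overlap:
        raise MergeConflict(f"keys changed on both sides: {sorted(overlap)}")
    return target_since_ancestor + source_since_ancestor


# Common ancestor: main right after CREATE TABLE foo.
branch_a = [{"foo": "metadata-a.json"}]  # User 1 inserts 'abc'
branch_b = [{"foo": "metadata-b.json"}]  # User 2 inserts 'def'

main = merge([], branch_a)  # first merge: main has no changes yet
try:
    merge(main, branch_b)   # second merge: "foo" changed on both sides
    second_merge = "merged"
except MergeConflict:
    second_merge = "conflict"
```

Both inserts only append rows, so a merge that understood the table contents could combine them. The model shows why a merge that compares only pointers cannot.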

I think, we have to have a "Nessie aware merge operation" in Iceberg itself, that properly

From some investigation, Iceberg already contains the code for the building blocks:

What's missing:

It is unclear whether we have to cherry-pick all Iceberg snapshots since the common ancestor, or whether it is sufficient to cherry-pick only the most recent Iceberg snapshot (and the current schema). Technically, the most recent Iceberg snapshot (and current schema) should be sufficient. But without the intermediate snapshots, the change history provided by the Iceberg snapshots would be lost or become incomplete.

I am not sure whether SnapshotManager.cherrypick(long snapshotId) already handles this: the "snapshot log" in TableMetadata must stay consistent.
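To make "consistent" concrete, here is a hedged sketch of one possible check. It is not Iceberg code. It assumes the snapshot log is a list of (timestamp_ms, snapshot_id) entries, oldest first, and that consistency means three things: timestamps never go backwards, every entry refers to a known snapshot, and the last entry is the current snapshot. The function name and these rules are assumptions for illustration.

```python
def snapshot_log_is_consistent(snapshot_ids, log, current_snapshot_id):
    """Check an assumed notion of snapshot-log consistency.

    snapshot_ids: set of snapshot ids present in the table metadata
    log: list of (timestamp_ms, snapshot_id) tuples, oldest first
    current_snapshot_id: id of the table's current snapshot, or None
    """
    if not log:
        return current_snapshot_id is None
    timestamps = [ts for ts, _ in log]
    ordered = all(a <= b for a, b in zip(timestamps, timestamps[1:]))
    known = all(sid in snapshot_ids for _, sid in log)
    ends_at_current = log[-1][1] == current_snapshot_id
    return ordered and known and ends_at_current


# main has snapshots 1 and 2; the branch produced snapshot 3 at t=150.
main_log = [(100, 1), (200, 2)]

# Cherry-picking creates a NEW snapshot (id 4) stamped at apply time.
good = snapshot_log_is_consistent({1, 2, 4}, main_log + [(300, 4)], 4)

# Copying the branch's log entry verbatim puts t=150 after t=200.
bad = snapshot_log_is_consistent({1, 2, 3}, main_log + [(150, 3)], 3)
```

Under these assumptions, replaying a branch snapshot as a new snapshot keeps the log ordered, while copying the branch's own log entries onto the target does not.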

I also think that the functionality to do the above is not purely related to Nessie: it does not even have to touch Nessie code in Iceberg. It is, strictly speaking, "just" Iceberg functionality that produces a new TableMetadata, which then gets committed via the NessieTableOperations.

We can probably implement it as an Iceberg procedure next to CherrypickSnapshotProcedure for Spark 3.x.

The same mechanism should also be implemented for Delta Lake, but better as a separate issue / PR.

dimas-b commented 3 years ago

Since we are talking about intelligent merge operations / content manipulation, would it make sense to support multiple parents in Nessie commits (like in git)?

With a content-aware merge, I guess the contents on the base branch may differ non-trivially from both the old base contents and the contents being merged. Therefore, it might be valuable to preserve the lineage of changes (unless the merge is a fast-forward).
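A small sketch of what multiple parents would buy in terms of lineage. This is a toy model, not the Nessie commit format: the `Commit` class, the `ancestors` helper, and the commit ids are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    id: str
    parents: tuple = ()  # one parent for normal commits, two for a merge


def ancestors(commit_id, commits):
    """Collect the ids of all commits reachable through parent links."""
    seen, stack = set(), [commit_id]
    while stack:
        for parent in commits[stack.pop()].parents:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


base = {
    "c0": Commit("c0"),           # common ancestor
    "m1": Commit("m1", ("c0",)),  # commit on main
    "b1": Commit("b1", ("c0",)),  # commit on the branch
}

# Merge commit that records both sides, as git does.
two_parents = dict(base, m2=Commit("m2", ("m1", "b1")))
# Merge result recorded as an ordinary commit on main.
one_parent = dict(base, m2=Commit("m2", ("m1",)))

with_lineage = ancestors("m2", two_parents)
without_lineage = ancestors("m2", one_parent)
```

With two parents the branch commit `b1` stays reachable from the merge result. With a single parent it is no longer part of the history of `m2`.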

harshm-dev commented 3 years ago

A few observations:

It is unclear whether we have to cherry-pick all Iceberg snapshots since the common ancestor, or whether it is sufficient to cherry-pick only the most recent Iceberg snapshot (and the current schema). Technically, the most recent Iceberg snapshot (and current schema) should be sufficient. But without the intermediate snapshots, the change history provided by the Iceberg snapshots would be lost or become incomplete.

SnapshotManager.cherrypick() handles this via the delta of added/deleted data files, but it ignores existing data files on the assumption that they are unchanged. For the merge case, we'll need an aggregated view of the added/deleted data files across all snapshots since the fork point. The two approaches are to cherry-pick the snapshots one by one, or to aggregate them and merge in a single snapshot. In Delta Lake, too, each commit log file maintains only the delta.
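The two approaches can be compared on a toy model. This is a sketch under simplifying assumptions, not Iceberg code: a snapshot is reduced to the sets of data files it added and deleted, the target branch has not changed since the fork, and the names `Snapshot`, `apply`, and `aggregate` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    added: frozenset    # data files added by this snapshot
    deleted: frozenset  # data files deleted by this snapshot


def apply(files, snapshot):
    """Apply one delta to a set of live data files."""
    return (files - snapshot.deleted) | snapshot.added


def aggregate(snapshots):
    """Fold a sequence of deltas into one net delta."""
    net_added, net_deleted = set(), set()
    for snap in snapshots:
        for f in snap.deleted:
            if f in net_added:
                net_added.discard(f)  # added and removed since the fork
            else:
                net_deleted.add(f)    # existed before the fork
        net_added.update(snap.added)
    return Snapshot(frozenset(net_added), frozenset(net_deleted))


fork_point = frozenset({"f1"})
branch = [
    Snapshot(frozenset({"f2"}), frozenset()),            # append
    Snapshot(frozenset({"f3"}), frozenset({"f1", "f2"})),  # rewrite
]

# Approach 1: cherry-pick one by one (two snapshots on the target).
one_by_one = fork_point
for snap in branch:
    one_by_one = apply(one_by_one, snap)

# Approach 2: aggregate, then apply once (one snapshot on the target).
combined = aggregate(branch)
aggregated = apply(fork_point, combined)
```

Both approaches end with the same live files in this model. The difference is the history: the first keeps one target snapshot per branch snapshot, the second records only the net change, in which `f2` never appears.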

Some thoughts in favour of cherry-picking each commit -

Cons of cherry-picking each commit -

snazy commented 1 year ago

#6631 adds the Nessie-side support for this.