trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)

Support branching of Iceberg tables #12844

Open findinpath opened 2 years ago

findinpath commented 2 years ago
trino:default> insert into "test1@3692406374410249224" values (10);
INSERT: 1 row

Query 20220614_205840_00020_m6z2x, FINISHED, 1 node
Splits: 19 total, 19 done (100.00%)
1.16 [0 rows, 0B] [0 rows/s, 0B/s]

trino:default> insert into test1 FOR VERSION AS OF 3692406374410249224 values (10);
Query 20220614_205904_00021_m6z2x failed: line 1:19: mismatched input 'FOR'. Expecting: '(', <query>
insert into test1 FOR VERSION AS OF 3692406374410249224 values (10)

By using the now-deprecated syntax, we are able to insert into a versioned Iceberg table. However, when using the FOR VERSION AS OF syntax, this functionality is no longer supported.

This functionality will likely require extending the SqlBase.g4 grammar.

alexjo2144 commented 2 years ago

This shouldn't work, right? You should only be able to write data to the most recent version of a table.

electrum commented 2 years ago

This was never intended behavior. Does this do anything different than inserting into the base table?

homar commented 2 years ago

Inserting into an old snapshot looks strange. Instead of a linear history, we would get a tree.

findinpath commented 2 years ago

> Instead of a linear history, we would get a tree.

Think of zero-copy clones. Instead of copying gigabytes of data with a CTAS statement and waiting until the copy of the table is done, we could have several independent "branches" of a table.
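
A minimal sketch of the contrast (the CREATE BRANCH statement below is hypothetical, not existing Trino syntax, and the table/branch names are made up):

-- today: a CTAS materializes a full physical copy of the data
CREATE TABLE sales_dev AS SELECT * FROM sales;

-- zero-copy branch: a metadata-only pointer to the current snapshot,
-- created in O(1) time regardless of table size
ALTER TABLE sales CREATE BRANCH dev;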

findinpath commented 2 years ago

> Does this do anything different than inserting into the base table?

No, it doesn't at the time of this writing.

alexjo2144 commented 2 years ago

I'd reword this as a broader issue to support branching and tagging, then. I'm not sure how far that project has made it into implementation, though. @rdblue would know.

rdblue commented 2 years ago

What is the actual behavior when you write to a table at some version? Does it create a snapshot with that version as the parent?

Right now, branching and tagging is about adding the metadata to the table. We haven't added the ability to commit to a branch yet. We are thinking that could look something like what is above, but the branch needs to be identified by name. In the Iceberg API, we'll add toBranch:

// proposed Iceberg API: stage an append and commit it to the "test" branch
table.newAppend()
    .toBranch("test")
    .addFile(FILE_A)
    .commit();

In SQL, that could be INSERT INTO 'table@branch' ... but it could also be INSERT INTO table BRANCH branch .... It's up to engines to decide (though hopefully we can agree on similar syntax).

I don't think that we would allow doing the same thing with tags or individual commits (which are very similar to tags). Since those represent points in time, it makes no sense.

martint commented 2 years ago

> By using the now-deprecated syntax, we are able to insert into a versioned Iceberg table.

That should probably be rejected, as it’s semantically meaningless and potentially confusing based on current behavior. It’s impossible to insert into a specific version of a table without the version fundamentally changing.

> Think of zero-copy clones. Instead of copying gigabytes of data with a CTAS statement and waiting until the copy of the table is done, we could have several independent "branches" of a table.

There’s no concept of branches in standard SQL. That brings up all sorts of complexities, such as: can you ever merge branches or are they effectively independent tables at that point?

Also, note that a CTAS doesn’t necessarily need to copy data or be “slow”. It’s an implementation detail whether the data is read and written, or whether a more efficient approach such as creating an O(1) snapshot is used (not currently supported by the connector APIs or the engine, though).

findepi commented 2 years ago

> By using the now-deprecated syntax, we are able to insert into a versioned Iceberg table.

> That should probably be rejected, as it’s semantically meaningless and potentially confusing based on current behavior. It’s impossible to insert into a specific version of a table without the version fundamentally changing.

Agreed. @findinpath, can you please file a separate bug issue about this?

findinpath commented 2 years ago

@findepi I've created https://github.com/trinodb/trino/issues/12860

rdblue commented 2 years ago

> Can you ever merge branches or are they effectively independent tables at that point?

@martint, I am pushing against any expectation that branches are ever merged. The use cases we're targeting don't involve merging.

Merging branches is a really bad idea because it is essentially a transaction that is open for a long time. The chances of being able to commit the transaction safely get lower and lower as the branch ages. And what's even worse is that you'd have to encode all of the transaction information (like what other tables were read and the versions that were read) into the branch. Branches for transactions are a really, really bad idea.

itaise commented 11 months ago

@rdblue Are Iceberg branches fully supported in Trino? I can't find resources that show the relevant DDL, thanks!

findinpath commented 11 months ago

@itaise:

> Are Iceberg branches fully supported in Trino?

Trino now supports querying by tag/branch name; see https://trino.io/docs/current/connector/iceberg.html#time-travel-queries. However, Trino does not currently support creating tags or branches.
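
For example, per the linked documentation (the table and branch/tag names here are placeholders):

-- read the head of a branch, or the snapshot a tag points to, by name
SELECT * FROM orders FOR VERSION AS OF 'my-branch';
SELECT * FROM orders FOR VERSION AS OF 'my-tag';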

nicor88 commented 11 months ago

@findinpath is there any plan to support "creating" tags or branches from Trino? This would make it easy to implement WAP, for example with dbt.

findinpath commented 11 months ago

@nicor88 this topic is actually being analyzed at the moment. Please dive in with a more detailed usage scenario (the more details the better) of what you have in mind.

itaise commented 11 months ago

I will just say that what we would like to do for data quality is a WAP process like the one in this article: https://www.dremio.com/blog/streamlining-data-quality-in-apache-iceberg-with-write-audit-publish-branching/. Write to a specific branch, audit that branch (with data quality tools), and if the results are good, make that branch the main branch.
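
For reference, this is roughly what that flow looks like with Iceberg's Spark integration today (names are placeholders, and none of this branch DDL exists in Trino yet):

-- 1. create an audit branch pointing at the table's current snapshot
ALTER TABLE db.events CREATE BRANCH audit;

-- 2. write new data to the branch instead of to main
INSERT INTO db.events.branch_audit VALUES (1, 'pending-validation');

-- 3. audit the branch with data quality checks
SELECT count(*) FROM db.events.branch_audit;

-- 4. if the audit passes, fast-forward main to the audited branch
CALL system.fast_forward('db.events', 'main', 'audit');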

kmurra commented 10 months ago

Hello @findinpath, we have a use case for a WAP (write-audit-publish) workflow. Basically, we want our users to have a good way to test changes to their transform logic. Due to compliance requirements, we can't use production data in our test environment, but it is very difficult to get representative data in our test environment from test systems. Since our organization uses a service-oriented architecture, test environments have test data populated in isolation and so invariants are often not respected (i.e. a transaction must be assigned to an existing customer with contact information -- always true in prod, not so in test).

A write-audit-publish workflow fixes this because we could have users check in their code and deploy it to some environment between test and prod. The code would run against a branch of production data, and then verify the two branches of data. If the user thinks the changes are fine, we publish the data, and make future runs of their transform logic automatically publish the change.

Essentially, we would like a feature that is similar to Snowflake's zero-copy clone.

I can go into more detail if necessary.

lorransr commented 6 months ago

We have a similar use case for WAP:

Currently, we use dbt to transform data, and more than 40 people work concurrently on our project. We have duplicated our lake zones so that each developer can run their own tests. This duplicated infrastructure is also used to run our CI, ensuring that any modification won't break our pipeline.

To reduce data duplication, we could use WAP. Another case is testing incremental models. For example, I could create a branch from a table and test whether a dbt incremental model would work on that table. We already face challenges in ensuring the same behavior as in production, and WAP would make this possible.

Another option is to include a step in our pipeline. For instance, for some tables we need to ensure a certain level of quality, so we only publish if we meet this quality score: we branch the tables, test them, and if everything is OK we commit the changes.

JunseoChoJJ commented 4 months ago

@lorransr then do you use the Nessie catalog?

lorransr commented 4 months ago

@JunseoChoJJ we use Tabular as our data catalog.

konovenski commented 6 days ago

Is this still being considered? I think it would be a really good feature for testing pipelines over large datasets without replication.