trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
https://trino.io
Apache License 2.0
10.49k stars 3.02k forks source link

Incremental read support for Iceberg tables #8780

Open skandasa23 opened 3 years ago

skandasa23 commented 3 years ago

Trino currently supports reading data belonging to a particular Iceberg snapshot. Incremental read support helps to read only the changed data between snapshots. Not sure of the Trino convention but something like this; select count(*) from iceberg.testdb."table@{S1,S2}" - outputs only the inserted rows between S1(exclusive) and S2(inclusive).

Spark support;

  1. append support after https://github.com/apache/iceberg/pull/315.
  2. delete/overwrite mutations is work in progress : https://github.com/apache/iceberg/pull/2782
findepi commented 3 years ago

That easy to imagine when files are just being added to a table -- you would just do a set operation on file names between snapshots How does spark handle situation when files are replaced (updated rows, small files combined into bigger, etc.)? Or when there are v2 deletion files?