presto iceberg connector error

prestodb / presto

The official home of the Presto distributed SQL query engine for big data

http://prestodb.io

Apache License 2.0


presto iceberg connector error #16858

Open yangli9988 opened 3 years ago

yangli9988 commented 3 years ago

Hello!

After deleting the rows with num = 2 through the Iceberg API, Presto query results still show the rows with num = 2.

[image: presto]

Reading the same table with Flink shows that the rows with num = 2 are filtered out and no longer displayed.

[image: flink]

Flink uses iceberg-data to filter out the deleted data.

[image: jar]

However, I could not find the same filtering operation in Presto's Iceberg connector, and the iceberg-data jar is not included there. I hope you can fix this problem or provide guidance on how to fix it.

Thank you.

@Zhenxiao Luo @Beinan Wang @Chunxu Tang

beinan commented 3 years ago

@yangli9988 Good suggestion! We didn't implement row-level deletion in the first version, but I think it's a nice thing to have, and it's also on our roadmap. We will take a look very soon.

Linking the PR on the Iceberg side for row deletion: https://github.com/apache/iceberg/pull/1309

We also encourage you to implement this feature. Feel free to ping me if you need any further support.

yangli9988 commented 3 years ago

@beinan To read data, Presto uses its own presto-parquet reader rather than iceberg-parquet. The two are very different, and they also wrap data rows differently. The List<DeleteFile> deletes() information carried by Iceberg's FileScanTask is discarded. Removing the deleted data requires filtering after the rows have been iterated, and that step does not happen in the connector. It would require changing Presto's core modules, which may affect the normal execution of other connectors.

I hope you can give me some hints: in which specific classes can the complete row data be read, so that the iceberg-data filtering can be introduced there with the smallest code change and without affecting other projects?
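To make the filtering step described above concrete, here is a minimal, self-contained sketch of applying deletes while iterating rows. It is an illustration only and does not use Presto or Iceberg classes: DeleteFilterSketch, DeleteFilteringIterator, readWithDeletes and row are hypothetical names, rows are modeled as plain maps, and the delete information that would come from FileScanTask.deletes() is assumed to be already loaded into in-memory sets (row positions for position deletes, values of one column for equality deletes). Details such as sequence numbers and per-file paths are left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.Set;

/** Hypothetical sketch; none of these names exist in Presto or Iceberg. */
public class DeleteFilterSketch {

    /** Wraps a row iterator and skips rows hidden by position or equality deletes. */
    static final class DeleteFilteringIterator implements Iterator<Map<String, Object>> {
        private final Iterator<Map<String, Object>> rows;
        private final Set<Long> deletedPositions;   // assumed: positions read from position delete files
        private final String equalityColumn;        // assumed: single equality-delete column
        private final Set<Object> deletedValues;    // assumed: values read from equality delete files
        private long position = -1;                 // position of the last row pulled from the file
        private Map<String, Object> next;

        DeleteFilteringIterator(Iterator<Map<String, Object>> rows, Set<Long> deletedPositions,
                                String equalityColumn, Set<Object> deletedValues) {
            this.rows = rows;
            this.deletedPositions = deletedPositions;
            this.equalityColumn = equalityColumn;
            this.deletedValues = deletedValues;
        }

        @Override
        public boolean hasNext() {
            // Advance until a surviving row is found or the input is exhausted.
            while (next == null && rows.hasNext()) {
                Map<String, Object> candidate = rows.next();
                position++;
                if (deletedPositions.contains(position)) {
                    continue; // removed by a position delete
                }
                if (deletedValues.contains(candidate.get(equalityColumn))) {
                    continue; // removed by an equality delete
                }
                next = candidate;
            }
            return next != null;
        }

        @Override
        public Map<String, Object> next() {
            if (!hasNext()) {
                throw new NoSuchElementException();
            }
            Map<String, Object> result = next;
            next = null;
            return result;
        }
    }

    /** Reads all rows of one data file, dropping deleted ones. */
    public static List<Map<String, Object>> readWithDeletes(List<Map<String, Object>> fileRows,
            Set<Long> deletedPositions, String equalityColumn, Set<Object> deletedValues) {
        List<Map<String, Object>> out = new ArrayList<>();
        Iterator<Map<String, Object>> it = new DeleteFilteringIterator(
                fileRows.iterator(), deletedPositions, equalityColumn, deletedValues);
        while (it.hasNext()) {
            out.add(it.next());
        }
        return out;
    }

    /** Builds a one-column row with the given num value. */
    public static Map<String, Object> row(int num) {
        Map<String, Object> r = new HashMap<>();
        r.put("num", num);
        return r;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> rows = new ArrayList<>();
        rows.add(row(1));
        rows.add(row(2));
        rows.add(row(3));
        Set<Object> deletedValues = new HashSet<>();
        deletedValues.add(2); // equality delete: num = 2
        System.out.println(readWithDeletes(rows, new HashSet<Long>(), "num", deletedValues));
    }
}
```

The point of the sketch is where the filter sits: it wraps the row stream produced by the file reader, so the rest of the read path does not need to know about deletes. Whether an equivalent hook exists at the row or page level in Presto's Parquet reader is exactly the open question in this comment.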

beinan commented 3 years ago

@yangli9988 Good call! I just talked to the Iceberg authors den and blue this morning. In the long term, we might prefer to use Iceberg's IO classes instead of Presto's. That would make it much easier to adopt new Iceberg features in the future, but it might require code changes on both the Presto and Iceberg sides.

So in the short term, we're happy to make a patch on the existing Presto code. I will try to go through the current implementation, and I think we can work together if you like.

yangli9988 commented 3 years ago

@beinan I am willing to work together to solve this problem.

beinan commented 3 years ago

Looks like the Iceberg PR is ready for review: https://github.com/apache/iceberg/pull/3210