ibis-project / ibis

the portable Python dataframe library
https://ibis-project.org
Apache License 2.0
5.14k stars 590 forks source link

feat: Iceberg table support #7712

Open lostmygithubaccount opened 9 months ago

lostmygithubaccount commented 9 months ago

Is your feature request related to a problem?

Support Iceberg tables in Ibis

Using: https://github.com/apache/iceberg-python

Main blocker is write support, tracked here: https://github.com/apache/iceberg-python/issues/23

Describe the solution you'd like

ibis.read_iceberg

table.to_iceberg

What version of ibis are you running?

n/a

What backend(s) are you using, if any?

local backends that would support this

Code of Conduct

cpcloud commented 9 months ago

I see a number of technical issues with the iceberg python client that I think are blockers for using it as the basis for iceberg support in Ibis:

  1. It appears to only support in-memory results. This seems like it defeats the purpose of using iceberg in python unless you can guarantee your projections and filters are selective enough that they allow results to fit in memory.
  2. The python client seems to want to own any compute related to projections and filters, which again seems to defeat the purpose of decoupling storage and compute.

At the very least, we'd need to be able to get back a PyArrow Dataset that can be streamed into a query engine like DuckDB before we can consider using the iceberg python client.

cpcloud commented 9 months ago

I think a better option might be https://duckdb.org/docs/extensions/iceberg.html at least for DuckDB.

deepyaman commented 8 months ago

Main blocker is write support, tracked here: apache/iceberg-python#23

This, at least, is resolved. :)

lostmygithubaccount commented 8 months ago

@deepyaman any interest in taking a stab at this?

deepyaman commented 8 months ago

@deepyaman any interest in taking a stab at this?

Is this ready for implementation? It seems @cpcloud's first concern is resolved, but is the second one? We could get a pa.Table from DataScan.to_pyarrow, but that would mean not pushing down the projection and/or leaving the execution up to pyiceberg.

mfatihaktas commented 7 months ago

I see a number of technical issues with the iceberg python client that I think are blockers for using it as the basis for iceberg support in Ibis:

  1. It appears to only support in-memory results. This seems like it defeats the purpose of using iceberg in python unless you can guarantee your projections and filters are selective enough that they allow results to fit in memory.
  2. The python client seems to want to own any compute related to projections and filters, which again seems to defeat the purpose of decoupling storage and compute.

As far as I know, Iceberg is a table format for compute engines (e.g., Spark) to work with. Along that line, I think it is expected for pyiceberg to execute projections and filters in memory in the absence of an intermediate compute engine. Iceberg maintains a rich set of meta-data for the tables, which enables scanning the meta-data to (significantly) reduce the number of (partition) files pulled. However, yes, it is on the user to make sure the results fit in memory.

As raised in 2. above, pyiceberg.to_arrow() first calls plan_files() to get a list of relevant files, then calls project_table() to run the projections and filters in-memory and returns the data in a pyarrow table.

At the very least, we'd need to be able to get back a PyArrow Dataset that can be streamed into a query engine like DuckDB before we can consider using the iceberg python client.

Looking at the implementation of pyiceberg.to_arrow(), my initial impression is that it should be straightforward to (1) scan the Iceberg table and pull only the relevant files, (2) put the files in a pyarrow dataset.

@cpcloud Do these points make sense to you? If they do, I can take a stab at this issue.

Disclaimer: My understanding of Iceberg might not be fully correct as my knowledge of Iceberg is limited :)

Refs: