[Open] lostmygithubaccount opened this issue 9 months ago
I see a number of technical issues with the iceberg python client that I think are blockers for using it as the basis for iceberg support in Ibis:
At the very least, we'd need to be able to get back a PyArrow Dataset that can be streamed into a query engine like DuckDB before we can consider using the iceberg python client.
I think a better option might be https://duckdb.org/docs/extensions/iceberg.html at least for DuckDB.
Main blocker is write support, tracked here: apache/iceberg-python#23
This, at least, is resolved. :)
@deepyaman any interest in taking a stab at this?
Is this ready for implementation? It seems @cpcloud's first concern is resolved, but is the second one? We could get a `pa.Table` from `DataScan.to_pyarrow`, but that would mean not pushing down the projection and/or leaving the execution up to `pyiceberg`.
> I see a number of technical issues with the iceberg python client that I think are blockers for using it as the basis for iceberg support in Ibis:
> - It appears to only support in-memory results. This seems like it defeats the purpose of using iceberg in python unless you can guarantee your projections and filters are selective enough that the results fit in memory.
> - The python client seems to want to own any compute related to projections and filters, which again seems to defeat the purpose of decoupling storage and compute.
As far as I know, Iceberg is a table format for compute engines (e.g., Spark) to work with. Along those lines, I think it is expected for `pyiceberg` to execute projections and filters in memory in the absence of an intermediate compute engine. Iceberg maintains rich metadata for its tables, and scanning that metadata can (significantly) reduce the number of (partition) files pulled. However, yes, it is on the user to make sure the results fit in memory.
As raised in point 2 above, `pyiceberg.to_arrow()` first calls `plan_files()` to get a list of relevant files, then calls `project_table()` to run the projections and filters in memory and return the data in a `pyarrow` table.
> At the very least, we'd need to be able to get back a PyArrow Dataset that can be streamed into a query engine like DuckDB before we can consider using the iceberg python client.
Looking at the implementation of `pyiceberg.to_arrow()`, my initial impression is that it should be straightforward to (1) scan the Iceberg table and pull only the relevant files, and (2) put those files in a `pyarrow` dataset.
@cpcloud Do these points make sense to you? If they do, I can take a stab at this issue.
Disclaimer: My understanding of Iceberg might not be fully correct as my knowledge of Iceberg is limited :)
Refs:
Is your feature request related to a problem?
Support Iceberg tables in Ibis
Using: https://github.com/apache/iceberg-python
Main blocker is write support, tracked here: https://github.com/apache/iceberg-python/issues/23
Describe the solution you'd like
- `ibis.read_iceberg`
- `table.to_iceberg`
What version of ibis are you running?
n/a
What backend(s) are you using, if any?
local backends that would support this