ibis-project / ibis

the portable Python dataframe library
https://ibis-project.org
Apache License 2.0
5.02k stars 585 forks source link

feat: `to_duckdb` #9629

Closed MarcoGorelli closed 1 month ago

MarcoGorelli commented 1 month ago

Is your feature request related to a problem?

Say I have a iris.parquet file and read it using duckdb:

import ibis
import duckdb
con = duckdb.connect()
rel = con.read_parquet('iris.parquet')

rel is a DuckDBPyRelation object

I'd like to be able to connect to rel using Ibis, perform some operations using Ibis' API, and then get a DuckDBPyRelation out at the other end

So, to do something like:

tbl = ibis.read_duckdb(rel)  # Is there an existing API to do this? I know that I can read an in-memory table, or a csv file,
# but how do I connect to a DuckDB relation?
tbl = tbl.filter(ibis._['sepal_width'] > 4)
rel = tbl.to_duckdb()

such that, if I'd instead done

rel = rel.filter('sepal_width > 4')

then the result would have been the same

What is the motivation behind your request?

I'm very grateful that you're addressing https://github.com/ibis-project/ibis/pull/9564 and https://github.com/ibis-project/ibis/pull/9552, and I'm evaluating using Ibis in Narwhals

Describe the solution you'd like

described above

What version of ibis are you running?

9.1.0

What backend(s) are you using, if any?

duckdb

Code of Conduct

lostmygithubaccount commented 1 month ago

can you describe the user case for this? I believe you'd just go through Arrow on both ends

https://duckdb.org/docs/guides/python/import_arrow

https://duckdb.org/docs/guides/python/export_arrow

+ t = ibis.memtable(rel.arrow(), t.to_pyarrow() or t.to_pyarrow_batches()

lostmygithubaccount commented 1 month ago

other options could include creating it as a table/view/temp table or whatever in your duckdb database, and accessing it as a table in Ibis, but understanding the motivations behind this would help guide recommendations

MarcoGorelli commented 1 month ago

Say for example that I want to write a function to create squared features. Such a function can stay completely lazy, there's no need to go via pyarrow - in Polars for example I'd do:

def add_squared_features_polars(df: pl.LazyFrame, columns: list[str]) -> pl.LazyFrame:
    return df.with_columns((pl.col(col)**2).alias(f'{col}_squared') for col in columns)

I can also do that if I start with an ibis table (and this can also stay totally lazy without having to go via pyarrow)

def add_squared_features_ibis(tbl: ibis.Table, columns: list[str]) -> tbl.ibis.Table:
    return tbl.mutate((pl.col(col)**2).rename(f'{col}_squared') for col in columns)

I'm trying to see if I can do the same for a DuckDB Relation, and to see if I can re-use Ibis for that. I'd need to be able to do something like this:

def add_squared_features_duckdb(rel: duckdb.Relation, columns: list[str]) -> duckdb.Relation:
    # create ibis table from duckdb relation
    tbl = ibis.from_duckdb(rel)
    # do some operation with the Ibis API, which stays totally lazy
    tbl = tbl.mutate((pl.col(col)**2).rename(f'{col}_squared') for col in columns)
    # return the underlying duckdb relation
    return tbl.to_duckdb()

Is this possible?

Thanks


Solution without going through Ibis:

def add_squared_features_duckdb(rel: duckdb.Relation, columns: list[str]) -> duckdb.Relation:
    query = ','.join(f'{col}**2 as {col}_squared' for col in columns) + ', *'
    return rel.select(query)

I was trying to see whether Ibis could help me achieve that, so that instead of doing all this string parsing, I could just use Ibis' API

jcrist commented 1 month ago

If there was a way to get the backing duckdb connection from a DuckDBPyRelation object, then with #9603 you'd be able to create a bound ibis.Table mirroring the duckdb relation. AFAICT though this attribute/method doesn't exist in duckdb at the moment, and would require an upstream issue there.

With that, then you could use the bound ibis.Table to construct queries. With no additional work from us you could use ibis.to_sql(expr, dialect="duckdb") to get the SQL, then manually execute it on the duckdb.DuckDBPyConnection object yourself. Alternatively, a to_duckdb_relation method on our duckdb backend might make sense. I don't think any other code changes on our side would make sense though.

The full workflow would be something like:

duckdb_con = relation...  # TODO - get duckdb connection backing relation

con = ibis.duckdb.from_connection(duckdb_con)

ibis_table = con.sql(duckdb_con.sql_query())  # create an ibis table from the relation

query = ... # do your transforms in ibis

result = duckdb_con.sql(con.compile(query))  # return a new relation

# or if we add a `to_duckdb_relation` method
result = con.to_duckdb_relation(query)

Alternatively, you could also make use of only unbound tables (tables constructed without a backend connection in ibis). These could then be used to generate the SQL you want without having ibis do any backend-related things for you. The trick then is that you'd need to manage coercing the relation object to a matching ibis expression, which may be tricky with the limited metadata a relation object exposes.

# left as an exercise for the reader
ibis_schema = get_ibis_schema_from_duckdb_relation(...)

ibis_table = ibis.table(schema=ibis_schema, name="some_name_here")

Anyway, hopefully that's enough for you to go on here. As I said above, a to_duckdb_relation method (or some better name) on the ibis duckdb backend class might make sense, but I'm not sure if we can offer any more support beyond that.

jcrist commented 1 month ago

Going to close this for now. If you decide a con.to_duckdb_relation is something you'd use (given my comment above), please feel free to reopen.