ibis-project / ibis

the portable Python dataframe library
https://ibis-project.org
Apache License 2.0
4.53k stars, 552 forks

feat: expression level definition of CSV and Parquet data sources #6898

Open kszucs opened 10 months ago

kszucs commented 10 months ago

Is your feature request related to a problem?

I'd like to be able to build up an expression based on local files and easily execute it on multiple backends:

t = ibis.read_parquet("...path to parquet file...", name="t")

expr = ...  # build up some expression depending on the source parquet file

# execute it using DuckDB
ibis.duckdb.execute(expr)

# decide to use a different backend later on
ibis.polars.execute(expr)

Describe the solution you'd like

The expression system should contain specific operations describing various data sources. This way, Ibis expressions can be built in a backend-agnostic manner.
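As a rough illustration of the idea (not Ibis's actual internals), a data source can be modelled as a plain, immutable node that records only the file path and table name, with each backend registering its own way of turning that node into a read. Every name below (`ParquetSource`, `CsvSource`, `register`, `resolve`) is hypothetical, and the loaders return strings rather than calling DuckDB or Polars so that the sketch stays self-contained.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParquetSource:
    """Hypothetical backend-agnostic node describing a Parquet file."""
    path: str
    name: str


@dataclass(frozen=True)
class CsvSource:
    """Hypothetical backend-agnostic node describing a CSV file."""
    path: str
    name: str


# Registry mapping backend name -> {node type -> loader function}.
_LOADERS = {}


def register(backend, node_type):
    """Register a loader for (backend, node_type). Illustrative only."""
    def decorator(fn):
        _LOADERS.setdefault(backend, {})[node_type] = fn
        return fn
    return decorator


@register("duckdb", ParquetSource)
def _duckdb_parquet(node):
    # A real backend would run this SQL; here we only build the text.
    return f"SELECT * FROM read_parquet('{node.path}')"


@register("polars", ParquetSource)
def _polars_parquet(node):
    # A real backend would call polars directly; here we only show the call.
    return f"pl.scan_parquet({node.path!r})"


def resolve(backend, node):
    """Look up how `backend` would read the source described by `node`."""
    try:
        loader = _LOADERS[backend][type(node)]
    except KeyError:
        raise NotImplementedError(
            f"{backend!r} has no loader for {type(node).__name__}"
        ) from None
    return loader(node)


source = ParquetSource(path="data.parquet", name="t")
print(resolve("duckdb", source))
print(resolve("polars", source))
```

The point of the design is that the node itself carries no backend state, so the same expression tree could be handed to any backend that knows how to resolve its source nodes.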

What version of ibis are you running?

main branch

What backend(s) are you using, if any?

DuckDB, Polars, DataFusion, Pandas

lostmygithubaccount commented 10 months ago

what's your use case for this out of curiosity?

cpcloud commented 10 months ago

The primary technical challenge here is finding a fast and robust schema inference solution for file formats that aren't self-describing, such as CSV and JSON.

The current landscape is a bit dicey:

CSV

JSON
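To make the trade-off between speed and robustness concrete, here is a minimal sketch of sample-based CSV schema inference using only the Python standard library. It is not how any of the backends above infer schemas; the function names, the type labels, and the widening order (`null` < `int64` < `float64` < `string`) are assumptions chosen for illustration.

```python
import csv
import io
from itertools import islice

# Assumed widening order: a column's type only ever moves to the right.
_RANK = {"null": 0, "int64": 1, "float64": 2, "string": 3}


def _infer_value_type(value):
    """Guess the narrowest type label that can hold a single CSV field."""
    if value == "":
        return "null"
    try:
        int(value)
        return "int64"
    except ValueError:
        pass
    try:
        float(value)
        return "float64"
    except ValueError:
        return "string"


def infer_csv_schema(text, sample_rows=100):
    """Infer {column: type} from the header and the first `sample_rows` rows."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if header is None:
        return {}
    types = ["null"] * len(header)
    for row in islice(reader, sample_rows):
        # Ignore extra fields beyond the header width.
        for i, value in enumerate(row[: len(header)]):
            guess = _infer_value_type(value)
            if _RANK[guess] > _RANK[types[i]]:
                types[i] = guess
    return dict(zip(header, types))


data = "id,price,label\n1,2.5,a\n2,3,b\n"
print(infer_csv_schema(data))

# Sampling is fast but fragile: a late non-numeric value is missed.
late_surprise = "id\n1\nabc\n"
print(infer_csv_schema(late_surprise, sample_rows=1))
```

Reading only a sample keeps inference cheap, but the second call shows the weakness: a value outside the sample can contradict the inferred type, which is why a robust solution needs either a full scan or a fallback when execution hits an unexpected value.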