ibis-project / ibis

the portable Python dataframe library
https://ibis-project.org
Apache License 2.0

feat: will PyArrow+pandas be made optional for backends? #10120

Closed MarcoGorelli closed 13 hours ago

MarcoGorelli commented 5 days ago

Is your feature request related to a problem?

As far as I understand, pandas+pyarrow are now optional for pip install ibis-framework, but still required for all backends

What is the motivation behind your request?

There was a request recently in Narwhals that I thought Ibis might be better suited for, but the poster responded with

"def don't want anything to do with pyarrow as a dependency 😁"

Describe the solution you'd like

Would you consider making PyArrow / pandas optional for backends?

What version of ibis are you running?

9.5.0

What backend(s) are you using, if any?

No response


lostmygithubaccount commented 5 days ago

asked over there for the rationale -- one of the engineers can weigh in but my understanding is it's still a good amount of work with fairly minimal benefit for users. the main reason cited in the past has been running in AWS Lambda and other FaaS, but you can very easily use PyArrow or other larger dependencies in those tools (i.e. I don't think this was ever a particularly valid reason, so would be great to understand this person's perspective)

MarcoGorelli commented 3 days ago

thanks for your response!

just for my understanding - supposing it were possible, would you be open to such a PR?

lostmygithubaccount commented 3 days ago

I personally don't see why we wouldn't. I think given infinite time and resources, this is definitely something we would do -- as you note, Phillip already made it possible when installing without a backend. of course, we'd want to ensure no functionality is lost. it'd be good to have the engineers weigh in (we'll discuss this at some point this week and can respond back here if they don't already from the GH notifications)

kylebarron commented 2 days ago

FWIW I'm also interested in using ibis without requiring pyarrow as a dependency. I don't have anything against pyarrow personally, but it's a very big dependency to force on all users of a library (see the pandas v3 discussion) and with the Arrow PyCapsule Interface it's now a lot easier to use alternative, smaller Python Arrow implementations, like nanoarrow or my own.

If substrait is now maturing, then any backend that can consume substrait (e.g. at least DuckDB) could in theory remove the pyarrow dependency pretty easily?
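To illustrate the PyCapsule Interface point, here is a minimal sketch of how a consumer could detect Arrow-compatible objects without importing pyarrow at all. The three dunder names come from the Arrow PyCapsule Interface specification; `arrow_protocols` and `FakeTable` are hypothetical names made up for this example, and none of this is Ibis's actual code.

```python
# Dunder methods defined by the Arrow PyCapsule Interface.
ARROW_PYCAPSULE_METHODS = (
    "__arrow_c_schema__",
    "__arrow_c_array__",
    "__arrow_c_stream__",
)


def arrow_protocols(obj):
    """Return the PyCapsule Interface methods that obj exposes, in the order listed above."""
    return [
        name
        for name in ARROW_PYCAPSULE_METHODS
        if callable(getattr(obj, name, None))
    ]


class FakeTable:
    """Hypothetical stand-in for a table from any Arrow implementation."""

    def __arrow_c_stream__(self, requested_schema=None):
        # A real implementation would return a PyCapsule wrapping an ArrowArrayStream.
        raise NotImplementedError("illustration only")


print(arrow_protocols(FakeTable()))
print(arrow_protocols([1, 2, 3]))
```

Because the check is plain duck typing, the consuming library never needs to know which Arrow implementation produced the object.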

lostmygithubaccount commented 2 days ago

"If substrait is now maturing, then any backend that can consume substrait (e.g. at least DuckDB) could in theory remove the pyarrow dependency pretty easily?"

I don't think these things are related -- the long-term vision is Substrait as an intermediate representation (and Ibis can already produce Substrait plans), but I wouldn't expect Ibis to "switch" anytime soon for a bunch of technical/data system adoption reasons (e.g. DuckDB's Substrait consumption tends to be far buggier than its SQL)

not that it's hard to find but link to the pandas discussion for context: https://github.com/pandas-dev/pandas/issues/57073

nanoarrow (or arro3) does seem like an interesting option but we're beyond my technical depth 😄

gforsyth commented 13 hours ago

I'm closing this out in favor of #10166 -- TL;DR: we're interested in making sure that our usage of pandas and pyarrow is cleanly separable from other backend functionality, but we aren't (in the short term) going to remove pyarrow as a dependency because we don't want new users to have to install multiple extras to have a functional ibis installation.
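
For readers wondering what "cleanly separable" could look like in practice, one common pattern is to import a heavy dependency lazily and fail with an actionable message when it is missing. The sketch below is only an illustration of that general pattern, not the approach taken in #10166; `optional_import` and the `extra` label are hypothetical, and it assumes a top-level module name.

```python
import importlib
import importlib.util


def optional_import(module_name, extra):
    """Import a top-level module on demand, or explain how to get it.

    `extra` is a hypothetical label for the optional-dependency group
    that would provide the module.
    """
    # find_spec returns None when a top-level module is not installed.
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"{module_name!r} is required for this feature; "
            f"install the {extra!r} extra to enable it"
        )
    return importlib.import_module(module_name)


# Usage: the import cost is only paid by code paths that need the module.
json_module = optional_import("json", extra="demo")
print(json_module.dumps({"ok": True}))
```

Calling the helper inside the functions that need the dependency, rather than at module import time, keeps the rest of the package usable without it.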