ibis-project / ibis

the portable Python dataframe library
https://ibis-project.org
Apache License 2.0
4.42k stars 544 forks source link

feat(ux): add duckdb as the default backend #4489

Closed cpcloud closed 1 year ago

cpcloud commented 1 year ago

This PR adds DuckDB as the default backend to use for things that don't have a backend.

Right now, those things are:

  1. ops.Literals
  2. ops.InMemoryTables

This is useful for doing analysis on in-memory things without having to think too hard about how to do it. The workflow is one call to ibis.memtable on the object and you're off to the races.

github-actions[bot] commented 1 year ago

Test Results

       6 files         6 suites   4m 15s :stopwatch: 3 196 tests 3 182 :heavy_check_mark: 14 :zzz: 0 :x: 9 588 runs  9 546 :heavy_check_mark: 42 :zzz: 0 :x:

Results for commit 7fc41465.

:recycle: This comment has been updated with latest results.

cpcloud commented 1 year ago

I'll change the default_backend_func to be ibis.config._default_backend_func

codecov[bot] commented 1 year ago

Codecov Report

Merging #4489 (b59da15) into master (0242d1a) will increase coverage by 0.00%. The diff coverage is 78.78%.

:exclamation: Current head b59da15 differs from pull request most recent head 7fc4146. Consider uploading reports for the commit 7fc4146 to get more accurate results

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #4489   +/-   ##
=======================================
  Coverage   92.67%   92.68%           
=======================================
  Files         181      181           
  Lines       20885    20907   +22     
  Branches     2983     2987    +4     
=======================================
+ Hits        19356    19377   +21     
- Misses       1151     1152    +1     
  Partials      378      378           
Impacted Files Coverage Δ
ibis/config.py 73.33% <53.33%> (-4.00%) :arrow_down:
ibis/backends/dask/core.py 94.73% <100.00%> (ø)
ibis/backends/pandas/core.py 96.33% <100.00%> (ø)
ibis/expr/types/core.py 94.30% <100.00%> (+2.06%) :arrow_up:
ibis/backends/impala/__init__.py 86.14% <0.00%> (+0.21%) :arrow_up:
ibis/common/pretty.py 100.00% <0.00%> (+9.67%) :arrow_up:
cpcloud commented 1 year ago

does this not require an additional import?

Can you clarify where you think an additional import is required?

Also, can you avoid using the "Request Changes" functionality to ask questions and reserve it only for requesting specific changes?

jreback commented 1 year ago

this makes duckdb a required import now

i think that is a major breaking change

kszucs commented 1 year ago

Could we fall back to pandas if duckdb is not available? Pandas should be a mandatory dependency, at least for now.

jreback commented 1 year ago

falling back to pandas seems reasonable though their is a failed import cost constantly if that happens which may not be trivial

jreback commented 1 year ago

not averse to in the future having duckdb as a default

but that needs to be a separate and deliberate decision

cpcloud commented 1 year ago

this makes duckdb a required import now

i think that is a major breaking change

DuckDB is not a required import and so this isn't a breaking change.

Did you read the code? Did you notice that there are no changes to pyproject.toml? Can you demonstrate a scenario where this is a required import? Most of the CI jobs do not install duckdb, and yet somehow they continue to pass!

but that needs to be a separate and deliberate decision

The PR is the deliberate decision.

If you can demonstrate a specific verifiable problem then we can continue the discussion. Otherwise, this PR is ready to merge.

jreback commented 1 year ago

The PR is the deliberate decision.

where exactly is this discussion?

this is such a huge change that it needs discussion

and would be -1 at this time

jreback commented 1 year ago

further i think you need to prove that this isn't a major performance regression in thr lack of import

oh and since when do we have a default backend?

cpcloud commented 1 year ago

The PR is the deliberate decision.

where exactly is this discussion?

There's no requirement to have a discussion before putting up a PR, that would be ridiculously prohibitive to anyone making changes.

If anyone has actionable objections they are welcome to describe them and comment on the PR.

this is such a huge change that it needs discussion

and would be -1 at this time

You still haven't demonstrated what the problem is. Again, there is no requirement to have duckdb installed. Can you describe a scenario where the import is required?

jreback commented 1 year ago

you are skipping the test if it's not installed

cpcloud commented 1 year ago

I've added tests that cover both cases, and also made it cheap to fail: the import failure overhead is incurred once per process. I've also made it cheap to succeed: only one in-memory connection for the default backend is ever created.

After everything's green, I'm going to merge the PR.

cpcloud commented 1 year ago

@jreback

In general, one person's -1 is not enough to tank a PR, especially if it's already been approved. If you want to start contributing again, then your input will carry more weight.

I know that you've contributed a lot in the past, but continued interaction in some other form besides blocking PRs is what carries weight.

This applies to anyone who wants to contribute.

Adding these tests and performance enhancements is a courtesy and I will not be doing things like this going forward.

jreback commented 1 year ago

@cpcloud that is not a great attitude at all

you are aware we heavily rely on ibis and are in collaboration with you and others

we deeply care about the breaking changes and is it important to telegraph these and not simply override eveyone

-1 but a core maintainer should trigger discussion otherwise this is not open source

cpcloud commented 1 year ago

you are aware we heavily rely on ibis and are in collaboration with you and others

I am very aware of that.

We have spent a lot of time and effort accommodating your requests to avoid going to 4.0.0 immediately after 3.1.0 to avoid a large swath of breaking changes. We have done so by splitting our development into two separate branches, causing a lot of additional unnecessary rebase work to be done.

We did that because we value users' input and decided it was in the best interest of the project to give you more time to catch up.

I think there's been plenty of good will from our side.

All I'm asking for is a larger quantity of and higher quality maintenance work from you, in the form of actions other than blocking reviews with questions that no reasonable person can understand. We aren't going to try and guess what your concerns are.

we deeply care about the breaking changes and is it important to telegraph these and not simply override eveyone

We also care about breaking changes otherwise we'd have released 4.0 a long time ago, but this isn't a breaking change, so there's not much to telegraph beyond a release note in the Features section of the release notes for 3.2.

-1 but a core maintainer should trigger discussion otherwise this is not open source

Discussion is highly encouraged. Blocking PRs with a single nondescript question is not.

cpcloud commented 1 year ago

To clarify the bit about there not being any breaking changes in this PR:

Code that previously failed

No backend was found and the expression isn't executable

This will fail with the same exception type, with a slightly different and more informative error message.

Code that previously found a backend

This will hit the same code path as it's hitting now and continue to work.

So what's actually new, then?

No backend found AND there are no UnboundTables in the expression

Within this scenario there are two cases to think about.

duckdb is NOT installed

There's exactly ONE attempt to import duckdb. If that fails, the import is never attempted again in the same scenario, and you'll get the same behavior (again, modulo the error message) as before. This means that the behavior without duckdb installed is the same as it was before, with the overhead of a single failed import.

duckdb IS installed

A single in-memory duckdb backend instance will be created and it will execute the expression. There are no expressions that are executable without a backend except those that are backed by ops.InMemoryTables and ops.Literals. Since ops.InMemoryTables are new in 3.2, nothing is broken, only added. ops.Literals successfully executing is a new feature that comes along with this. Again, this is all only true if duckdb is installed.

Please DO NOT skim this

Apologies for the SHOUTING BOLD CASE, but I really want to highlight the important bits that should not be skimmed over.

cpcloud commented 1 year ago

Alright, this has been approved and contains no breaking changes that we are aware of, with the rationale explicitly stated above, so I'm going to merge it.

We're happy to address fixes for this if they are needed in a subsequent release and we'll create patch releases for that from a branch.

After this is green in the master branch, I'll cut the 3.2.0 release 🎉

jreback commented 1 year ago

still beating a dead horse with my objection that this is semantically changing execution

we now can execute inbound tables implicitly - this breaks a tenant of ibis that execution must specify a backend

this is actually a slippery slope

who's to say that we should just always default to x engine and try to execute

i think this is bad user semantics and is too implicit with a very narrow good case (eg that is what u want)

the shouting is not helping anyone here and is distracting