ibis-project / ibis

the portable Python dataframe library
https://ibis-project.org

perf: Ibis slows down considerably for wider tables #9111

Open · binste opened 2 weeks ago

binste commented 2 weeks ago

What happened?

In a web application, I'm creating rather complicated Ibis expressions in the backend. Executing the final expression takes < 1 second, but creating the expression itself took up to 30 seconds. It took me a while to figure out what was going on, but I think it's because Ibis gets considerably slower for wider tables. I don't need all columns in my app, so I was able to improve performance by pruning away the columns I don't need right from the beginning.

However, in case you see any potential for improvement, below is some example code to demonstrate it. If this is just inherent to Ibis, what do you think about a note in the documentation? I only found #6832, which is somewhat related.
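To illustrate the workaround, here is a minimal sketch of pruning up front; the column names are made up, and only a handful of the 200 columns are assumed to be needed:

import ibis

wide = ibis.table(name="base1", schema={f"col{i}": "int64" for i in range(200)})

# keep only what the app actually needs, so every later
# drop/relocate/mutate operates on a narrow table instead of 200 columns
needed = ["col0", "col1", "col2"]
narrow = wide.select(*needed)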

Setup

import ibis
import ibis.selectors as s
from ibis import deferred as d_

t = ibis.table(name="base1", schema={f"col{i}": "int64" for i in range(200)})

Drop and relocate

In more complex expressions, multiple of these drop statements can quickly add up:

t1 = t.drop("col0", "col1", "col2")  # 500ms

Same for relocate:

t1 = t.relocate("col100", before="col2")  # 1,700ms -> 1.7 seconds

select and mutate do not have this issue:

t1 = t.select("col0", "col1", "col2")  # 3ms
t1 = t.mutate(colnew=ibis.literal(1))  # 20ms
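Since select is cheap here, one possible workaround is to express drop and relocate as selects. A sketch (not verified to help in all cases; how select scales to ~200 columns may differ from the 3-column timing above):

# same result as t.drop("col0", "col1", "col2"), via the fast select path
keep = [c for c in t.columns if c not in {"col0", "col1", "col2"}]
t1 = t.select(*keep)

# same result as t.relocate("col100", before="col2")
cols = [c for c in t.columns if c != "col100"]
i = cols.index("col2")
t1 = t.select(*cols[:i], "col100", *cols[i:])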

Selectors

I've also noticed that Ibis selectors can be much slower than using a pure Python implementation:

t.mutate(s.across(s.c(*[f"col{i}" for i in range(20)]), d_.cast("str")))  # 580ms
t.mutate(**{f"col{i}": d_[f"col{i}"] for i in range(20)})  # 70ms
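Note that the pure-Python variant above doesn't apply the cast, so it isn't a perfectly fair comparison; a closer equivalent to the across version would be something like:

t.mutate(**{f"col{i}": d_[f"col{i}"].cast("string") for i in range(20)})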

What version of ibis are you using?

9.0.0

What backend(s) are you using, if any?

No response

Relevant log output

No response


binste commented 2 weeks ago

I've benchmarked Ibis 8 vs. 9 on 55 Ibis expressions which are part of a data warehouse built using https://github.com/binste/dbt-ibis. I've measured two phases: the execution of the Ibis code that builds each expression, and the conversion of the final expression to SQL.

Here is some pseudo code to illustrate:

import ibis

t = ibis.table(...)

# --------------------------
# START: Execution of Ibis code
# --------------------------
t = t.group_by(...).select(...)

# --------------------------
# END: Execution of Ibis code
# --------------------------

# --------------------------
# START: Convert Ibis expression to SQL
# --------------------------
ibis.to_sql(t)
# --------------------------
# END: Convert Ibis expression to SQL
# --------------------------
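For reference, a concrete version of that harness might look like this (the table and query are placeholders; ibis.to_sql is given an explicit dialect so no backend connection is needed):

import time
import ibis

t = ibis.table(name="t", schema={f"col{i}": "int64" for i in range(50)})

start = time.perf_counter()
expr = t.group_by("col0").aggregate(total=t.col1.sum())  # build the expression
build_secs = time.perf_counter() - start

start = time.perf_counter()
sql = ibis.to_sql(expr, dialect="duckdb")  # compile the expression to SQL
compile_secs = time.perf_counter() - start

print(f"build: {build_secs:.4f}s | compile: {compile_secs:.4f}s")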

(Charts: Ibis 8 vs. 9 timings across the 55 expressions, for executing the Ibis code and for converting the expressions to SQL.)

The great news is that the compilation to SQL got significantly faster with the move to SQLGlot, which is super nice! :) The execution of Ibis code, on the other hand, got a bit slower, with one expression taking significantly longer at 11 seconds. I've profiled that expression; most of the time is spent in a few of its statements.

Hope this helps!

cpcloud commented 2 weeks ago

@binste Thanks for digging into this. Interesting results!

Can you share the query that's now taking 11 seconds with Ibis 9.0?

cpcloud commented 2 weeks ago

Regarding drop and relocate: they're both implemented using a somewhat naive approach that scales with the total number of columns.

I suspect that for drop we can turn it into a special operation that scales with the number of dropped columns instead of the total number of columns. A place to start might be:

  1. create a special Drop relational operation
  2. implement the column drop in select merging by looping over the drop list and removing each element from the Select fields. Select fields are dicts, so that should get us to O(droplist) instead of O(allcolumns); see the sketch after this list.
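
A hypothetical sketch of step 2; none of these names are the real internal API, they just illustrate the complexity argument:

def merge_drop_into_select(select_fields: dict, drop_list: list[str]) -> dict:
    # copy so the original (immutable) Select node is untouched
    fields = dict(select_fields)
    for name in drop_list:
        del fields[name]  # O(1) per dropped column -> O(len(drop_list)) total
    return fields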

We may be able to take a similar approach for relocate by using a data structure more optimized for the operation. I think something that has fast ordered inserts (perhaps sortedcontainers has a sorted map container?) might be a good place to start.
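For illustration, sortedcontainers does ship a SortedDict; one hypothetical way to get cheap relocates is to key columns by a numeric rank and move a column by giving it a rank between its new neighbors:

from sortedcontainers import SortedDict

# columns keyed by a float rank; iteration order follows the ranks
order = SortedDict((float(i), f"col{i}") for i in range(5))

# relocate col4 to just before col2: one delete plus one O(log n) insert,
# instead of rebuilding the whole column list
del order[4.0]
order[1.5] = "col4"  # any rank strictly between col1 (1.0) and col2 (2.0)

print(list(order.values()))  # ['col0', 'col1', 'col4', 'col2', 'col3']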

binste commented 2 weeks ago

Unfortunately, I can't, as I'd need to mask column names and code logic for IP reasons. It's two input tables with around 50 columns each, one table with ~10 columns, and then various operations on top of them. But I'm happy to test out any PRs if a wheel file is available!

gforsyth commented 2 weeks ago

A naive benchmark, but as a quick test:

# drop_test.py
import ibis
import time
from contextlib import contextmanager

@contextmanager
def tictoc(num_cols):
    tic = time.time()
    yield
    toc = time.time()
    print(f"| {num_cols} | {toc - tic} seconds")

print(f"{ibis.__version__=}")

for num_cols in [10, 20, 50, 100, 200, 500, 1000]:
    t = ibis.table(name="t", schema=[(f"a{i}", "int") for i in range(num_cols)])

    with tictoc(num_cols):
        t.drop("a8")
🐚 python drop_test.py
ibis.__version__='8.0.0'
| 10 | 0.18016910552978516 seconds
| 20 | 0.0016529560089111328 seconds
| 50 | 0.0037081241607666016 seconds
| 100 | 0.007429361343383789 seconds
| 200 | 0.013902902603149414 seconds
| 500 | 0.03390693664550781 seconds
| 1000 | 0.06650948524475098 seconds
🐚 python drop_test.py
ibis.__version__='9.0.0'
| 10 | 0.002298593521118164 seconds
| 20 | 0.005956888198852539 seconds
| 50 | 0.027918338775634766 seconds
| 100 | 0.09690213203430176 seconds
| 200 | 0.36721324920654297 seconds
| 500 | 2.301510810852051 seconds
| 1000 | 9.317416191101074 seconds