Open jitingxu1 opened 2 months ago
Can you separate out the time it takes to execute from the time it takes to construct the expression? Timing:
# how long the expression takes to construct
expr = t.describe()
# how long it takes to compile
ibis.to_sql(expr)
# and how long it takes to execute (time for below, minus time for compile above)
expr.execute()
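A rough sketch of what I mean, assuming t is the table in question (timing with time.perf_counter is just for illustration):
import time

import ibis

# assumes `t` is the table expression being profiled
start = time.perf_counter()
expr = t.describe()  # expression construction
print("construct:", time.perf_counter() - start)

start = time.perf_counter()
sql = ibis.to_sql(expr)  # compilation to SQL
print("compile:", time.perf_counter() - start)

start = time.perf_counter()
df = expr.execute()  # compile + execute; subtract the compile time above
print("execute:", time.perf_counter() - start)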
import ibis
import pandas as pd
import numpy as np
num_rows = 1500000
num_cols = 450
data = np.random.rand(num_rows, num_cols)
columns = [f'col_{i}' for i in range(num_cols)]
df = pd.DataFrame(data, columns=columns)
memtable = ibis.memtable(df)
I tested table.info() for this memtable with 1.5m rows and 450 cols; here are the compilation and execution times:
Compile time: 5.6186909675598145 seconds
Compile + running time: 109.90575790405273 seconds
The running time above is acceptable for this size.
The issue I hit in the Kaggle competition: I ran df_train.info(), but df_train was generated by joining multiple tables, and each table underwent operations such as Cast and agg. With the additional transformations, the final query behind df_train.info() became very large. The execution failed with this error:
File /opt/conda/lib/python3.10/site-packages/ibis/backends/duckdb/__init__.py:1409, in Backend.execute(self, expr, params, limit, **_)
1406 import pandas as pd
1407 import pyarrow.types as pat
-> 1409 table = self._to_duckdb_relation(expr, params=params, limit=limit).arrow()
1411 df = pd.DataFrame(
1412 {
1413 name: (
(...)
1425 }
1426 )
1427 df = DuckDBPandasData.convert_table(df, expr.as_table().schema())
OutOfMemoryException: Out of Memory Error: failed to offload data block of size 256.0 KiB (17.5 GiB/17.5 GiB used).
This limit was set by the 'max_temp_directory_size' setting.
By default, this setting utilizes the available disk space on the drive where the 'temp_directory' is located.
You can adjust this setting, by using (for example) PRAGMA max_temp_directory_size='10GiB'
I set the DuckDB configuration:
con.raw_sql("PRAGMA max_temp_directory_size='50GiB';")
con.raw_sql("PRAGMA max_memory='30GiB';")
but it did not help.
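(As an aside, I believe the same limits can also be set when creating the connection; a rough sketch below, where forwarding memory_limit and temp_directory as connect-time settings is an assumption on my part, and it likely would not change the outcome here.)
import ibis

# sketch only: pass DuckDB settings at connect time instead of via PRAGMA;
# the keyword names (memory_limit, temp_directory) are assumed to be
# forwarded to DuckDB as configuration options
con = ibis.duckdb.connect(
    memory_limit="30GiB",
    temp_directory="/tmp/duckdb_spill",
)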
Looking into this, a large source of the Ibis overhead here is the 450 relations we are constructing, each of which must go through binding and dereferencing, neither of which is particularly cheap.
I poked around DuckDB's SUMMARIZE command, and looking at EXPLAIN SUMMARIZE t I saw a curious-looking UNNEST on an ungrouped aggregate, which reminded me that for pivot_longer we collect (in the generic sense, unrelated to execution) the aggregates into an array expression and then do a joint unnest on all the arrays we've constructed.
I tried a similar approach here, and after adding some benchmarks I'm seeing about an 11x (!) speed-up in expression construction:
----------------------------------------------------------------------------- benchmark 'test_summarize_construct[describe]': 2 tests ------------------------------------------------------------------------------
Name (time in s) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_summarize_construct[describe] (0001_fda484c) 12.6991 (11.46) 12.8223 (10.70) 12.7558 (11.14) 0.0453 (1.0) 12.7573 (11.46) 0.0546 (1.0) 2;0 0.0784 (0.09) 5 1
test_summarize_construct[describe] (NOW) 1.1084 (1.0) 1.1988 (1.0) 1.1450 (1.0) 0.0467 (1.03) 1.1130 (1.0) 0.0841 (1.54) 2;0 0.8734 (1.0) 5 1
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
The caveat here is that not all backends support arrays, so we won't be able to see this improvement in every backend.
For the curious, here's the DuckDB plan where you can see the UNNEST:
D create table penguins as from read_csv('/data/penguins.csv');
D explain summarize (select year from penguins);
┌─────────────────────────────┐
│┌───────────────────────────┐│
││ Physical Plan ││
│└───────────────────────────┘│
└─────────────────────────────┘
┌───────────────────────────┐
│ PROJECTION │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ column_name │
│ column_type │
│ min │
│ max │
│ approx_unique │
│ avg │
│ std │
│ q25 │
│ q50 │
│ q75 │
│ count │
│ null_percentage │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ UNNEST │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ UNGROUPED_AGGREGATE │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ min(#0) │
│ max(#1) │
│ approx_count_distinct(#2) │
│ avg(#3) │
│ stddev(#4) │
│ approx_quantile(#5) │
│ approx_quantile(#6) │
│ approx_quantile(#7) │
│ count_star() │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ PROJECTION │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ year │
│ year │
│ year │
│ year │
│ CAST(year AS DOUBLE) │
│ year │
│ year │
│ year │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ SEQ_SCAN │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ penguins │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ year │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ EC: 344 │
└───────────────────────────┘
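To make the shape of that rewrite concrete, here's a rough sketch (not the actual internal implementation) contrasting the one-relation-per-column union with packing the per-column aggregates into lists and unnesting them over a single ungrouped aggregate, using the DuckDB Python API and made-up column names:
import duckdb

con = duckdb.connect()
con.sql("CREATE TABLE t AS SELECT random() AS a, random() AS b FROM range(1000)")

# one aggregation relation per column, unioned together
union_sql = """
SELECT 'a' AS name, min(a) AS min, max(a) AS max FROM t
UNION ALL
SELECT 'b' AS name, min(b) AS min, max(b) AS max FROM t
"""

# a single ungrouped aggregate whose per-column stats are packed into lists
# and unnested jointly, mirroring the UNNEST-over-UNGROUPED_AGGREGATE plan above
unnest_sql = """
SELECT unnest(['a', 'b']) AS name,
       unnest([min(a), min(b)]) AS min,
       unnest([max(a), max(b)]) AS max
FROM t
"""

print(con.sql(union_sql))
print(con.sql(unnest_sql))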
A follow-up to #9684 might actually add a proper Summarize operation in Ibis that, at least in the DuckDB case, would have an extremely small constant overhead to construct the expression.
Even before that, we could probably move this entire thing to compilation time, which would make this much cheaper for both the union case and the array case ... I'll look into it.
Hi @cpcloud, thank you so much for your prompt check and fix.
Is your feature request related to a problem?
We have table.info() and table.describe() for Ibis tables. These functions loop over each column, perform multiple aggregations, and form the output by unioning the stats for every column; they are often used for univariate analysis by data scientists. When I tried these two functions on a bigger dataset (465 columns, 1.5m rows), I ran into two issues:
1. table.info() mainly calculates the sum and mean of null rows, but it is very slow. Is it very expensive to calculate this on DuckDB?
2. table.describe() throws an Out of Memory exception; it generates 465 small tables for the union, and I'm not sure why that takes so much memory.
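For reference, a small sketch showing how the generated SQL grows with the number of columns (the table here is illustrative, not the Kaggle data):
import ibis
import numpy as np
import pandas as pd

# small stand-in table; the real case has 465 columns and 1.5m rows
df = pd.DataFrame(np.random.rand(100, 5), columns=[f"col_{i}" for i in range(5)])
t = ibis.memtable(df)

expr = t.describe()
# the compiled SQL contains one aggregation block per column, all unioned,
# so it grows with the column count
print(ibis.to_sql(expr))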
What is the motivation behind your request?
These two functions are very useful for univariate analysis.
Describe the solution you'd like
Could we do some batch or parallel computing?
What version of ibis are you running?
9.1.0
What backend(s) are you using, if any?
DuckDB