Open jitingxu1 opened 2 months ago
Can you separate out the time it takes to execute from the time it takes to construct the expression? Timing:
# how long the expression takes to construct
expr = t.describe()
# how long it takes to compile
ibis.to_sql(expr)
# and how long it takes to execute (time for below, minus time for compile above)
expr.execute()
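A rough sketch of what I mean, assuming t is the table in question (timing with time.perf_counter is just for illustration):
import time

import ibis

# assumes `t` is the table expression being profiled
start = time.perf_counter()
expr = t.describe()  # expression construction
print("construct:", time.perf_counter() - start)

start = time.perf_counter()
sql = ibis.to_sql(expr)  # compilation to SQL
print("compile:", time.perf_counter() - start)

start = time.perf_counter()
df = expr.execute()  # compile + execute; subtract the compile time above
print("execute:", time.perf_counter() - start)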
import ibis
import pandas as pd
import numpy as np
num_rows = 1500000
num_cols = 450
data = np.random.rand(num_rows, num_cols)
columns = [f'col_{i}' for i in range(num_cols)]
df = pd.DataFrame(data, columns=columns)
memtable = ibis.memtable(df)
I tested table.info() for this memtable with 1.5m rows and 450 cols; here are the compilation and execution times:
Compile time: 5.6186909675598145 seconds
Compile + running time: 109.90575790405273 seconds
The running time above is acceptable for this size.
The issue I hit in the Kaggle competition: I ran df_train.info(), but df_train was generated by joining multiple tables, and each table underwent operations such as Cast and agg. With the additional transformations, the final query behind df_train.info() became very large. The execution failed with this error:
File /opt/conda/lib/python3.10/site-packages/ibis/backends/duckdb/__init__.py:1409, in Backend.execute(self, expr, params, limit, **_)
1406 import pandas as pd
1407 import pyarrow.types as pat
-> 1409 table = self._to_duckdb_relation(expr, params=params, limit=limit).arrow()
1411 df = pd.DataFrame(
1412 {
1413 name: (
(...)
1425 }
1426 )
1427 df = DuckDBPandasData.convert_table(df, expr.as_table().schema())
OutOfMemoryException: Out of Memory Error: failed to offload data block of size 256.0 KiB (17.5 GiB/17.5 GiB used).
This limit was set by the 'max_temp_directory_size' setting.
By default, this setting utilizes the available disk space on the drive where the 'temp_directory' is located.
You can adjust this setting, by using (for example) PRAGMA max_temp_directory_size='10GiB'
I set the DuckDB configuration:
con.raw_sql("PRAGMA max_temp_directory_size='50GiB';")
con.raw_sql("PRAGMA max_memory='30GiB';")
but it did not help.
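(As an aside, I believe the same limits can also be set when creating the connection; a rough sketch below, where forwarding memory_limit and temp_directory as connect-time settings is an assumption on my part, and it likely would not change the outcome here.)
import ibis

# sketch only: pass DuckDB settings at connect time instead of via PRAGMA;
# the keyword names (memory_limit, temp_directory) are assumed to be
# forwarded to DuckDB as configuration options
con = ibis.duckdb.connect(
    memory_limit="30GiB",
    temp_directory="/tmp/duckdb_spill",
)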
Looking into this, a large source of the Ibis overhead here is the 450 relations we are constructing, each of which must go through binding and dereferencing, neither of which is particularly cheap.
I poked around DuckDB's SUMMARIZE command, and looking at EXPLAIN SUMMARIZE t I saw a curious-looking UNNEST on an ungrouped aggregate, which reminded me that for pivot_longer we collect (in the generic sense, unrelated to execution) the aggregates into an array expression and then do a joint unnest on all the arrays we've constructed.
I tried a similar approach here, and after adding some benchmarks I'm seeing about an 11x (!) speed-up in expression construction:
----------------------------------------------------------------------------- benchmark 'test_summarize_construct[describe]': 2 tests ------------------------------------------------------------------------------
Name (time in s) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_summarize_construct[describe] (0001_fda484c) 12.6991 (11.46) 12.8223 (10.70) 12.7558 (11.14) 0.0453 (1.0) 12.7573 (11.46) 0.0546 (1.0) 2;0 0.0784 (0.09) 5 1
test_summarize_construct[describe] (NOW) 1.1084 (1.0) 1.1988 (1.0) 1.1450 (1.0) 0.0467 (1.03) 1.1130 (1.0) 0.0841 (1.54) 2;0 0.8734 (1.0) 5 1
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
The caveat here is that not all backends support arrays, so we won't be able to see this improvement in every backend.
For the curious, here's the DuckDB plan where you can see the UNNEST:
D create table penguins as from read_csv('/data/penguins.csv');
D explain summarize (select year from penguins);
┌─────────────────────────────┐
│┌───────────────────────────┐│
││ Physical Plan ││
│└───────────────────────────┘│
└─────────────────────────────┘
┌───────────────────────────┐
│ PROJECTION │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ column_name │
│ column_type │
│ min │
│ max │
│ approx_unique │
│ avg │
│ std │
│ q25 │
│ q50 │
│ q75 │
│ count │
│ null_percentage │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ UNNEST │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ UNGROUPED_AGGREGATE │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ min(#0) │
│ max(#1) │
│ approx_count_distinct(#2) │
│ avg(#3) │
│ stddev(#4) │
│ approx_quantile(#5) │
│ approx_quantile(#6) │
│ approx_quantile(#7) │
│ count_star() │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ PROJECTION │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ year │
│ year │
│ year │
│ year │
│ CAST(year AS DOUBLE) │
│ year │
│ year │
│ year │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ SEQ_SCAN │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ penguins │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ year │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ EC: 344 │
└───────────────────────────┘
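To make the shape of that rewrite concrete, here's a rough sketch (not the actual internal implementation) contrasting the one-relation-per-column union with packing the per-column aggregates into lists and unnesting them over a single ungrouped aggregate, using the DuckDB Python API and made-up column names:
import duckdb

con = duckdb.connect()
con.sql("CREATE TABLE t AS SELECT random() AS a, random() AS b FROM range(1000)")

# one aggregation relation per column, unioned together
union_sql = """
SELECT 'a' AS name, min(a) AS min, max(a) AS max FROM t
UNION ALL
SELECT 'b' AS name, min(b) AS min, max(b) AS max FROM t
"""

# a single ungrouped aggregate whose per-column stats are packed into lists
# and unnested jointly, mirroring the UNNEST-over-UNGROUPED_AGGREGATE plan above
unnest_sql = """
SELECT unnest(['a', 'b']) AS name,
       unnest([min(a), min(b)]) AS min,
       unnest([max(a), max(b)]) AS max
FROM t
"""

print(con.sql(union_sql))
print(con.sql(unnest_sql))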
A follow-up to #9684 might actually add a proper Summarize operation in Ibis that, at least in the DuckDB case, would have an extremely small constant overhead to construct the expression.
Even before that, we could probably move this entire thing to compilation time, which would make this much cheaper for both the union case and the array case ... I'll look into it.
Hi @cpcloud, thank you so much for your prompt check and fix.
Is your feature request related to a problem?
We have table.info() and table.describe() for Ibis tables. These functions loop over each column, perform multiple aggregations, and form the output by unioning the stats for every column; they are often used for univariate analysis by data scientists. When I tried these two functions on a bigger dataset (465 columns, 1.5m rows), I ran into two issues:
1. table.info() mainly calculates the sum and mean of null rows, but it is very slow. Is it very expensive to calculate this on DuckDB?
2. table.describe() throws an Out of Memory exception; it generates 465 small tables for the union, and I'm not sure why that takes so much memory.
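For reference, a small sketch showing how the generated SQL grows with the number of columns (the table here is illustrative, not the Kaggle data):
import ibis
import numpy as np
import pandas as pd

# small stand-in table; the real case has 465 columns and 1.5m rows
df = pd.DataFrame(np.random.rand(100, 5), columns=[f"col_{i}" for i in range(5)])
t = ibis.memtable(df)

expr = t.describe()
# the compiled SQL contains one aggregation block per column, all unioned,
# so it grows with the column count
print(ibis.to_sql(expr))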
What is the motivation behind your request?
These two functions are very useful for univariate analysis.
Describe the solution you'd like
Could we do some batch or parallel computing?
What version of ibis are you running?
9.1.0
What backend(s) are you using, if any?
DuckDB