narwhals-dev / narwhals

Lightweight and extensible compatibility layer between Polars, pandas, cuDF, Modin, and more!
https://narwhals-dev.github.io/narwhals/
MIT License
221 stars 32 forks

bench: implement TPC-H queries using Narwhals #282

Open MarcoGorelli opened 3 weeks ago

MarcoGorelli commented 3 weeks ago

In https://github.com/narwhals-dev/narwhals/tree/main/tpch/notebooks, we implement several TPC-H queries and run them via Narwhals.

We currently have the first 7 there, and should work on adding more.

To track progress:

You can find reference implementations for these queries here: https://github.com/pola-rs/tpch/tree/main/queries/polars

The task here is:

  1. Make an account on kaggle.com
  2. Choose one of the above queries (e.g. Q9)
  3. Make a fork of https://www.kaggle.com/code/marcogorelli/narwhals-tpch-q5-s2 (you should see a "copy and edit" button on Kaggle)
  4. Remove q5_pandas_native. We don't currently have reference implementations for those in https://github.com/pola-rs/tpch/tree/main/queries/pandas, so we skip them
  5. Replace def q5 with something like the query you find in https://github.com/pola-rs/tpch/tree/main/queries/polars . You will need to modify it a bit: check the current notebooks and try to follow how they do it. You'll also need to get it to work for Narwhals, which means:
    • using nw.col instead of pl.col
    • using nw.from_native at the start of the function for each input, and nw.to_native at the end

Once you have a notebook which executes, please share it here!


Note: if you see q5_pandas or q5_ibis, you can ignore them for the purposes of this issue; no need to copy them. Let's focus only on making sure that the Narwhals API is extensive enough for these queries

ugohuche commented 3 weeks ago

I'll implement Q9 and Q10

ugohuche commented 3 weeks ago

@MarcoGorelli I noticed that Narwhals doesn't yet have 'contains' in the ExprStringNamespace, which is used in query 9.

MarcoGorelli commented 3 weeks ago

ah, thanks @ugohuche ! looks like we've found the next issue :) would you like to work on implementing str.contains? If you look at str.ends_with, I think it might be kind of similar, feel free to ask if you get stuck

once that's done, we can get back to this

ugohuche commented 3 weeks ago

Yeah, I'd love to give it a try

FBruzzesi commented 3 weeks ago

Just for context: the idea of implementing these queries is both performance benchmarking and feature availability, meaning that if we are able to run all (or most) of them, that's a proxy for supporting a large enough feature set. Is this the case @MarcoGorelli?

MarcoGorelli commented 3 weeks ago

yes, exactly!

DeaMariaLeon commented 3 weeks ago

I'll take Q11

edit: Narwhals doesn't have join(how='cross') or round.

There is also a user warning:

    UserWarning: Found complex group-by expression, which can't be expressed efficiently with the pandas API. If you can, please rewrite your query such that group-by aggregations are simple (e.g. mean, std, min, max, ...).
      return agg_pandas(

I haven't tried to rewrite the query. I'm just reporting my findings up to now. πŸ˜‡

@MarcoGorelli @FBruzzesi

MarcoGorelli commented 2 weeks ago

thanks for reporting these! I think we can live with the userwarning for now and not worry about it, as long as it executes

fancy opening a separate issue about how='cross' and round? they definitely seem generically useful enough that we should implement them

FBruzzesi commented 2 weeks ago

Findings from Q21:

FBruzzesi commented 2 weeks ago

Findings from Q20

MarcoGorelli commented 2 weeks ago

> Maybe we can add examples of how "rewrite your query such that group-by aggregations are simple (e.g. mean, std, min, max, ...)" can be done

yeah, great idea!

> If using a date instead of a datetime object, the pandas backend ends up with an error when filtering on a datetime column. Polars has no issues

does this also happen for pyarrow-backed pandas?

FBruzzesi commented 2 weeks ago

> does this also happen for pyarrow-backed pandas?

pyarrow backend has no issues!
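For context, here is a plain-pandas sketch of what plausibly underlies the report: recent pandas versions refuse ordered comparisons between a datetime64 column and a datetime.date, while a datetime.datetime works (exact behaviour may vary by pandas version).

```python
from datetime import date, datetime

import pandas as pd

df = pd.DataFrame({"d": pd.to_datetime(["2020-01-01", "2021-06-01"])})

# Comparing against datetime.date may raise TypeError on recent pandas
try:
    df[df["d"] > date(2020, 6, 1)]
except TypeError as exc:
    print("date comparison failed:", exc)

# Comparing against datetime.datetime works fine
filtered = df[df["d"] > datetime(2020, 6, 1)]
```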

ugohuche commented 1 week ago

Hi @MarcoGorelli, query 13 uses a method "not_" which isn't implemented in Narwhals yet. Is it something we can add?

MarcoGorelli commented 1 week ago

I think we might as well just use ~ for that (e.g., instead of nw.col('a').not_(), use ~nw.col('a'))

ugohuche commented 1 week ago

> I think we might as well just use ~ for that (e.g., instead of nw.col('a').not_(), use ~nw.col('a'))

I'll use that instead. Should I open a new issue for the when.then.otherwise() methods?

MarcoGorelli commented 1 week ago

Thanks! I think there's already one open for that

ugohuche commented 1 week ago

Oh okay, I didn't see it

brentomagic commented 1 day ago

Hello guys, so I've been working on Q22. When I run this query:

    from typing import Any

    import narwhals as nw


    def q22(
        customer_ds_raw: Any,
        orders_ds_raw: Any,
    ) -> Any:
        customer_ds = nw.from_native(customer_ds_raw)
        orders_ds = nw.from_native(orders_ds_raw)

        q1 = (
            customer_ds.with_columns(nw.col("c_phone").str.slice(0, 2).alias("cntrycode"))
            .filter(nw.col("cntrycode").str.contains("13|31|23|29|30|18|17"))
            .select("c_acctbal", "c_custkey", "cntrycode")
        )

        q2 = q1.filter(nw.col("c_acctbal") > 0.0).select(
            nw.col("c_acctbal").mean().alias("avg_acctbal")
        )

        q3 = orders_ds.select(nw.col("o_custkey").unique()).with_columns(
            nw.col("o_custkey").alias("c_custkey")
        )

        result = (
            q1.join(q3, left_on="c_custkey", right_on="c_custkey")
            .filter(nw.col("o_custkey").is_null())
            .join(q2, how="cross")
            .filter(nw.col("c_acctbal") > nw.col("avg_acctbal"))
            .group_by("cntrycode")
            .agg(
                nw.col("c_acctbal").len().alias("numcust"),
                nw.col("c_acctbal").sum().round(2).alias("totacctbal"),
            )
            .sort("cntrycode")
        )

        return nw.to_native(result)

with:

    tool = 'pandas'
    fn = IO_FUNCS[tool]
    timings = %timeit -o q22(fn(customer), fn(orders))
    results[tool] = timings.all_runs

I get an error:

    AttributeError: 'Expr' object has no attribute 'count'

But when I change .count() to .len(), I get another error:

    C:\Users\Abano\anaconda3\Lib\site-packages\narwhals\_pandas_like\group_by.py:61: UserWarning: Found complex group-by expression, which can't be expressed efficiently with the pandas API. If you can, please rewrite your query such that group-by aggregations are simple (e.g. mean, std, min, max, ...).
      return agg_pandas(

    KeyError                                  Traceback (most recent call last)
    Cell In[156], line 3
          1 tool = 'pandas'
          2 fn = IO_FUNCS[tool]
    ----> 3 timings = get_ipython().run_line_magic('timeit', '-o q22(fn(customer), fn(orders))')
          4 results[tool] = timings.all_runs

    File ~\anaconda3\Lib\site-packages\IPython\core\interactiveshell.py:2456, in InteractiveShell.run_line_magic(self, magic_name, line, _stack_depth)
       2454 kwargs['local_ns'] = self.get_local_scope(stack_depth)
       2455 with self.builtin_trap:
    -> 2456     result = fn(*args, **kwargs)
       2458 # The code below prevents the output from being displayed
       2459 # when using magics with decorator @output_can_be_silenced
       2460 # when the last Python token in the expression is a ';'.
       2461 if getattr(fn, magic.MAGIC_OUTPUT_CAN_BE_SILENCED, False):

    File ~\anaconda3\Lib\site-packages\IPython\core\magics\execution.py:1185, in ExecutionMagics.timeit(self, line, cell, local_ns)
       1183 for index in range(0, 10):
       1184     number = 10 ** index
    -> 1185     time_number = timer.timeit(number)
       1186     if time_number >= 0.2:
       1187         break

    File ~\anaconda3\Lib\site-packages\IPython\core\magics\execution.py:173, in Timer.timeit(self, number)
        171 gc.disable()
        172 try:
    --> 173     timing = self.inner(it, self.timer)
        174 finally:
        175     if gcold:

    File <magic-timeit>:1, in inner(_it, _timer)

    Cell In[146], line 33, in q22(customer_ds_raw, orders_ds_raw)
         19 q2 = q1.filter(nw.col("c_acctbal") > 0.0).select(
         20     nw.col("c_acctbal").mean().alias("avg_acctbal")
         21 )
         23 q3 = orders_ds.select(nw.col("o_custkey").unique()).with_columns(
         24     nw.col("o_custkey").alias("c_custkey")
         25 )
         27 result = (
         28     q1.join(q3, left_on="c_custkey", right_on="c_custkey")
         29     .filter(nw.col("o_custkey").is_null())
         30     .join(q2, how="cross")
         31     .filter(nw.col("c_acctbal") > nw.col("avg_acctbal"))
         32     .group_by("cntrycode")
    ---> 33     .agg(
         34         nw.col("c_acctbal").len().alias("numcust"),
         35         nw.col("c_acctbal").sum().round(2).alias("totacctbal"),
         36     )
         37     .sort("cntrycode")
         38 )
         40 return nw.to_native(result)

    File ~\anaconda3\Lib\site-packages\narwhals\group_by.py:108, in GroupBy.agg(self, *aggs, **named_aggs)
         26 """
         27 Compute aggregations for each group of a group by operation.
        (...)
        104 β””β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”˜
        105 """
        106 aggs, named_aggs = self._df._flatten_and_extract(*aggs, **named_aggs)
        107 return self._df.__class__(
    --> 108     self._grouped.agg(*aggs, **named_aggs),
        109 )

    File ~\anaconda3\Lib\site-packages\narwhals\_pandas_like\group_by.py:61, in PandasGroupBy.agg(self, *aggs, **named_aggs)
         58     raise ValueError(msg)
         59 output_names.extend(expr._output_names)
    ---> 61 return agg_pandas(
         62     self._grouped,
         63     exprs,
         64     self._keys,
         65     output_names,
         66     self._from_dataframe,
         67     implementation,
         68 )

    File ~\anaconda3\Lib\site-packages\narwhals\_pandas_like\group_by.py:174, in agg_pandas(grouped, exprs, keys, output_names, from_dataframe, implementation)
        171 result_complex = grouped.apply(func)
        173 result = result_complex.reset_index()
    --> 174 return from_dataframe(result.loc[:, output_names])

    File ~\anaconda3\Lib\site-packages\pandas\core\indexing.py:1184, in _LocationIndexer.__getitem__(self, key)
       1182     if self._is_scalar_access(key):
       1183         return self.obj._get_value(*key, takeable=self._takeable)
    -> 1184     return self._getitem_tuple(key)
       1185 else:
       1186     # we by definition only have the 0th axis
       1187     axis = self.axis or 0

    File ~\anaconda3\Lib\site-packages\pandas\core\indexing.py:1377, in _LocIndexer._getitem_tuple(self, tup)
       1374 if self._multi_take_opportunity(tup):
       1375     return self._multi_take(tup)
    -> 1377 return self._getitem_tuple_same_dim(tup)

    File ~\anaconda3\Lib\site-packages\pandas\core\indexing.py:1020, in _LocationIndexer._getitem_tuple_same_dim(self, tup)
       1017 if com.is_null_slice(key):
       1018     continue
    -> 1020 retval = getattr(retval, self.name)._getitem_axis(key, axis=i)
       1021 # We should never have retval.ndim < self.ndim, as that should
       1022 # be handled by the _getitem_lowerdim call above.
       1023 assert retval.ndim == self.ndim

    File ~\anaconda3\Lib\site-packages\pandas\core\indexing.py:1420, in _LocIndexer._getitem_axis(self, key, axis)
       1417 if hasattr(key, "ndim") and key.ndim > 1:
       1418     raise ValueError("Cannot index with multidimensional key")
    -> 1420 return self._getitem_iterable(key, axis=axis)
       1422 # nested tuple slicing
       1423 if is_nested_tuple(key, labels):

    File ~\anaconda3\Lib\site-packages\pandas\core\indexing.py:1360, in _LocIndexer._getitem_iterable(self, key, axis)
       1357 self._validate_key(key, axis)
       1359 # A collection of keys
    -> 1360 keyarr, indexer = self._get_listlike_indexer(key, axis)
       1361 return self.obj._reindex_with_indexers(
       1362     {axis: [keyarr, indexer]}, copy=True, allow_dups=True
       1363 )

    File ~\anaconda3\Lib\site-packages\pandas\core\indexing.py:1558, in _LocIndexer._get_listlike_indexer(self, key, axis)
       1555 ax = self.obj._get_axis(axis)
       1556 axis_name = self.obj._get_axis_name(axis)
    -> 1558 keyarr, indexer = ax._get_indexer_strict(key, axis_name)
       1560 return keyarr, indexer

    File ~\anaconda3\Lib\site-packages\pandas\core\indexes\base.py:6200, in Index._get_indexer_strict(self, key, axis_name)
       6197 else:
       6198     keyarr, indexer, new_indexer = self._reindex_non_unique(keyarr)
    -> 6200 self._raise_if_missing(keyarr, indexer, axis_name)
       6202 keyarr = self.take(indexer)
       6203 if isinstance(key, Index):
       6204     # GH 42790 - Preserve name from an Index

    File ~\anaconda3\Lib\site-packages\pandas\core\indexes\base.py:6252, in Index._raise_if_missing(self, key, indexer, axis_name)
       6249     raise KeyError(f"None of [{key}] are in the [{axis_name}]")
       6251 not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
    -> 6252 raise KeyError(f"{not_found} not in index")

    KeyError: "['numcust', 'totacctbal'] not in index"

What's the best way to resolve this? I hope my explanation is clear enough.