Open MarcoGorelli opened 3 weeks ago
I'll implement Q9 and Q10
@MarcoGorelli I noticed that Narwhals doesn't yet have 'contains' in the ExprStringNamespace, which is used in query 9.
ah, thanks @ugohuche! Looks like we've found the next issue :) Would you like to work on implementing `str.contains`? If you look at `str.ends_with`, I think it might be quite similar. Feel free to ask if you get stuck.
Once that's done, we can get back to this.
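For context on what such an implementation boils down to: under a pandas backend, a Narwhals-level `str.contains` would ultimately delegate to pandas' own `Series.str.contains`. A minimal sketch of that delegation (the wrapper name `str_contains` is hypothetical, not Narwhals' actual internals):

```python
import pandas as pd

# Hedged sketch: a pandas-backed `str.contains` would delegate to
# pandas' Series.str.contains, which treats the pattern as a regex
# by default (so "13|31" matches either prefix).
def str_contains(series: pd.Series, pattern: str) -> pd.Series:
    return series.str.contains(pattern)

phones = pd.Series(["13-555", "99-123", "31-000"])
print(str_contains(phones, "13|31").tolist())  # [True, False, True]
```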
Yeah, I'd love to give it a try
Just for context: the idea of implementing these queries is both performance benchmarking and feature availability. That is, if we are able to run all (or most) of them, that is a proxy for supporting a large enough feature set. Is this the case @MarcoGorelli?
yes, exactly!
I'll take Q11
edit: Narwhals doesn't have:
- `how="cross"` for `DataFrame.join()`
- `round`
There is a user warning:
```
UserWarning: Found complex group-by expression, which can't be expressed efficiently with the pandas API. If you can, please rewrite your query such that group-by aggregations are simple (e.g. mean, std, min, max, ...).
  return agg_pandas(
```
I haven't tried to rewrite the query; I'm just reporting my findings up to now.
@MarcoGorelli @FBruzzesi
thanks for reporting these! I think we can live with the userwarning for now and not worry about it, as long as it executes
Fancy opening a separate issue about `how='cross'` and `round`? They definitely seem generically useful enough that we should implement them.
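To illustrate the semantics being requested: pandas already exposes a cross join via `merge(how="cross")` (available since pandas 1.2), which pairs every row of the left frame with every row of the right frame. A small sketch with made-up data:

```python
import pandas as pd

# Illustration of the requested cross-join semantics using pandas'
# merge(how="cross"): every row of the left frame is paired with
# every row of the right frame.
left = pd.DataFrame({"a": [1, 2]})
right = pd.DataFrame({"b": ["x", "y", "z"]})

crossed = left.merge(right, how="cross")
print(crossed.shape[0])  # 2 * 3 = 6 rows
```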
Findings from Q21:
- `DataFrame.join` doesn't have an `on` keyword (I believe by design); I replaced it with `left_on=key, right_on=key`. Annoying but doable.
- `Expr` does not support `.len()` (opened a feature request already: #322). While `nw.len()` can work in some situations, there is a broad spectrum of cases it does not cover.

Findings from Q20:
- Using `.agg((nw.col("l_quantity").sum() * 0.5).alias("sum_quantity"))` fires the `apply` step, which takes a very long time (total query time 15+ minutes). Breaking it down into two steps:
  ```python
  .agg(nw.col("l_quantity").sum().alias("sum_quantity"))
  .with_columns(sum_quantity=nw.col("sum_quantity") * 0.5)
  ```
  brings the total time down to ~3-4.5 secs. Maybe we can add examples of how "rewrite your query such that group-by aggregations are simple (e.g. mean, std, min, max, ...)" can be done. Also related to #299.
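The two-step rewrite above can be sketched in plain pandas terms (a sketch with illustrative key/column names, not the actual TPC-H data):

```python
import pandas as pd

# Sketch of the same rewrite in plain pandas: instead of asking the
# group-by to compute `sum * 0.5` as one complex expression (which
# forces a slow per-group `apply`), run a plain `sum` aggregation
# first, then scale the resulting column afterwards.
df = pd.DataFrame({"key": ["a", "a", "b"], "l_quantity": [10.0, 20.0, 30.0]})

out = df.groupby("key", as_index=False).agg(sum_quantity=("l_quantity", "sum"))
out["sum_quantity"] = out["sum_quantity"] * 0.5
print(out["sum_quantity"].tolist())  # [15.0, 15.0]
```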
- If using a `date` instead of a `datetime` object, the pandas backend ends up with an error when filtering on a datetime column. Polars has no issues.
yeah, great idea!
If using date instead of datetime object, pandas backend ends up with an error when filtering on a datetime column. Polars has no issues
does this also happen for pyarrow-backed pandas?
pyarrow backend has no issues!
Hi @MarcoGorelli, query 13 has a method `not_` which isn't implemented in Narwhals yet. Is it something we can add?
I think we might as well just use `~` for that (e.g., instead of `nw.col('a').not_()`, use `(~nw.col('a'))`).
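The equivalence works because `~` is Python's invert operator (`__invert__`), which both pandas and Polars overload for elementwise boolean negation. A tiny pandas illustration:

```python
import pandas as pd

# `~` is Python's __invert__ operator; pandas (and Polars) overload it
# for elementwise boolean negation, which is why `~nw.col('a')` can
# stand in for a dedicated `not_` method.
s = pd.Series([True, False, True])
negated = ~s
print(negated.tolist())  # [False, True, False]
```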
I'll use that instead.
Should I open a new issue for the `when.then.otherwise()` methods?
Thanks! I think there's already one open for that
Oh okay, I didn't see it
Hello guys, I've been working on Q22. When I run this query:
```python
from typing import Any

import narwhals as nw


def q22(
    customer_ds_raw: Any,
    orders_ds_raw: Any,
) -> Any:
    customer_ds = nw.from_native(customer_ds_raw)
    orders_ds = nw.from_native(orders_ds_raw)

    q1 = (
        customer_ds.with_columns(nw.col("c_phone").str.slice(0, 2).alias("cntrycode"))
        .filter(nw.col("cntrycode").str.contains("13|31|23|29|30|18|17"))
        .select("c_acctbal", "c_custkey", "cntrycode")
    )

    q2 = q1.filter(nw.col("c_acctbal") > 0.0).select(
        nw.col("c_acctbal").mean().alias("avg_acctbal")
    )

    q3 = orders_ds.select(nw.col("o_custkey").unique()).with_columns(
        nw.col("o_custkey").alias("c_custkey")
    )

    result = (
        q1.join(q3, left_on="c_custkey", right_on="c_custkey")
        .filter(nw.col("o_custkey").is_null())
        .join(q2, how="cross")
        .filter(nw.col("c_acctbal") > nw.col("avg_acctbal"))
        .group_by("cntrycode")
        .agg(
            nw.col("c_acctbal").len().alias("numcust"),
            nw.col("c_acctbal").sum().round(2).alias("totacctbal"),
        )
        .sort("cntrycode")
    )

    return nw.to_native(result)


tool = 'pandas'
fn = IO_FUNCS[tool]
timings = %timeit -o q22(fn(customer), fn(orders))
results[tool] = timings.all_runs
```
I get an error: `AttributeError: 'Expr' object has no attribute 'count'`.
But when I change `.count()` to `.len()`, I get another error:
```
C:\Users\Abano\anaconda3\Lib\site-packages\narwhals\_pandas_like\group_by.py:61: UserWarning: Found complex group-by expression, which can't be expressed efficiently with the pandas API. If you can, please rewrite your query such that group-by aggregations are simple (e.g. mean, std, min, max, ...).
  return agg_pandas(

KeyError                                  Traceback (most recent call last)
Cell In[156], line 3
      1 tool = 'pandas'
      2 fn = IO_FUNCS[tool]
----> 3 timings = get_ipython().run_line_magic('timeit', '-o q22(fn(customer), fn(orders))')
      4 results[tool] = timings.all_runs

File ~\anaconda3\Lib\site-packages\IPython\core\interactiveshell.py:2456, in InteractiveShell.run_line_magic(self, magic_name, line, _stack_depth)
   2454 kwargs['local_ns'] = self.get_local_scope(stack_depth)
   2455 with self.builtin_trap:
-> 2456     result = fn(*args, **kwargs)
   2458 # The code below prevents the output from being displayed
   2459 # when using magics with decorator @output_can_be_silenced
   2460 # when the last Python token in the expression is a ';'.
   2461 if getattr(fn, magic.MAGIC_OUTPUT_CAN_BE_SILENCED, False):

File ~\anaconda3\Lib\site-packages\IPython\core\magics\execution.py:1185, in ExecutionMagics.timeit(self, line, cell, local_ns)
   1183 for index in range(0, 10):
   1184     number = 10 ** index
-> 1185     time_number = timer.timeit(number)
   1186     if time_number >= 0.2:
   1187         break

File ~\anaconda3\Lib\site-packages\IPython\core\magics\execution.py:173, in Timer.timeit(self, number)
    171 gc.disable()
    172 try:
--> 173     timing = self.inner(it, self.timer)
    174 finally:
    175     if gcold:

File <magic-timeit>:1, in inner(_it, _timer)

Cell In[146], line 33, in q22(customer_ds_raw, orders_ds_raw)
     19 q2 = q1.filter(nw.col("c_acctbal") > 0.0).select(
     20     nw.col("c_acctbal").mean().alias("avg_acctbal")
     21 )
     23 q3 = orders_ds.select(nw.col("o_custkey").unique()).with_columns(
     24     nw.col("o_custkey").alias("c_custkey")
     25 )
     27 result = (
     28     q1.join(q3, left_on="c_custkey", right_on="c_custkey")
     29     .filter(nw.col("o_custkey").is_null())
     30     .join(q2, how="cross")
     31     .filter(nw.col("c_acctbal") > nw.col("avg_acctbal"))
     32     .group_by("cntrycode")
---> 33     .agg(
     34         nw.col("c_acctbal").len().alias("numcust"),
     35         nw.col("c_acctbal").sum().round(2).alias("totacctbal"),
     36     )
     37     .sort("cntrycode")
     38 )
     40 return nw.to_native(result)

File ~\anaconda3\Lib\site-packages\narwhals\group_by.py:108, in GroupBy.agg(self, *aggs, **named_aggs)
     26 """
     27 Compute aggregations for each group of a group by operation.
     28 (...)
    104 └─────┴─────┴─────┘
    105 """
    106 aggs, named_aggs = self._df._flatten_and_extract(*aggs, **named_aggs)
    107 return self._df.__class__(
--> 108     self._grouped.agg(*aggs, **named_aggs),
    109 )

File ~\anaconda3\Lib\site-packages\narwhals\_pandas_like\group_by.py:61, in PandasGroupBy.agg(self, *aggs, **named_aggs)
     58     raise ValueError(msg)
     59 output_names.extend(expr._output_names)
---> 61 return agg_pandas(
     62     self._grouped,
     63     exprs,
     64     self._keys,
     65     output_names,
     66     self._from_dataframe,
     67     implementation,
     68 )

File ~\anaconda3\Lib\site-packages\narwhals\_pandas_like\group_by.py:174, in agg_pandas(grouped, exprs, keys, output_names, from_dataframe, implementation)
    171 result_complex = grouped.apply(func)
    173 result = result_complex.reset_index()
--> 174 return from_dataframe(result.loc[:, output_names])

File ~\anaconda3\Lib\site-packages\pandas\core\indexing.py:1184, in _LocationIndexer.__getitem__(self, key)
   1182 if self._is_scalar_access(key):
   1183     return self.obj._get_value(*key, takeable=self._takeable)
-> 1184     return self._getitem_tuple(key)
   1185 else:
   1186     # we by definition only have the 0th axis
   1187     axis = self.axis or 0

File ~\anaconda3\Lib\site-packages\pandas\core\indexing.py:1377, in _LocIndexer._getitem_tuple(self, tup)
   1374 if self._multi_take_opportunity(tup):
   1375     return self._multi_take(tup)
-> 1377 return self._getitem_tuple_same_dim(tup)

File ~\anaconda3\Lib\site-packages\pandas\core\indexing.py:1020, in _LocationIndexer._getitem_tuple_same_dim(self, tup)
   1017 if com.is_null_slice(key):
   1018     continue
-> 1020 retval = getattr(retval, self.name)._getitem_axis(key, axis=i)
   1021 # We should never have retval.ndim < self.ndim, as that should
   1022 # be handled by the _getitem_lowerdim call above.
   1023 assert retval.ndim == self.ndim

File ~\anaconda3\Lib\site-packages\pandas\core\indexing.py:1420, in _LocIndexer._getitem_axis(self, key, axis)
   1417 if hasattr(key, "ndim") and key.ndim > 1:
   1418     raise ValueError("Cannot index with multidimensional key")
-> 1420 return self._getitem_iterable(key, axis=axis)
   1422 # nested tuple slicing
   1423 if is_nested_tuple(key, labels):

File ~\anaconda3\Lib\site-packages\pandas\core\indexing.py:1360, in _LocIndexer._getitem_iterable(self, key, axis)
   1357 self._validate_key(key, axis)
   1359 # A collection of keys
-> 1360 keyarr, indexer = self._get_listlike_indexer(key, axis)
   1361 return self.obj._reindex_with_indexers(
   1362     {axis: [keyarr, indexer]}, copy=True, allow_dups=True
   1363 )

File ~\anaconda3\Lib\site-packages\pandas\core\indexing.py:1558, in _LocIndexer._get_listlike_indexer(self, key, axis)
   1555 ax = self.obj._get_axis(axis)
   1556 axis_name = self.obj._get_axis_name(axis)
-> 1558 keyarr, indexer = ax._get_indexer_strict(key, axis_name)
   1560 return keyarr, indexer

File ~\anaconda3\Lib\site-packages\pandas\core\indexes\base.py:6200, in Index._get_indexer_strict(self, key, axis_name)
   6197 else:
   6198     keyarr, indexer, new_indexer = self._reindex_non_unique(keyarr)
-> 6200 self._raise_if_missing(keyarr, indexer, axis_name)
   6202 keyarr = self.take(indexer)
   6203 if isinstance(key, Index):
   6204     # GH 42790 - Preserve name from an Index

File ~\anaconda3\Lib\site-packages\pandas\core\indexes\base.py:6252, in Index._raise_if_missing(self, key, indexer, axis_name)
   6249     raise KeyError(f"None of [{key}] are in the [{axis_name}]")
   6251 not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
-> 6252 raise KeyError(f"{not_found} not in index")

KeyError: "['numcust', 'totacctbal'] not in index"
```
What's the best way to resolve this? I hope my explanation is clear enough.
In https://github.com/narwhals-dev/narwhals/tree/main/tpch/notebooks, we implement several TPC-H queries, and run them via Narwhals
We currently have the first 7 there, and should work on adding more
To track progress:
You can find reference implementations for these queries here: https://github.com/pola-rs/tpch/tree/main/queries/polars
The task here is:
- `q5_pandas_native`: we don't currently have reference implementations for these in https://github.com/pola-rs/tpch/tree/main/queries/pandas, so we skip them.
- Replace `def q5` with something like the query you find in https://github.com/pola-rs/tpch/tree/main/queries/polars. You will need to modify it a bit: check the current notebooks and try to follow how they do it. You'll also need to get it to work for Narwhals, which means:
  - `nw.col` instead of `pl.col`
  - `nw.from_native` at the start of the function for each input, and `nw.to_native` at the end
- Once you have a notebook which executes, please share it here!

Note: if you see `q5_pandas` or `q5_ibis`, you can ignore them for the purpose of this issue, no need to copy them. Let's only focus on making sure that the Narwhals API is extensive enough for these queries.