pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

pl.col() mixes strangely with selectors #17352

Open douglas-raillard-arm opened 3 months ago

douglas-raillard-arm commented 3 months ago

Description

Not sure how this classifies but since it's likely to have been discussed elsewhere, let's take it as a doc improvement.

pl.col() can be mixed with the polars.selectors API, and doing so produces results that are surprising at first rather than, e.g., a straight-up exception.

import polars as pl
import polars.selectors as cs

df = pl.LazyFrame(dict(a=[1], b=[2]))

# Prints:
#  ┌─────┐
#  │ a   │
#  │ --- │
#  │ i64 │
#  ╞═════╡
#  │ 3   │
#  └─────┘
print(df.select(pl.col('a')     | cs.starts_with('b')).collect())

# Prints:
#  ┌─────┬─────┐
#  │ a   ┆ b   │
#  │ --- ┆ --- │
#  │ i64 ┆ i64 │
#  ╞═════╪═════╡
#  │ 1   ┆ 2   │
#  └─────┴─────┘
print(df.select(cs.by_name('a') | cs.starts_with('b')).collect())

Note that the behavior is consistent with pl.col('a') | pl.col('b').
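
For reference, the 3 in the first output is what an element-wise bitwise OR of the two columns' values gives (1 | 2). A minimal plain-Python sketch of that arithmetic, using lists as stand-ins for the columns rather than Polars itself:

```python
# Stand-ins for the single-row columns "a" and "b" from the example above.
a = [1]
b = [2]

# Element-wise bitwise OR, which is what `|` means between two expressions.
result = [x | y for x, y in zip(a, b)]
print(result)
```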

When discovering the selectors API, I initially tried to combine pl.col() along with other selectors since I was used to using df.select(pl.col('foobar')) to select a column. This can lead to surprising behaviors when combining with selectors.

alexander-beedie commented 3 months ago

I believe this is covered by the following docs on selector set-ops: https://docs.pola.rs/api/python/stable/reference/selectors.html#set-operations

However, it might not be a bad idea simply to raise an error here; allowing only selectors to interop with other selectors via operators would prevent any such ambiguity 🤔 @stinodego, how do you feel about making this a touch stricter?

douglas-raillard-arm commented 3 months ago

@alexander-beedie maybe I'm blind, can you quote the specific part? This doc explains what happens with cs.by_name(), but the main issue here is the different behavior of pl.col() and cs.by_name() with respect to | and what happens when they are mixed together.

I can understand how each is the correct behavior in its own category and how both behaviors are desirable, but the end result feels like an unfortunate API conflict.

Taking inspiration from elsewhere: in Haskell these would simply be two different operators, since they have two fundamentally different meanings, and there would be no problem. Cases where multiple implementations are reasonable depending on the use case are handled by not implementing the operation on the base types and instead providing zero-cost wrappers that "decide" which way to go (e.g. the Sum and Product wrappers for Monoid). That's not really possible here since it would make Expr cumbersome to use.

Alternatively, not allowing selectors and expr to mix would fix that (and still allow some explicit mixing with selectors .to_expr()). Then everything is commutative again with no surprise, everything is still possible, and forbidden combos just raise rather than do something unexpected.
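
To make the stricter option concrete, here is a toy sketch in plain Python. It is not Polars code: the Expr and Selector classes, the mixing_raises helper, and the to_expr() method are all hypothetical stand-ins, and the only assumption is that the two types refuse to combine via | unless the selector is converted explicitly.

```python
class Expr:
    """Toy expression: `|` builds an element-wise OR with another Expr only."""

    def __init__(self, desc):
        self.desc = desc

    def __or__(self, other):
        if not isinstance(other, Expr):
            return NotImplemented  # Python turns this into a TypeError
        return Expr(f"({self.desc} | {other.desc})")


class Selector:
    """Toy selector (deliberately NOT an Expr): `|` is a set union."""

    def __init__(self, names):
        self.names = frozenset(names)

    def __or__(self, other):
        if not isinstance(other, Selector):
            return NotImplemented
        return Selector(self.names | other.names)

    def to_expr(self):
        # Hypothetical explicit escape hatch for intentional mixing.
        return Expr("cols(" + ", ".join(sorted(self.names)) + ")")


def mixing_raises(left, right):
    """Return True if `left | right` raises TypeError."""
    try:
        left | right
    except TypeError:
        return True
    return False
```

In this sketch Expr("a") | Selector({"b"}) raises in either operand order, Selector | Selector is a plain union, and Expr("a") | Selector({"b"}).to_expr() still works for anyone who really wants the expression semantics.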

From a doc point of view, it might be worth stressing that operator overloading behaves differently for selectors than for expressions, with an example to point it out.

stinodego commented 3 months ago

This was already discussed in https://github.com/pola-rs/polars/issues/13757

This is a core issue with how selectors are currently set up as an Expr subclass.
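
A rough illustration of why the subclass relationship produces the behavior reported above, as a toy model in plain Python. The class names mirror the discussion, but this is only a sketch under the assumption that a selector's | falls back to the expression operator when the other operand is not a selector; it is not the actual Polars implementation.

```python
class Expr:
    """Toy expression: `|` means element-wise bitwise OR."""

    def __init__(self, desc):
        self.desc = desc

    def __or__(self, other):
        return Expr(f"({self.desc} | {other.desc})")


class Selector(Expr):
    """Toy selector: a set of column names that is also an Expr."""

    def __init__(self, names):
        self.names = frozenset(names)
        super().__init__("cols(" + ", ".join(sorted(self.names)) + ")")

    def __or__(self, other):
        if isinstance(other, Selector):
            # Selector | Selector: set union of the selected columns.
            return Selector(self.names | other.names)
        # Anything else: fall back to the expression meaning of `|`.
        return super().__or__(other)


col_a = Expr("a")                 # stands in for pl.col('a')
by_name_a = Selector({"a"})       # stands in for cs.by_name('a')
starts_with_b = Selector({"b"})   # stands in for cs.starts_with('b')

mixed = col_a | starts_with_b     # Expr.__or__ runs: a single OR expression
union = by_name_a | starts_with_b # Selector.__or__ runs: a two-column selector
```

Because the left operand decides which __or__ runs, the same-looking line silently switches between "combine values" and "combine column sets".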

We have to revisit this, but doing so would be a breaking change. I have to admit that I am a bit exhausted with API design now after the release of 1.0, so I'll come back to this one a bit later.