Open matterhorn103 opened 2 months ago
(Note that this is to request essentially the same thing as https://github.com/pola-rs/polars/issues/17573 but with a different suggested implementation. That issue was closed fairly immediately on the basis that further overloading of item()
is not desirable, which was a fair enough justification. I hoped/figured though that the concrete suggestion here (a new function) is different enough to warrant a new issue and may deserve to remain open rather than being closed.
If this is viewed as repetition of the old issue and essentially spam by me, I'm sorry.)
I'm sorry, I really don't see the added value of lookup
over the mentioned 'usual approaches'.
As mentioned, for newcomers to polars (and this is admittedly based primarily on my own experience) the "usual" way to do it is really not a conceptually obvious approach and is hard to find via a Google search. Indeed, googling leads a user to believe that using item()
with the row index and column name is the way to do such a cross-reference, which is what led to my original misguided request in the other issue to extend item()
to allow expressions as the row specifier.
All the user thinks they want to do is check the value for a given row and column without changing the dataframe, so the idea that they have to make the dataframe smaller first (even if it's a non-destructive operation due to the way polars works) is counterintuitive.
And the user might be tempted to do that using e.g. bob_age = df[1]["age"]
or bob_age = df["age"][1]
, but as discussed this is not considered the correct way to do it, requires the user to know the index of the row, and while the first expression returns a Series
the second expression returns an int
.
When narrowing down to a 1x1 df it is also generally easier to do the filter operation before the select operation, and that fact is also non-obvious to a beginner whose gut feeling is that the necessary operation is a simultaneous cross-reference rather than a step-wise one.
I'd also say it's not clear whether and when item()
is even needed, as it isn't obvious whether an expression that returns a 1x1 dataframe is always treated as a scalar whenever it is passed to something in polars that takes a scalar, or only sometimes and sometimes getting the actual scalar with item()
is necessary. For example, if I want to make the ages relative I can do df.with_columns(pl.col("age") / pl.col("age").min())
or df.with_columns(pl.col("age") / df.filter(name="Bob")["age"])
and both work without needing to use item()
, but is this general polars behaviour across all functions and expressions? If so then there is of course potentially further confusion when trying to use the value with anything outside of polars, and it's found that this has to be done with item()
, because e.g. int(df.filter(pl.col("name") == "Bob")["age"])
doesn't work.
Moreover, people coming from other frameworks and dataframe-handling libraries where direct analogues can be found might expect the availability of such a function.
Also, if lookup()
were flexible enough to take indices as row and column specifiers and expressions as column specifiers, this would also add the advantage of alleviating some of the confusion and complexity arises from the different approaches that are required depending on what combination of Expr/int and Expr/str/int are used – filter()
or row()
or df[i]
, or select()
or column()
or get_column()
or df[column]
, and whether a final item()
is necessary.
In that case, the function could be defined as something like:
def lookup(self: pl.DataFrame, row: pl.Expr | int, column: pl.Expr | str | int):
if isinstance(row, pl.Expr):
df_row = self.filter(row)
if len(df_row) > 1:
raise pl.exceptions.TooManyRowsReturnedError()
else:
df_row = self[row]
if isinstance(column, (pl.Expr, str)):
return df_row.select(column).item()
else:
return df_row.to_series(column).item()
or, as an alternative that I guess allows for better optimization of which methods are used under the hood in each case:
def lookup(self: pl.DataFrame, row: pl.Expr | int, column: pl.Expr | str | int):
match row, column:
case pl.Expr(), pl.Expr():
result = self.filter(row).select(column).item()
case pl.Expr(), str():
result = self.row(by_predicate=row, named=True)[column]
case pl.Expr(), int():
result = self.row(by_predicate=row)[column]
case int(), pl.Expr():
result = self[row].select(column).item()
case int(), str():
result = self.row(row, named=True)[column]
case int(), int():
result = self.row(row)[column]
return result
In both cases:
pl.DataFrame.lookup = lookup
# All of the following would normally require different chains of methods
bob_age = df.lookup(pl.col("name") == "Bob", "age") # Instead of e.g. df.filter(pl.col("name") == "Bob")["age"].item()
bob_age = df.lookup(pl.col("name") == "Bob", 2) # Instead of e.g. df.filter(pl.col("name") == "Bob").to_series(2).item()
bob_age = df.lookup(1, "age") # Instead of e.g. df.row(1, named=True)["age"] or df[1]["age"].item()
bob_age = df.lookup(1, 2) # Instead of e.g. df.row(1)[2]
# While queries that aren't specific enough neatly return an error
bob_age = df.lookup(pl.col("name").str.len_chars() < 6, "age") # Returns polars.exceptions.TooManyRowsReturnedError
bob_age = df.lookup(pl.col("name") == "Bob", pl.col(pl.Int64)) # Returns a ValueError
For me, this is less about easing transition from other libraries, but rather as syntactic sugar for a very common operation in day-to-day use. Specifically, it is about being able to look up a specific item based on a column acting as an index known to be unique. Having a distinct method would express the intent of the code a lot better.
On the basis of this usecase, I might suggest that the column name and condition be inverted, so that the condition can use keyword arguments. This would shorten df.lookup(pl.col("myindex") == myvalue, "mycol")
to df.lookup("mycol", myindex=myvalue)
, as with .select()
. A counter-argument to this would be that it would preclude the possibility of additional keyword arguments, and that it would be inconsistent with __getitem__
's API.
An alternative would be to add an index
keyword to .item()
, which might be a more palatable alternative to #17573 since it doesn't introduce another overload. This wouldn't allow arbitrary selection expressions, but that might be an advantage, forcing more complex expressions to use a more explicit form.
Description
Essentially the suggestion is just to provide a function:
df.lookup(row_condition, column)
as syntactic sugar for the more verbose:
df.filter(row_condition)[column].item()
that is currently required to extract a single scalar for a given row and column.
I know the polars philosophy is that method chaining is better than niche methods that do many things at once, but
item()
step);df.at()
and Excel'sXLOOKUP
.As a better example:
An error would be returned if the filter condition didn't reduce the DataFrame to a single row, which would be the case automatically if
item()
was used in the background.