pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
30.49k stars 1.98k forks source link

Add a `DataFrame.lookup()` function for easy extraction of a single data point #18435

Open matterhorn103 opened 2 months ago

matterhorn103 commented 2 months ago

Description

Essentially the suggestion is just to provide a function:

df.lookup(row_condition, column)

as syntactic sugar for the more verbose:

df.filter(row_condition)[column].item()

that is currently required to extract a single scalar for a given row and column.

I know the polars philosophy is that method chaining is better than niche methods that do many things at once, but

  1. this is a common sort of query (I personally would use it very frequently);
  2. the current way to do this is really quite non-obvious to newcomers (particular the necessity of the item() step);
  3. this would provide a neat direct equivalent to e.g. pandas' df.at() and Excel's XLOOKUP.

As a better example:

import polars as pl

df = pl.DataFrame({
    "id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35]
})

# Usual approaches
bob_age = df.filter(pl.col("name") == "Bob")["age"].item()
bob_age = df.select(pl.col("age").filter(name="Bob")).item()

# Proposed function
bob_age = df.lookup((pl.col("name") == "Bob"), "age")

An error would be returned if the filter condition didn't reduce the DataFrame to a single row, which would be the case automatically if item() was used in the background.

matterhorn103 commented 2 months ago

(Note that this is to request essentially the same thing as https://github.com/pola-rs/polars/issues/17573 but with a different suggested implementation. That issue was closed fairly immediately on the basis that further overloading of item() is not desirable, which was a fair enough justification. I hoped/figured though that the concrete suggestion here (a new function) is different enough to warrant a new issue and may deserve to remain open rather than being closed.

If this is viewed as repetition of the old issue and essentially spam by me, I'm sorry.)

orlp commented 2 months ago

I'm sorry, I really don't see the added value of lookup over the mentioned 'usual approaches'.

matterhorn103 commented 2 months ago

As mentioned, for newcomers to polars (and this is admittedly based primarily on my own experience) the "usual" way to do it is really not a conceptually obvious approach and is hard to find via a Google search. Indeed, googling leads a user to believe that using item() with the row index and column name is the way to do such a cross-reference, which is what led to my original misguided request in the other issue to extend item() to allow expressions as the row specifier.

All the user thinks they want to do is check the value for a given row and column without changing the dataframe, so the idea that they have to make the dataframe smaller first (even if it's a non-destructive operation due to the way polars works) is counterintuitive.

And the user might be tempted to do that using e.g. bob_age = df[1]["age"] or bob_age = df["age"][1], but as discussed this is not considered the correct way to do it, requires the user to know the index of the row, and while the first expression returns a Series the second expression returns an int.

When narrowing down to a 1x1 df it is also generally easier to do the filter operation before the select operation, and that fact is also non-obvious to a beginner whose gut feeling is that the necessary operation is a simultaneous cross-reference rather than a step-wise one.

I'd also say it's not clear whether and when item() is even needed, as it isn't obvious whether an expression that returns a 1x1 dataframe is always treated as a scalar whenever it is passed to something in polars that takes a scalar, or only sometimes and sometimes getting the actual scalar with item() is necessary. For example, if I want to make the ages relative I can do df.with_columns(pl.col("age") / pl.col("age").min()) or df.with_columns(pl.col("age") / df.filter(name="Bob")["age"]) and both work without needing to use item(), but is this general polars behaviour across all functions and expressions? If so then there is of course potentially further confusion when trying to use the value with anything outside of polars, and it's found that this has to be done with item(), because e.g. int(df.filter(pl.col("name") == "Bob")["age"]) doesn't work.

Moreover, people coming from other frameworks and dataframe-handling libraries where direct analogues can be found might expect the availability of such a function.

matterhorn103 commented 2 months ago

Also, if lookup() were flexible enough to take indices as row and column specifiers and expressions as column specifiers, this would also add the advantage of alleviating some of the confusion and complexity arises from the different approaches that are required depending on what combination of Expr/int and Expr/str/int are used – filter() or row() or df[i], or select() or column() or get_column() or df[column], and whether a final item() is necessary.

In that case, the function could be defined as something like:

def lookup(self: pl.DataFrame, row: pl.Expr | int, column: pl.Expr | str | int):
    if isinstance(row, pl.Expr):
        df_row = self.filter(row)
        if len(df_row) > 1:
            raise pl.exceptions.TooManyRowsReturnedError()
    else:
        df_row = self[row]
    if isinstance(column, (pl.Expr, str)):
        return df_row.select(column).item()
    else:
        return df_row.to_series(column).item()

or, as an alternative that I guess allows for better optimization of which methods are used under the hood in each case:

def lookup(self: pl.DataFrame, row: pl.Expr | int, column: pl.Expr | str | int):
    match row, column:
        case pl.Expr(), pl.Expr():
            result = self.filter(row).select(column).item()
        case pl.Expr(), str():
            result = self.row(by_predicate=row, named=True)[column]
        case pl.Expr(), int():
            result = self.row(by_predicate=row)[column]
        case int(), pl.Expr():
            result = self[row].select(column).item()
        case int(), str():
            result = self.row(row, named=True)[column]
        case int(), int():
            result = self.row(row)[column]
    return result

In both cases:

pl.DataFrame.lookup = lookup

# All of the following would normally require different chains of methods
bob_age = df.lookup(pl.col("name") == "Bob", "age") # Instead of e.g. df.filter(pl.col("name") == "Bob")["age"].item()
bob_age = df.lookup(pl.col("name") == "Bob", 2)     # Instead of e.g. df.filter(pl.col("name") == "Bob").to_series(2).item()
bob_age = df.lookup(1, "age")                       # Instead of e.g. df.row(1, named=True)["age"] or df[1]["age"].item()
bob_age = df.lookup(1, 2)                           # Instead of e.g. df.row(1)[2]

# While queries that aren't specific enough neatly return an error
bob_age = df.lookup(pl.col("name").str.len_chars() < 6, "age")  # Returns polars.exceptions.TooManyRowsReturnedError
bob_age = df.lookup(pl.col("name") == "Bob", pl.col(pl.Int64))  # Returns a ValueError
movpasd commented 1 week ago

For me, this is less about easing transition from other libraries, but rather as syntactic sugar for a very common operation in day-to-day use. Specifically, it is about being able to look up a specific item based on a column acting as an index known to be unique. Having a distinct method would express the intent of the code a lot better.

On the basis of this usecase, I might suggest that the column name and condition be inverted, so that the condition can use keyword arguments. This would shorten df.lookup(pl.col("myindex") == myvalue, "mycol") to df.lookup("mycol", myindex=myvalue), as with .select(). A counter-argument to this would be that it would preclude the possibility of additional keyword arguments, and that it would be inconsistent with __getitem__'s API.

An alternative would be to add an index keyword to .item(), which might be a more palatable alternative to #17573 since it doesn't introduce another overload. This wouldn't allow arbitrary selection expressions, but that might be an advantage, forcing more complex expressions to use a more explicit form.