pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
27.23k stars 1.67k forks source link

feat(python): Add `require_all` parameter to the `by_name` column selector #15028

Closed alexander-beedie closed 1 month ago

alexander-beedie commented 3 months ago

Closes #15023. Closes #16219.

Adds a new (opt-in) "match_any" param to the by_name selector. Default behaviour remains unchanged.

_Update: have changed the parameter name from "any" (and then "match_any"), to "requireall" and adjusted the default value accordingly :sunglasses:

Before

Default behaviour; expect all given columns to be present:

import polars as pl
import polars.selectors as cs

df = pl.DataFrame({
    "foo": ["x", "y"],
    "bar": [123, 456],
    "baz": [2.0, 5.5],
})
df.select( cs.by_name("stroopwafel","baz","!!") )
# ColumnNotFoundError: stroopwafel

After

Enable permissive matching on any of the given columns:

df.select( cs.by_name("stroopwafel","baz","!!", require_all=False) )
# shape: (2, 1)
# ┌─────┐
# │ baz │
# │ --- │
# │ f64 │
# ╞═════╡
# │ 2.0 │
# │ 5.5 │
# └─────┘
gab23r commented 3 months ago

any being a python keyword isn't an issue ? I think the argument name has to be carefully chosen, because it will appear in other places like: https://github.com/pola-rs/polars/issues/11913#issuecomment-1773738939

I was thinking of using strict, it is general and clear to user that it will not raise. But in the meantime strict is today link (only ?) to casting operations, so I am not really sure 🤔

alexander-beedie commented 3 months ago

any being a python keyword isn't an issue ?

It actually isn't, as we don't overwrite the global "any", or use "any" in the function. However, I do usually use "any_" as per that comment, as it's better practice overall - should probably do so here too. I'll change that when I finish in the gym ;)

I was thinking of using strict

Hmm, I don't think it's as obvious in this context what "strict" would imply? Not as obvious (to me) as "any" (or "any_"), at least.

gab23r commented 3 months ago

IMHO, any is not really clear. When I read it the first time, I really thought that it will select only one column ( like randomly ) within the list of columns, which is a weird use case I admit 😅. ignore_missing would be cristal clear, i guess. Have a good gym time 😁

alexander-beedie commented 3 months ago

IMHO, any is not really clear. When I read it the first time, I really thought that it will select only one column ( like randomly )

Well now you're making me want to add a cs.random(<n>) selector... ;)

codecov[bot] commented 3 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 80.82%. Comparing base (98a2d9b) to head (8f40dd4). Report is 11 commits behind head on main.

:exclamation: Current head 8f40dd4 differs from pull request most recent head 1639789

Please upload reports for the commit 1639789 to get more accurate results.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #15028 +/- ## ======================================= Coverage 80.82% 80.82% ======================================= Files 1393 1393 Lines 179341 179345 +4 Branches 2918 2920 +2 ======================================= + Hits 144955 144964 +9 + Misses 33884 33879 -5 Partials 502 502 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

mcrumiller commented 3 months ago

+1 for ignore_missing, I was going to suggest that anyway. Having any underscore after any_ is just...I don't like it.

stinodego commented 3 months ago

This is similar to the discussion in https://github.com/pola-rs/polars/issues/11913

I'm not sure there needs to be a parameter for this, but if there is one, I wouldn't want to call it any_ as it is not descriptive enough.

ritchie46 commented 2 months ago

Any updates on this one? @alexander-beedie

alexander-beedie commented 1 month ago

Any updates on this one? @alexander-beedie

@ritchie46: Yup, just updated it and rebased against the latest code; left this dangling a bit longer than planned! I believe we do want a parameter here as it enables a clean selection pattern - apparently it just took me two months to think of an improved parameter name :))

cs.by_name("baz","foo","zap", match_any=True)

This reads a lot better than before; you're matching by name, and it'll match any of the names. Final feels self-explanatory now? (any_ by itself was indeed too vague).

dangotbanned commented 1 month ago
cs.by_name("baz","foo","zap", match_any=True)

This reads a lot better than before; you're matching by name, and it'll match any of the names. Final feels self-explanatory now? (any_

Really happy to see something like this @alexander-beedie

Seeing the new name, could this be generalised to the below? I'd been using this under the name matches_or, but changed after this PR.


def matches_any(*names: str) -> SelectorType:
    """
    Select any columns with an exact match in `names`.

    Where no columns match, no operation is performed.
    """
    return cs.matches(f"^{"|".join(names)}$")

Where names could be either a full column name (like by_name) or a regular expression (like matches).

alexander-beedie commented 1 month ago

With the new flag, we're saying "Select columns A, B, and C, and if any of those don't exist, you can just ignore them.".

No, we're saying "match any of A, B or C", which is even more straightforward.

With that in mind, wouldn't ignore_missing make much more sense?

Not to me; labelling columns that weren't matched as "missing" isn't correct, as "missing" implies that they were expected to be there. But if you have set "match_any" then you clearly don't expect some of them to be there, so they aren't missing - they simply aren't present, which isn't the same thing 😜

match_any could mean anything. I would read it as "Select any column that matches these names", which is in fact the default behavior.

That's not the default behaviour though; the default behaviour requires that you match all columns, not match any column. You only get to match any column if you set... match_any (which is why I think this parameter name is pretty clear ;)

alexander-beedie commented 1 month ago

Really happy to see something like this @alexander-beedie

😎👍

Seeing the new name, could this be generalised to the below? I'd been using this under the name matches_or, but changed after this PR.

Not really, as it would tread on the existing cs.matches selector, which does regex patterns rather than column names - two different behaviours/use-cases. I don't think we should look to blend them together.

(Your pattern match looks like it would fail on names with special chars in them? Could tweak like so: .join(re.escape(nm) for nm in names)[^1]).

[^1]: Or use:from polars._utils.various import re_escape so that the escapes better-match the underlying Rust regex crate. However, this is a private function so use judiciously... ;)

dangotbanned commented 1 month ago

Seeing the new name, could this be generalised to the below? I'd been using this under the name matches_or, but changed after this PR.

Not really, as it would tread on the existing cs.matches selector, which does regex patterns rather than column names - two different behaviours/use-cases. I don't think we should look to blend them together.

That makes sense. Looking forward to using the match_any parameter.

(Your pattern match looks like it would fail on names with special chars in them? Could tweak like so: .join(re.escape(nm) for nm in names).

Ooh good find! Thanks I hadn't tried this yet with any cases containing special chars. All the references I've found are using it exactly like by_name(..., match_any=True).

I'm only just seeing now that matches (quite rightly) doesn't already do this like contains, starts/ends_with.

alexander-beedie commented 1 month ago

Rebased / resolved conflicts... (Update: and again 😎)

stinodego commented 1 month ago

I really don't want to merge this with the current parameter name. It's extremely unclear to me. What are you matching? Any of what? What would it mean to set it to False, or True?

I went ahead and asked ChatGPT for help:

ME: I have a Python function that selects columns by name. It accepts a list of column names and returns the matching columns. If any of the column names do not have a matching column, it raises an error. I want to add a boolean parameter, which defaults to False, that will change the functionality to ignore column names without matches, instead of raising an error. What should such a parameter be called?

ChatGPT: A suitable name for the boolean parameter could be ignore_missing. This name clearly indicates its purpose: to control whether the function should ignore missing columns or raise an error.

Of course the prompt is leading it in that direction - but if you can show me a prompt that returns match_any I might be able to better understand where you're coming from.

I propose we change it to ignore_missing and this can be merged.

alexander-beedie commented 1 month ago

I propose we change it to ignore_missing and this can be merged.

I dislike the involvement of "missing" as it is incorrect. Unmatched columns are not missing, as that word implies an expectation that they should be there, which is wrong; if you expected them to be there then you wouldn't set a param that allows them not to be:

missing /ˈmɪsɪŋ/ (adjective)
(of a thing) not able to be found because it is not in its expected place.

In general l also prefer parameter names to indicate something active/explicit...

...rather than passive/implicit:

ChatGPT 4o result:

if you can show me a prompt that returns match_any I might be able to better understand where you're coming from.

chatgpt4o_prompt_param_name

What are you matching? Any of what?

It feels somewhat self-evident from the selector name; by_name is clearly matching by name, and settingmatch_any=True means I want to match any of those names 😅🤷

etiennebacher commented 1 month ago

In case it helps, I have two additional suggestions for this argument:

(One could also replace ignore_ by allow_)

mcrumiller commented 1 month ago

What about the somewhat vague but rather all-purpose strict, which almost universally means "don't coerce or ignore anything that isn't exactly specified"? We use it elsewhere and I think it's applicable here.

alexander-beedie commented 1 month ago

@stinodego: ok, here we go!

Thanks for the ideas all; we've discussed it thoroughly and are going with require_all; it's active/explicit like "match_any", avoids the implications of "ignore_missing", but is (we believe!) more self-explanatory than either of those - you either require all of the names to be present or you don't✌️