Closed alexander-beedie closed 1 month ago
any
being a python keyword isn't an issue ?
I think the argument name has to be carefully chosen, because it will appear in other places like: https://github.com/pola-rs/polars/issues/11913#issuecomment-1773738939
I was thinking of using strict
, it is general and clear to user that it will not raise. But in the meantime strict
is today link (only ?) to casting operations, so I am not really sure 🤔
any
being a python keyword isn't an issue ?
It actually isn't, as we don't overwrite the global "any", or use "any" in the function. However, I do usually use "any_" as per that comment, as it's better practice overall - should probably do so here too. I'll change that when I finish in the gym ;)
I was thinking of using
strict
Hmm, I don't think it's as obvious in this context what "strict" would imply? Not as obvious (to me) as "any" (or "any_"), at least.
IMHO, any
is not really clear. When I read it the first time, I really thought that it will select only one column ( like randomly ) within the list of columns, which is a weird use case I admit 😅. ignore_missing
would be cristal clear, i guess. Have a good gym time 😁
IMHO,
any
is not really clear. When I read it the first time, I really thought that it will select only one column ( like randomly )
Well now you're making me want to add a cs.random(<n>)
selector... ;)
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 80.82%. Comparing base (
98a2d9b
) to head (8f40dd4
). Report is 11 commits behind head on main.:exclamation: Current head 8f40dd4 differs from pull request most recent head 1639789
Please upload reports for the commit 1639789 to get more accurate results.
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
+1 for ignore_missing
, I was going to suggest that anyway. Having any underscore after any_
is just...I don't like it.
This is similar to the discussion in https://github.com/pola-rs/polars/issues/11913
I'm not sure there needs to be a parameter for this, but if there is one, I wouldn't want to call it any_
as it is not descriptive enough.
Any updates on this one? @alexander-beedie
Any updates on this one? @alexander-beedie
@ritchie46: Yup, just updated it and rebased against the latest code; left this dangling a bit longer than planned! I believe we do want a parameter here as it enables a clean selection pattern - apparently it just took me two months to think of an improved parameter name :))
cs.by_name("baz","foo","zap", match_any=True)
This reads a lot better than before; you're matching by name, and it'll match any of the names. Final feels self-explanatory now? (any_
by itself was indeed too vague).
cs.by_name("baz","foo","zap", match_any=True)
This reads a lot better than before; you're matching by name, and it'll match any of the names. Final feels self-explanatory now? (
any_
Really happy to see something like this @alexander-beedie
Seeing the new name, could this be generalised to the below?
I'd been using this under the name matches_or
, but changed after this PR.
def matches_any(*names: str) -> SelectorType:
"""
Select any columns with an exact match in `names`.
Where no columns match, no operation is performed.
"""
return cs.matches(f"^{"|".join(names)}$")
Where names
could be either a full column name (like by_name
) or a regular expression (like matches
).
With the new flag, we're saying "Select columns A, B, and C, and if any of those don't exist, you can just ignore them.".
No, we're saying "match any of A, B or C", which is even more straightforward.
With that in mind, wouldn't
ignore_missing
make much more sense?
Not to me; labelling columns that weren't matched as "missing" isn't correct, as "missing" implies that they were expected to be there. But if you have set "match_any" then you clearly don't expect some of them to be there, so they aren't missing - they simply aren't present, which isn't the same thing 😜
match_any
could mean anything. I would read it as "Select any column that matches these names", which is in fact the default behavior.
That's not the default behaviour though; the default behaviour requires that you match all columns, not match any column. You only get to match any column if you set... match_any
(which is why I think this parameter name is pretty clear ;)
Really happy to see something like this @alexander-beedie
😎👍
Seeing the new name, could this be generalised to the below? I'd been using this under the name
matches_or
, but changed after this PR.
Not really, as it would tread on the existing cs.matches
selector, which does regex patterns rather than column names - two different behaviours/use-cases. I don't think we should look to blend them together.
(Your pattern match looks like it would fail on names with special chars in them? Could tweak like so: .join(re.escape(nm) for nm in names)
[^1]).
[^1]: Or use:from polars._utils.various import re_escape
so that the escapes better-match the underlying Rust regex crate. However, this is a private function so use judiciously... ;)
Seeing the new name, could this be generalised to the below? I'd been using this under the name
matches_or
, but changed after this PR.Not really, as it would tread on the existing
cs.matches
selector, which does regex patterns rather than column names - two different behaviours/use-cases. I don't think we should look to blend them together.
That makes sense. Looking forward to using the match_any
parameter.
(Your pattern match looks like it would fail on names with special chars in them? Could tweak like so:
.join(re.escape(nm) for nm in names)
.
Ooh good find! Thanks I hadn't tried this yet with any cases containing special chars.
All the references I've found are using it exactly like by_name(..., match_any=True)
.
I'm only just seeing now that matches
(quite rightly) doesn't already do this like contains
, starts/ends_with
.
Rebased / resolved conflicts... (Update: and again 😎)
I really don't want to merge this with the current parameter name. It's extremely unclear to me. What are you matching? Any of what? What would it mean to set it to False, or True?
I went ahead and asked ChatGPT for help:
ME: I have a Python function that selects columns by name. It accepts a list of column names and returns the matching columns. If any of the column names do not have a matching column, it raises an error. I want to add a boolean parameter, which defaults to False, that will change the functionality to ignore column names without matches, instead of raising an error. What should such a parameter be called?
ChatGPT: A suitable name for the boolean parameter could be
ignore_missing
. This name clearly indicates its purpose: to control whether the function should ignore missing columns or raise an error.
Of course the prompt is leading it in that direction - but if you can show me a prompt that returns match_any
I might be able to better understand where you're coming from.
I propose we change it to ignore_missing
and this can be merged.
I propose we change it to ignore_missing and this can be merged.
I dislike the involvement of "missing" as it is incorrect. Unmatched columns are not missing, as that word implies an expectation that they should be there, which is wrong; if you expected them to be there then you wouldn't set a param that allows them not to be:
missing /ˈmɪsɪŋ/ (adjective)
(of a thing) not able to be found because it is not in its expected place.
In general l also prefer parameter names to indicate something active/explicit...
match_any
"I want to match any of the names" ...rather than passive/implicit:
ignore_missing
"if some names aren't present then I don't want to raise an error"if you can show me a prompt that returns match_any I might be able to better understand where you're coming from.
What are you matching? Any of what?
It feels somewhat self-evident from the selector name; by_name
is clearly matching by name, and settingmatch_any=True
means I want to match any of those names 😅🤷
In case it helps, I have two additional suggestions for this argument:
ignore_absent
: IMO absent
doesn't give an "opinion" (I don't have a better word for this), it doesn't say that the column should be there (as missing
implies), simply that it isn't in the data.
ignore_nonexistent
: more or less the same explanation, "select those columns while ignoring those who don't exist"
(One could also replace ignore_
by allow_
)
What about the somewhat vague but rather all-purpose strict
, which almost universally means "don't coerce or ignore anything that isn't exactly specified"? We use it elsewhere and I think it's applicable here.
@stinodego: ok, here we go!
Thanks for the ideas all; we've discussed it thoroughly and are going with require_all
; it's active/explicit like "match_any", avoids the implications of "ignore_missing", but is (we believe!) more self-explanatory than either of those - you either require all of the names to be present or you don't✌️
Closes #15023. Closes #16219.
Adds a new (opt-in) "match_any" param to the
by_name
selector. Default behaviour remains unchanged._Update: have changed the parameter name from "any" (and then "match_any"), to "requireall" and adjusted the default value accordingly :sunglasses:
Before
Default behaviour; expect all given columns to be present:
After
Enable permissive matching on any of the given columns: