Closed: zundertj closed this issue 1 year ago
I don't think we can. with is a Python keyword.
Yep, you are right, that won't fly. Updated the post to reflect that.
Dropping with_column in favour of with_columns is definitely a good move; in the same vein as the other abbreviations, how about shortening to with_cols? Everyone's muscle memory (and autocomplete) will be halfway there already... ;)
Note: passing more than one column to with_column would look really odd, whereas passing one or more to with_columns would look pretty normal, hence favouring deprecation of the singular form.
Also, can we make the currently-experimental calling pattern enabled by default? It has proven really popular over here, and I shudder to think what would happen if it went away... 😅
Note that with_columns already supports the singular form, so it is not just normal, it is the current behaviour.
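As a quick illustration of that point (a minimal sketch; the column names are purely for demonstration):

import polars as pl

df = pl.DataFrame({"a": [1, 2, 3]})

# Singular form: a single expression is already accepted
df.with_columns(pl.col("a") * 2)

# Plural form: a list of expressions
df.with_columns([pl.col("a") * 2, (pl.col("a") + 1).alias("b")])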
So I think we have: with_column / with_columns => #6122

My hesitation with with_cols or with_col is that it does not seem to be enough of an improvement: first, it would break with, for example, Snowflake, and second, it would require a deprecation. I guess the latter point, given we are 0.x and can easily support this change, is not too big of a problem, but I'm hoping a better name comes up over time, and renaming again then would put users through another cycle of updating code.
My hesitation with with_cols or with_col is that it does not seem to be enough of an improvement ... would put users through another cycle of updating code.
True; it's likely to be one of the most widespread method calls too; that's a lot of search/replace without an obvious gain; probably not what you want right as polars feels like it's getting some real critical mass.
If one would ever rename with_columns, then perhaps a reasonable renaming would be mutate, which would bring it in line with the tidyverse in R.
If one would ever rename with_columns, then perhaps a reasonable renaming would be mutate, which would bring it in line with the tidyverse in R.
It's a very vague/ambiguous name though... like: what does it mutate and how? The only people who might stand a chance at guessing will be the R illuminati ;)
mutate would have the same issue as assign, in that it suggests the dataframe is modified rather than that a new dataframe is being returned. In that case, I would prefer assign over mutate.
True; it's likely to be one of the most widespread method calls too; that's a lot of search/replace without an obvious gain
The most used functions should get the shortest names because you type them so often, imo.

I am in favor of keeping one with_columns. I don't mind a few keystrokes to keep readable English names. I don't think it's worth a refactor. Once you are at with_c, a tab does the rest.
And in truth it does actually do a good job of telling you what's going to happen, in a way that assign or mutate just don't. You're going to get the DataFrame... with columns, and those columns are going to be the ones you give it; it's very clear.
Glad to see some backing today. Tough day 😅
I agree that in the absence of an equally clear, but shorter, name, we should not change. I thought with would be that one, but that won't work in combination with Python. To be sure, for me shorter names are not about keystrokes (with_c<tab> is easy enough), but about readability. Longer names, if not more informative than their shorter counterparts, make code harder to follow imo, as more reading is required, more line breaks are required, etc.
I have raised a draft PR for deprecating with_column, see https://github.com/pola-rs/polars/pull/6128.
Why not: create an alias .wc -> .with_columns (or .wcols or whatever short)?
I agree with both keystrokes and readability arguments.
Or another route is to give the option to extend the DataFrame methods.
(btw, my first comment on that project => huge thanks, polars is amazing)
Why not: create an alias .wc -> .with_columns (or .wcols or whatever short)
@cbzittoun: Aside from "wc" meaning toilet (seriously ;) it doesn't really offer a meaningful advantage, and it adds clutter to the API surface.
Or another route is to give the option to extend the DataFrame methods.
Now that is actually supported - as long as you do it in a custom namespace: https://pola-rs.github.io/polars/py-polars/html/reference/api.html
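For reference, a minimal sketch of that extension mechanism using the register_dataframe_namespace decorator from the linked API docs; the namespace name "shortcuts" and the method "wc" are purely illustrative:

import polars as pl

@pl.api.register_dataframe_namespace("shortcuts")
class Shortcuts:
    def __init__(self, df: pl.DataFrame):
        self._df = df

    def wc(self, *exprs):
        # Thin wrapper so df.shortcuts.wc(...) forwards to with_columns
        return self._df.with_columns(list(exprs))

df = pl.DataFrame({"a": [1, 2, 3]})
df.shortcuts.wc(pl.col("a") * 2, (pl.col("a") + 1).alias("b"))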
ahah yes but there is no reason the bathroom namespace should apply here
thanks for the other solution, that's useful
I've a small suggestion (that could break things, though) about with_columns. Currently, this method accepts a single parameter that is either an expression or a list of expressions (i.e., df.with_columns([..., ...]) to create two new columns). I propose to use *args instead, so that df.with_columns([..., ...]) can be written df.with_columns(..., ...). This has several advantages:
(1) It does not require explicitly creating a list of expressions (and avoids writing the extra [ and ]);
(2) It supports creating either a single new column or many at once (so the API for a single column won't change, and adding more columns just requires providing more parameters instead of converting the parameter to a list and extending it);
(3) It's probably more Pythonic ;-)
(4) Support for named expressions is really interesting (and I hope it will be enabled by default soon) since it makes the names of the new columns visible before their definition, as in pandas' .assign (using .alias means that the name can only be found at the end of the expression). By using *args instead of a list in with_columns, we increase consistency, since one can easily go from a **kwargs-based call to an *args-based call and vice versa:
df.with_columns(
expr1,
expr2,
expr3,
)
can be easily converted to/from:
df.with_columns(
a=expr1,
b=expr2,
c=expr3,
)
I also like the idea of pl.Config.with_columns_kwargs = True being enabled by default. When using the black code formatter, the code "expands". I'm not familiar with the reasoning for using lists, or whether changing it to **kwargs-based calls would be worth it, though.
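For readers following the thread, a minimal sketch of the kwargs-based calling pattern being discussed, assuming the experimental pl.Config.with_columns_kwargs flag mentioned above is enabled:

import polars as pl

# Experimental at the time of this discussion: opt in to kwargs support
pl.Config.with_columns_kwargs = True

df = pl.DataFrame({"a": [1, 2, 3]})
df.with_columns(
    b=pl.col("a") * 2,   # named expression: result column is "b"
    c=pl.col("a") + 10,  # named expression: result column is "c"
)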
I'm not suggesting going for a full (and only) kwargs-based API, but for a combination of *args and **kwargs. In particular, I think it would make more sense to use *args instead of a list. But for consistency, that means that select, filter, col, etc. should probably be "converted" to receive *args instead of a list, which may lead to backward incompatibilities...
@AlexandreDecan: the ergonomic downside with the *args style is passing an existing (or programmatically-generated) list of expressions into the method; you would be amazed at how easy it is for people to forget to splat them in (eg: with_columns(*expressions)) and then wander all over the code wondering where the error is.
I've designed significant APIs both ways in the past, but that tends to take users a while to really ingrain (which isn't to say I don't like it, as I currently have an API using this style at work - however, it was designed to be that way from the start, and is consistent across all interfaces). Without a compelling advantage (vs some minor niceties), I doubt we'd want to make a breaking change like that across multiple commonly-used methods without something/someone really driving it...
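To make the splat issue concrete, a small sketch assuming a hypothetical pure-*args signature for with_columns:

exprs = [pl.col("a") * 2, (pl.col("a") + 1).alias("b")]

# Correct: unpack the pre-built list into positional arguments
df.with_columns(*exprs)

# Easy mistake: forgetting the star passes the whole list as one argument,
# which a pure-*args API would reject (or misinterpret) somewhere downstream
df.with_columns(exprs)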
Thanks for your answer! I understand the rationale. But in that case, I would be in favour, for consistency, of not allowing the use of **kwargs and instead allowing a dictionary to be passed (rather than a list) to support named expressions, i.e., df.with_columns({'a': expr1, 'b': expr2, 'c': expr3}) instead of df.with_columns(a=expr1, b=expr2, c=expr3).

That way, the number of parameters is always exactly 1, and there's a parallel between passing a list of unnamed expressions and passing a dict of named expressions (this parallel could have been achieved by using *args and **kwargs, but I understand you prefer to avoid them).
Thanks for your answer! I understand the rationale. But in that case, I would be in favour, for consistency, of not allowing the use of **kwargs and instead allowing a dictionary to be passed (rather than a list) to support named expressions, i.e., ...
That would be very unergonomic - there is no reason for **kwargs-style calls (an extremely common calling convention in Python) to be squeezed into a single parameter as a dict; I'd argue that would lose the reason for supporting this mode in the first place, which is primarily about how simple/pythonic it is ;)
Perhaps less ergonomic, but consistent with your choice for not supporting *args. Anyway, I prefer to have **kwargs rather than nothing :-p

Apart from this, and going back to your previous message in which you said "the ergonomic downside with *args style is passing in an existing (or programmatically-generated) list of expressions into the method", what about supporting all these use cases? For example:
def with_columns(self, *args, **kwargs):
    # Accept a single list of expressions (the current behaviour),
    # expressions passed as positional arguments, and named expressions
    # passed as keyword arguments.
    if len(args) == 1 and isinstance(args[0], list):
        args = args[0]
    ...
consistent with your choice for not supporting *args
@AlexandreDecan: I admit I still don't see the parallel, but it seems we may be going for it after all! Note that I was pointing out that it isn't a free lunch, not that it isn't worthwhile - see here for details of the proposal :)
Good news :-)
Coming back to the naming of with_columns: on Discord there was the suggestion select_with, which emphasizes that with_columns(new_expr) is equivalent to select([pl.all(), new_expr]). I also like that it no longer has columns in the name; we also do not have filter_rows but filter, and select rather than select_columns, etc. The downside is that with_columns is somewhat established, but I think overall this is an improvement. Any thoughts?
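For concreteness, the equivalence being referred to; note it only holds when the expression produces a column name that does not already exist (see the duplicate-name caveat further down):

new_expr = (pl.col("a") * 2).alias("a_doubled")

df.with_columns(new_expr)
# ...produces the same result as keeping every existing column
# and appending the new expression:
df.select([pl.all(), new_expr])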
It doesn't really highlight that the dataframe is returned as-is with new columns (i.e. there is no projection).
I really don't think we should remove with_columns. The expressions are added to the DataFrame as new columns. So I think the name is perfect.
This consolidation was completed for the 0.16 release, with a deprecation period via a transparent @redirect. Prepare to autocomplete that extra "s", there is now just one method ;) 👍
Thanks @zundertj for bringing up this topic here. I was the one who recommended select_with on Discord, so I'm adding my comment here.
It's obvious to me that with_columns adds new columns, but I think its equivalence to select(pl.all()) doesn't contribute to a simple and consistent terminology.
The latest move from with_column to with_columns shows a desire to add new keywords sparingly rather than for convenience. I think this goal can be applied further: either to select with an arg to select all (perhaps very messy), or by using select_with (which becomes a polars-specific keyword, but one that appears beside select in the API).
Since there's the extensible API, perhaps another approach that doesn't remove the conveniences is to move with_columns into a utility 'crate' to mimic the myriad functions that motivated the introduction of with_column and with_columns in the first place. It would only be used by those coming from pyspark-ish environments, and I imagine they'd want more than just with_columns.
Seeing the pyspark API for select, selectExpr, withColumn, withColumns, withColumnRenamed, I'm thankful that polars' with_column is being obviated by with_columns.

Looking forward to any feedback.
with_columns is not only used to add new columns, but also to replace existing ones, so with pl.all, you would get duplicated column name errors.
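To illustrate why the equivalence breaks down when replacing a column, a small sketch assuming the usual polars behaviour of erroring on duplicate output names:

df = pl.DataFrame({"a": [1, 2, 3]})

# Replacing an existing column is fine with with_columns:
df.with_columns(pl.col("a") * 2)

# ...but the select(pl.all(), ...) formulation would yield two columns
# named "a", which raises a duplicate column name error.
df.select([pl.all(), pl.col("a") * 2])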
The same applies to with_columns and groupby, so to me this point doesn't dissuade from cleaning up the API. In fact, the SQL standard doesn't disallow using the same alias; the following is valid SQL: SELECT designation, hired_on AS designation FROM employees. This speaks to me as though polars could evolve to more closely match the robust and familiar SQL specification.
Although it should be noted/discussed separately from this thread, one could argue that polars' query planning should accommodate aliases created within a select/with_columns/agg expression for reuse within another array member. The semantics of an array of expressions are right now focused on parallelism. This forces users to manually create the 'intermediate nodes' in the query plan by chaining select/with_columns expressions only after the alias is declared, making the expressions longer and thus less concise.
I like with_columns. It's not a big deviation from what we've been using for the past year when editing one column. Also, with_column (singular) secretly let you operate on multiple columns if you had multiple columns in your pl.col(pl.Float32) (for example), which always seemed wrong/a mistake to me. I didn't create an issue because it seemed like a can of worms, which has now thankfully been resolved by the current issue.
I do want to add a grievance I currently have about with_columns, one that I swore I wrote about elsewhere but can't find anywhere: it's not obvious when a column will get replaced and when a new one will be created. If I do:
df.with_columns(col('a')*col('b'))
The default behavior is to modify column a because it happens to come first, despite commutativity. If I call:
when(col('a') < 3).then(col('a')).otherwise(col('b'))
I get a modified again. But if I say:
when(col('b') < 9).then(col('a')).otherwise(col('b'))
...column a still gets modified! Is it because it's the first then?
This is super confusing and not obvious at all.
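A small illustration of how to sidestep the implicit naming described above, assuming you want to be explicit about which column is written; the names "product" and "picked" are purely illustrative:

df = pl.DataFrame({"a": [1, 2], "b": [3, 4]})

# Implicit: the output name comes from the first column referenced ("a"),
# so column "a" is replaced.
df.with_columns(pl.col("a") * pl.col("b"))

# Explicit: .alias() makes the target column unambiguous.
df.with_columns((pl.col("a") * pl.col("b")).alias("product"))
df.with_columns(
    pl.when(pl.col("b") < 9).then(pl.col("a")).otherwise(pl.col("b")).alias("picked")
)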
Problem description
Initial discussion was on Discord, moving here so everyone can join.
The DataFrame methods with_column and with_columns tend to be used quite often, as these are, together with select, the primary way to run expressions on dataframes. I have a couple of issues with these though:

1. with_columns allows everything with_column does, except checking that only one expression is passed in. Thus it seems to me there is very marginal benefit in having two methods. Marginal, because I don't think there is ever confusion about whether you pass in one or multiple expressions. with_column also does not support the (experimental) assignment syntax currently.

2. The docstrings are not clear about what is copied and what is not (per Ritchie's comment on Discord that they return a new dataframe). with_column: "Return a new DataFrame with the column added or replaced." with_columns: "Add or overwrite multiple columns in a DataFrame." The latter suggests there is no new dataframe created, but there is.

3. with_columns is quite verbose given how often it is used. We have also opted to use pl.col and not pl.column, and pl.lit rather than pl.literal. My initial proposal was assign, but @ritchie46's point is that this may give the impression that no new dataframe is being returned, while there is. So my new proposal would be to shorten to with, except that with is a reserved keyword in Python. So maybe with_col, but that seems quite a marginal gain?

Putting it altogether:
1. Deprecate with_column.
2. Rename with_columns to with_col?
3. Update the docstrings of with_columns/with_col to reflect that a new dataframe is being created.
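As a small illustration of the "new dataframe" point in the docstring item above (a minimal sketch of the behaviour described in this thread):

import polars as pl

df = pl.DataFrame({"a": [1, 2, 3]})
out = df.with_columns((pl.col("a") * 2).alias("b"))

print(df.columns)   # ['a']       - the original frame is untouched
print(out.columns)  # ['a', 'b']  - a new DataFrame is returned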