pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
30.26k stars 1.96k forks source link

Rationalize with_column & with_columns #6117

Closed zundertj closed 1 year ago

zundertj commented 1 year ago

Problem description

Initial discussion was on Discord, moving here so everyone can join.

The DataFrame methods with_column and with_columns tend to be used quite often, as these are, together with select, the primary way to run expressions on dataframes. I have a couple of issues with these though:

  1. with_columns allows everything with_column does, except checking that only one expression is passed in. Thus it seems to me very marginal benefit to have two methods. Marginal because I dont think there is ever confusion on whether you pass in one or multiple expressions.

  2. with_column also does not support the (experimental) assignment syntax currently

    pl.Config.with_columns_kwargs = True
    df.with_columns(b=pl.col("a") * 2)  # works
    df.with_column(b=pl.col("a") * 2)  # fails
  3. (per Ritchie's comment on Discord that they return a new dataframe): the docstrings are not clear what is copied and what not:

  1. I find with_columns quite verbose given how often it is used. We have also opted to use pl.col and not pl.column, and pl.lit rather than pl.literal. My initial proposal was assign, but @ritchie46 's point is that this may give the impression no new dataframe is being returned, while there is. So my new proposal would be to shorten to with, except that with is a reserved keyword In Python. So maybe with_col, but that seems quite a marginal gain?

Putting it altogether:

  1. deprecate with_column
  2. rename with_columns to with_col?
  3. fix the docstring of with_columns/with_col to reflect that a new dataframe is being created
ritchie46 commented 1 year ago

I don't think we can. with is a python keyword.

zundertj commented 1 year ago

Yep, you are right, that wont fly. Updated the post to reflect that.

alexander-beedie commented 1 year ago

Dropping with_column in favour of with_columns is definitely a good move; in the same vein as the other abbreviations how about shortening to with_cols? Everyone's muscle memory (and autocomplete) will be halfway there already... ;)

Note: passing more than one column to with_column would look really odd, whereas passing one or more to with_columns would look pretty normal, hence favouring deprecation of the singular form.

Also, let's make the currently-experimental calling pattern allowed by default? It has proven really popular over here, and I shudder to think what would happen if it went away...😅

zundertj commented 1 year ago

Note that with_columns already supports the singular form, so it is not just normal, it is the current behaviour.

So I think we have:

  1. deprecate with_column
  2. fix docstring with_columns => #6122
  3. make experimental calling pattern the default; remove flag.

My hesitation with with_cols or with_col is that it does not seem to be enough of an improvement, first to break with for example Snowflake, and second to do the deprecation. I guess the latter point, given we are 0.x and can easily support this change, is not too big of a problem, but I guess I'm hoping for a better name to come up over time, which would put users through another cycle of updating code.

alexander-beedie commented 1 year ago

My hesitation with with_cols or with_col is that it does not seem to be enough of an improvement ... would put users through another cycle of updating code.

True; it's likely to be one of the most widespread method calls too; that's a lot of search/replace without an obvious gain; probably not what you want right as polars feels like it's getting some real critical mass.

pjj commented 1 year ago

If one would ever rename with_columns, then perhaps a reasonable renaming would be mutate, which would bring it in line with tidyverse in R.

alexander-beedie commented 1 year ago

If one would ever rename with_columns, then perhaps a reasonable renaming would be mutate, which would bring it in line with tidyverse in R.

It's a very vague/ambiguous name though... like: what does it mutate and how? The only people who might stand a chance at guessing will be the R illuminati ;)

zundertj commented 1 year ago

mutate would have the same issue as assign, that it suggests the dataframe is modified rather than that a new dataframe is being returned. In that case, I would prefer assign over mutate.

Marmeladenbrot commented 1 year ago

True; it's likely to be one of the most widespread method calls too; that's a lot of search/replace without an obvious gain

The most used functions should get the shortest names because you type it so often imo.

ritchie46 commented 1 year ago

I am in favor of keeping one with_columns. I don't mind a few keystrokes to keep readable english names.

I don't think it's worth a refactor. Once you are at with_c a tab does the rest.

alexander-beedie commented 1 year ago

And in truth it does actually do a good job of telling you what's going to happen, in a way that assign or mutate just don't. You're going to get the DataFrame... with columns, and those columns are going to be the ones you give it; it's very clear.

ritchie46 commented 1 year ago

Glad to see some backings today. Tough day😅

zundertj commented 1 year ago

I agree that in absence of an equally clear, but shorter name, we should not change. I thought with would be that one, but that won't work in combination with Python. To be sure, for me shorter names are not about keystrokes (with_c<tab> is easy enough), but about readability. Longer names, if not more informative over their shorter counterpart, make imo code harder to follow, as more reading is required, more line breaks are required, etc.

I have raised a draft PR for deprecating with_column, see https://github.com/pola-rs/polars/pull/6128.

cbzittoun commented 1 year ago

Why not:

I agree with both keystrokes and readability arguments.

Or another route is to give the option to extend the DataFrame methods.

(btw, my first comment on that project => huge thanks, polars is amazing)

alexander-beedie commented 1 year ago

Why not/ create an alias .wc -> .with_columns (or .wcols or whatever short)

@cbzittoun: Aside from "wc" meaning toilet (seriously ;) it doesn't really offer a meaningful advantage, and it adds clutter to the API surface.

Or another route is to give the option to extend the DataFrame methods.

Now that is actually supported - as long as you do it in a custom namespace: \ https://pola-rs.github.io/polars/py-polars/html/reference/api.html

cbzittoun commented 1 year ago

ahah yes but there is no reason the bathroom namespace should apply here

thanks for the the other solution, that's useful

AlexandreDecan commented 1 year ago

I've a small suggestion (that could break things, though) about with_columns. Currently, this method accepts a single parameter that is either an expression or a list of expressions (i.e., df.with_columns([..., ...]) to create two new columns). I propose to use *args instead so that df.with_columns([..., ...]) can be written df.with_columns(..., ...). This has several advantages:

(1) It does not require to explicitly create a list of expressions (and prevents writing extra [ and ]); (2) It supports both the creation of a single new column or many at once (so the API for a single column won't change, and adding more columns just requires providing more parameters instead of converting the parameter to a list and extend it); (3) It's probably more Pythonic ;-) (4) Support for named expressions is really interesting (and I hope it will be enabled by default soon) since it makes the names of the new columns visible before their definition, as in pandas' .assign (using .alias means that the name can only be found at the end of the expression). By using *args instead of a list in with_columns, we increase the consistency since one can easily go from a **kwargs-based call to an *args-based call and vice-versa:

df.with_columns(
  expr1, 
  expr2, 
  expr3,
)

can be easily converted to/from:

df.with_columns(
   a=expr1, 
   b=expr2, 
   c=expr3,
)
arturdaraujo commented 1 year ago

I also like the idea of pl.Config.with_columns_kwargs = True being enabled by default. When using black code format the code "expands". I'm not familiar with the reasoning for using lists or if changing it to **kwargs-based would be worth it, though.

AlexandreDecan commented 1 year ago

I'm not suggesting to go for a full (and only) kwargs-based api but for a combination of *args and *kwargs. In particular I think it would make more sense to use args instead of a list. But for consistency, that means that select, filter, col, etc should probably be "converted" to receive *args instead of a list, which may lead to backward incompatibilities...

alexander-beedie commented 1 year ago

@AlexandreDecan: the ergonomic downside with *args style is passing in an existing (or programmatically-generated) list of expressions into the method; you would be amazed at how easy it is for people to forget to splat them in (eg: with_columns(*expressions)) and then wander all over the code wondering where the error is.

I've designed significant APIs both ways in the past, but that tends to take users a while to really ingrain (which isn't to say I don't like it, as I currently have an API using this style at work - however, it was designed to be that way from the start, and is consistent across all interfaces). Without a compelling advantage (vs some minor niceties), I doubt we'd want to make a breaking change like that across multiple commonly-used methods without something/someone really driving it...

AlexandreDecan commented 1 year ago

Thanks for your answer! I understand the rationale. But in that case, I would be in favour, for consistency, not to allow the use of **kwargs and rather to allow to pass a dictionary instead of a list to support named expressions, i.e., df.with_columns({'a': expr1, 'b': expr2, 'c': expr3}) instead of df.with_coumns(a=expr1, b=expr2, c=expr3).

That way, the number of parameters is always exactly 1, and there's a parallel between passing a list of unnamed expressions and passing a dict of named expressions (this parallel could have been achieved by using *args and **kwargs, but I understand you prefer to avoid them).

alexander-beedie commented 1 year ago

Thanks for your answer! I understand the rationale. But in that case, I would be in favour, for consistency, not to allow the use of **kwargs and rather to allow to pass a dictionary instead of a list to support named expressions, i.e.,

That would be very unergonomic - there is no reason for **kwargs style calls (an extremely common calling convention in python) to squeeze into a single parameter as a dict; I'd argue that would lose the reason for supporting this mode in the first place, which is primarily about how simple/pythonic it is ;)

AlexandreDecan commented 1 year ago

Perhaps less ergonomic, but consistent with your choice for not supporting *args. Anyway, I prefer to have **kwargs rather than nothing :-p

Apart from this, and going back to your previous message in which you said "the ergonomic downside with *args style is passing in an existing (or programmatically-generated) list of expressions into the method;", what about supporting all these uses cases? For example:

def with_columns(self, *args, **kwargs):
  if len(args) == 1:
    if isinstance(args[0], list):
      args = args[0]
   ...
alexander-beedie commented 1 year ago

consistent with your choice for not supporting *args

@AlexandreDecan: I admit I still don't see the parallel, but it seems we may be going for it after all! Note that I was pointing out that it isn't a free lunch, not that it isn't worthwhile - see here for details of the proposal :)

AlexandreDecan commented 1 year ago

Good news :-)

zundertj commented 1 year ago

Coming back to the naming of with_columns, on Discord there was the suggestion select_with, which emphasizes that with_columns(new_expr) is equivalent to .select([pl.all(), new_expr])). I also like that it no longer has the columns in the name, we also do not have filter_rows but filter, select rather than select_columns, etc. Downside is that with_columns is somewhat established, but I think overall this is an improvement. Any thoughts?

AlexandreDecan commented 1 year ago

It doesn't really highlight that the data frame is returned as-is with new columns (ie there is no projection).

ritchie46 commented 1 year ago

I really don't think we should remove with_columns. The expressions are added to the DataFrame as new columns. So I think the name is perfect.

alexander-beedie commented 1 year ago

This consolidation was completed for the 0.16 release, with a deprecation period via transparent @redirect. Prepare to autocomplete that extra "s" , there is now just one method ;) 👍

shahankhatch commented 1 year ago

Thanks @zundertj for bringing up this topic here. I was the one who recommended select_with on Discord, so I'm adding my comment here.

It's obvious to me that with_columns adds new columns but I think its equivalence to select(pl.all()) doesn't contribute to a simple and consistent terminology.

The latest move from with_column to with_columns shows a desire to add new keywords sparingly rather than for convenience. I think this goal can be applied further to select with an arg to select all (perhaps v messy,), or to use select_with (which becomes a polars-specific keyword but which appears beside the select in the api).

Since there's the extensible api, perhaps another approach that doesn't remove the conveniences is to move with_columns into a utility 'crate' to mimic the myriad functions that motivated the introduction of with_column and with_columns in the first place. It would only be used by those coming from pyspark-ish environments, and I imagine they'd want more than just with_columns.

Seeing the pyspark api for select, selectExpr, withColumn, withColumns, withColumnRenamed, I'm thankful that polars with_column is being obviated by with_columns

Looking forward to any feedback

ghuls commented 1 year ago

with_columns is not only used to add new columns, but also to replace existing ones, so with pl.all, you would get duplicated column name errors.

shahankhatch commented 1 year ago

The same applies to with_columns and groupby, so to me this point doesn't dissuade from cleaning up the api. In fact, the SQL standard doesn't disallow using the same alias. The following is valid SQL: SELECT designation, hired_on as designation FROM employees This speaks to me as though polars could evolve to more closely match the robust and familiar SQL specification.

Although it should be noted/discussed separately from this thread, one could argue that polars query planning should accommodate aliases created within a select/with_columns/agg expression for reuse within another array member. The semantics of an array of expressions right now is focused on parallelism. This forces users to manually create the 'intermediate nodes' in the query plan by chaining select/with_columns expressions only after the alias is declared, making the expressions longer and thus less concise.

mcrumiller commented 1 year ago

I like with_columns. It's not a big deviation from what we've been using for the past year when editing one column. Also, with_column (singular) secretly let you operate on multiple columns if you had multiple columns in your pl.col(pl.Float32) (for example), which always seemed wrong/a mistake to me. I didn't create an issue because it seemed like a can of worms which has now thankfully been resolved by the current issue.

I do want to add a grievance I have currently about with_columns that I swore I wrote about elsewhere but I can't find it anywhere: it's not obvious when a column will get replaced and when a new one will be created. If I do:

df.with_columns(col('a')*col('b'))

The default behavior is to modify column a because it happens to come first, despite commutativity. If I call

when(col('a') < 3).then(col('a')).otherwise(col('b'))

I get a modified again. But if I say:

when(col('b') < 9).then(col('a')).otherwise(col('b'))

...column a still gets modified! Is it because it's the first then?

This is super confusing and not obvious at all.