pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
27.22k stars 1.67k forks source link

Deprecate `DataFrame.insert_at_idx` #10528

Closed MarcoGorelli closed 7 months ago

MarcoGorelli commented 10 months ago

[discussed in person at EuroScipy, just making an issue before it's forgotten]

DataFrame.insert_at_idx is a bit at odds with the rest of the polars API:

I'd suggest it just be deprecated - if users want this kind of functionality, they can easily do something like

        columns = dataframe.columns
        new_columns = columns[:index] + [new_column_name] + columns[index:]
        dataframe = dataframe.with_columns(series.alias(new_column_name)).select(new_columns)
orlp commented 10 months ago

I do think that having this function is convenient, but it should just work for LazyFrames and desugar to something like you mentioned. That is, I think the function shouldn't be in-place.

mcrumiller commented 10 months ago

Two more thoughts:

  1. Should we also have a corresponding drop_at_idx?
  2. Should insert_at_idx allow a frame as input? Or is this getting too complex. I could see if you wanted to insert df2 somewhere into df1 you could have either:

    # insert each column at a different location
    df1 = df1.insert_at_idx(df2, idx=[2, 4, 3])

    or

    # insert the whole frame as columns 2, 3, and 4.
    df1 = df1.insert_at_idx(df2, idx=2)
MarcoGorelli commented 10 months ago

Thanks both for your comments!

Sure, no objections to keeping (and making it not-in-place and available for LazyFrame), although in my PR I've renamed it to insert_at_index so that the deprecation warning can inform about the behaviour change (EDIT: closing my PR for now while I make sense of a few things, if anyone wants to take over please feel free to do so)

To be honest I think that accepting a DataFrame as well as argument introduces too much complexity

Wainberg commented 10 months ago

What about inserting before or after a particular column name? Tidyverse's mutate() has .before and .after arguments:

.before, .after <tidy-select> Optionally, control where new columns should appear (the default is to add to the right hand side). See relocate() for more details.

relocate() can even move blocks of columns at once!

MarcoGorelli commented 10 months ago

Thanks for the suggestion - personally, I think it adds too much complexity, df.columns and df.select should be able to let you reorder your columns however you want

Wainberg commented 10 months ago

I'm having trouble seeing how, could you give an example?

mcrumiller commented 10 months ago
# 5-column df
df.select(df.columns[i] for i in  [2, 4, 1, 3, 0])
Wainberg commented 10 months ago

Oh sorry, I meant to insert series into df after a specific column "foo". You could do:

df.pipe(lambda df: df.insert_at_idx(df.columns.index('foo') + 1, series))

But that's long and clunky, only supports one column, and doesn't work on expressions.

mcrumiller commented 10 months ago

Yeah, for that I'd probably write a helper function:

def insert_after(df, s, after_column):
    idx = df.columns.index(after_column)
    return df.select(
        pl.col(df.columns[0:idx+1]),
        s,
        pl.col(df.columns[idx+1:])
    )
MarcoGorelli commented 10 months ago

Nice!

tbh this looks fine to me - the polars API already gives you the tools to do pretty much whatever customised insertion you want

I do think that having this function is convenient, but it should just work for LazyFrames and desugar to something like you mentioned

@orlp just checking - are you suggesting that it just be a convenience function on the Python side which does what I suggested, or that it needs implementing for the Rust API too?

Thinking about it again, I'm concerned about it opening up a rabbit hole of feature requests (like other customised insertion methods), so I'm leaning back towards the "just deprecate it" side. But, no strong opinion here

Wainberg commented 10 months ago

@MarcoGorelli @mcrumiller would there be interest in adding .before() and .after() methods on expressions? Then you could insert one or more columns after a particular column via:

df.with_columns(pl.col('foo', 'bar').mul(2).prefix('double_').after('baz'))

Or rearrange via:

df.with_columns(pl.col('foo', 'bar').after('baz'))

You could allow .before() and .after() to support integers as well, which would let you deprecate df.insert_at_idx(index, series) and replace it with:

df.with_columns(series.after(index))

ritchie46 commented 9 months ago

df.with_columns(series.after(index))

This is not how expressions work. They are meant as operators not position locators.

Wainberg commented 9 months ago

What would be the Pythonic (or whatever the polars equivalent is) way of doing this in a method chain? So far the most concise way I've found is:

df.pipe(lambda df: df.insert_at_idx(df.columns.index('column_to_insert_after') + 1, series))

and that's for one column, and doesn't work with expressions.

MarcoGorelli commented 9 months ago

Can you use pipe and then @mcrumiller 's insert_after function

Wainberg commented 9 months ago

That's fair, yeah. Though if we're talking about writing a utility function for it, would it be worth thinking about integrating it natively into polars? The tidyverse folks felt it was important enough to make .before and .after two of the only dedicated arguments to mutate(), which is one of the most used functions in tidyverse:

mutate(
  .data,
  ...,
  .by = NULL,
  .keep = c("all", "used", "unused", "none"),
  .before = NULL,
  .after = NULL
)

I'm happy to make a feature enhancement request, or drop the issue, whichever you'd prefer. I would say that if this were included in polars, I would use it very often!

MarcoGorelli commented 9 months ago

As in, to add before and after to with_columns? I think this could potentially break people's code if they're insert a column named 'before':

In [7]: df.with_columns(before='a')
Out[7]:
shape: (3, 3)
┌─────┬─────┬────────┐
│ a   ┆ b   ┆ before │
│ --- ┆ --- ┆ ---    │
│ i64 ┆ i64 ┆ i64    │
╞═════╪═════╪════════╡
│ 1   ┆ 4   ┆ 1      │
│ 2   ┆ 5   ┆ 2      │
│ 3   ┆ 6   ┆ 3      │
└─────┴─────┴────────┘

Not saying it can't be done, just trying to weigh things up against each other

Wainberg commented 9 months ago

Oh yeah for sure, tidyverse presumably uses .before and .after as the argument names for exactly that reason, to avoid name clashes with columns named before and after.

I was thinking of .before() and .after() as expressions, notwithstanding Ritchie's point that position locators aren't meant to be expressions. Really any way of making this easy that works for multiple columns (and ideally also for expressions) would be a big deal for usability!

MarcoGorelli commented 9 months ago

would be a big deal for usability!

Right, looks like it's time to ping Stijn "ergonomic API" de Gooijer @stijndegooijer - any thoughts on this?

ritchie46 commented 9 months ago

I am thinking about this one. Is there a way we can mark it as unidiomatic. Indicating that it can be used, but is more a lower level method that can save an allocation when you want modify table.

It will almost always be slower if you need this method more than once.

mcrumiller commented 9 months ago

While we're on it, I would really like a .replace(name) as an alternative to .alias(name), or maybe replace_with_alias(old_name=new_name). I can make a separate feature request for this. I encounter this scenario a lot:

df = df.with_columns(
    col(old_column).operations().alias(new_column)
).drop(old_column)
ritchie46 commented 9 months ago

@mcrumiller can you keep that in a separate feature request. Then we can this one scoped to insert_at_idx.

orlp commented 9 months ago

@orlp just checking - are you suggesting that it just be a convenience function on the Python side which does what I suggested, or that it needs implementing for the Rust API too?

@mcrumiller Well, I'm hinting at a future direction for Polars which we don't have yet. Currently Polars doesn't really have a 'public Rust API', we just expose our internals directly. In the future we would want a Python and a Rust public API with feature parity that call/desugar to the internal API. The internal API would be significantly smaller / less magic.

Wainberg commented 9 months ago

@MarcoGorelli @stinodego just wanted to follow up on the idea of implementing some equivalent of tidyverse's widely used .before and .after arguments to dplyr::mutate(), which insert one or more columns before or after a particular column.

One option I discuss here is to add Expr.before() and Expr.after(). But any way of making this easy that works for multiple columns (and ideally also for expressions) would be a big deal for usability!

stinodego commented 7 months ago

This method has recently been renamed. It is now called DataFrame.insert_column.

I think these types of methods can be useful. Not sure why it's in-place, though. I don't think we should really expose more similar methods on DataFrame as preference should go to do things with expressions.

With regards to before/after: this would be a useful extension to with_columns, where the current default is to insert at the end. I'd have to think about the proper API but I can definitely see it be useful.

I think the conclusion here for now is "do not deprecate". Though there have been some other ideas in this issue that deserve their own issue (in-place behavior, before/after inserts, ...). I'll close this for now.