Closed MarcoGorelli closed 7 months ago
I do think that having this function is convenient, but it should just work for LazyFrames
and desugar to something like you mentioned. That is, I think the function shouldn't be in-place.
Two more thoughts:
drop_at_idx
?Should insert_at_idx
allow a frame as input? Or is this getting too complex. I could see if you wanted to insert df2
somewhere into df1
you could have either:
# insert each column at a different location
df1 = df1.insert_at_idx(df2, idx=[2, 4, 3])
or
# insert the whole frame as columns 2, 3, and 4.
df1 = df1.insert_at_idx(df2, idx=2)
Thanks both for your comments!
Sure, no objections to keeping (and making it not-in-place and available for LazyFrame), although in my PR I've renamed it to insert_at_index
so that the deprecation warning can inform about the behaviour change (EDIT: closing my PR for now while I make sense of a few things, if anyone wants to take over please feel free to do so)
To be honest I think that accepting a DataFrame as well as argument introduces too much complexity
What about inserting before or after a particular column name? Tidyverse's mutate()
has .before
and .after
arguments:
.before, .after <tidy-select> Optionally, control where new columns should appear (the default is to add to the right hand side). See relocate() for more details.
relocate()
can even move blocks of columns at once!
Thanks for the suggestion - personally, I think it adds too much complexity, df.columns
and df.select
should be able to let you reorder your columns however you want
I'm having trouble seeing how, could you give an example?
# 5-column df
df.select(df.columns[i] for i in [2, 4, 1, 3, 0])
Oh sorry, I meant to insert series
into df
after a specific column "foo". You could do:
df.pipe(lambda df: df.insert_at_idx(df.columns.index('foo') + 1, series))
But that's long and clunky, only supports one column, and doesn't work on expressions.
Yeah, for that I'd probably write a helper function:
def insert_after(df, s, after_column):
idx = df.columns.index(after_column)
return df.select(
pl.col(df.columns[0:idx+1]),
s,
pl.col(df.columns[idx+1:])
)
Nice!
tbh this looks fine to me - the polars API already gives you the tools to do pretty much whatever customised insertion you want
I do think that having this function is convenient, but it should just work for LazyFrames and desugar to something like you mentioned
@orlp just checking - are you suggesting that it just be a convenience function on the Python side which does what I suggested, or that it needs implementing for the Rust API too?
Thinking about it again, I'm concerned about it opening up a rabbit hole of feature requests (like other customised insertion methods), so I'm leaning back towards the "just deprecate it" side. But, no strong opinion here
@MarcoGorelli @mcrumiller would there be interest in adding .before()
and .after()
methods on expressions? Then you could insert one or more columns after a particular column via:
df.with_columns(pl.col('foo', 'bar').mul(2).prefix('double_').after('baz'))
Or rearrange via:
df.with_columns(pl.col('foo', 'bar').after('baz'))
You could allow .before() and .after() to support integers as well, which would let you deprecate df.insert_at_idx(index, series)
and replace it with:
df.with_columns(series.after(index))
df.with_columns(series.after(index))
This is not how expressions work. They are meant as operators not position locators.
What would be the Pythonic (or whatever the polars equivalent is) way of doing this in a method chain? So far the most concise way I've found is:
df.pipe(lambda df: df.insert_at_idx(df.columns.index('column_to_insert_after') + 1, series))
and that's for one column, and doesn't work with expressions.
Can you use pipe and then @mcrumiller 's insert_after
function
That's fair, yeah. Though if we're talking about writing a utility function for it, would it be worth thinking about integrating it natively into polars? The tidyverse folks felt it was important enough to make .before
and .after
two of the only dedicated arguments to mutate()
, which is one of the most used functions in tidyverse:
mutate(
.data,
...,
.by = NULL,
.keep = c("all", "used", "unused", "none"),
.before = NULL,
.after = NULL
)
I'm happy to make a feature enhancement request, or drop the issue, whichever you'd prefer. I would say that if this were included in polars, I would use it very often!
As in, to add before
and after
to with_columns
? I think this could potentially break people's code if they're insert a column named 'before':
In [7]: df.with_columns(before='a')
Out[7]:
shape: (3, 3)
┌─────┬─────┬────────┐
│ a ┆ b ┆ before │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪════════╡
│ 1 ┆ 4 ┆ 1 │
│ 2 ┆ 5 ┆ 2 │
│ 3 ┆ 6 ┆ 3 │
└─────┴─────┴────────┘
Not saying it can't be done, just trying to weigh things up against each other
Oh yeah for sure, tidyverse presumably uses .before
and .after
as the argument names for exactly that reason, to avoid name clashes with columns named before and after.
I was thinking of .before()
and .after()
as expressions, notwithstanding Ritchie's point that position locators aren't meant to be expressions. Really any way of making this easy that works for multiple columns (and ideally also for expressions) would be a big deal for usability!
would be a big deal for usability!
Right, looks like it's time to ping Stijn "ergonomic API" de Gooijer @stijndegooijer - any thoughts on this?
I am thinking about this one. Is there a way we can mark it as unidiomatic. Indicating that it can be used, but is more a lower level method that can save an allocation when you want modify table.
It will almost always be slower if you need this method more than once.
While we're on it, I would really like a .replace(name)
as an alternative to .alias(name)
, or maybe replace_with_alias(old_name=new_name)
. I can make a separate feature request for this. I encounter this scenario a lot:
df = df.with_columns(
col(old_column).operations().alias(new_column)
).drop(old_column)
@mcrumiller can you keep that in a separate feature request. Then we can this one scoped to insert_at_idx.
@orlp just checking - are you suggesting that it just be a convenience function on the Python side which does what I suggested, or that it needs implementing for the Rust API too?
@mcrumiller Well, I'm hinting at a future direction for Polars which we don't have yet. Currently Polars doesn't really have a 'public Rust API', we just expose our internals directly. In the future we would want a Python and a Rust public API with feature parity that call/desugar to the internal API. The internal API would be significantly smaller / less magic.
@MarcoGorelli @stinodego just wanted to follow up on the idea of implementing some equivalent of tidyverse's widely used .before
and .after
arguments to dplyr::mutate()
, which insert one or more columns before or after a particular column.
One option I discuss here is to add Expr.before()
and Expr.after()
. But any way of making this easy that works for multiple columns (and ideally also for expressions) would be a big deal for usability!
This method has recently been renamed. It is now called DataFrame.insert_column
.
I think these types of methods can be useful. Not sure why it's in-place, though. I don't think we should really expose more similar methods on DataFrame
as preference should go to do things with expressions.
With regards to before
/after
: this would be a useful extension to with_columns
, where the current default is to insert at the end. I'd have to think about the proper API but I can definitely see it be useful.
I think the conclusion here for now is "do not deprecate". Though there have been some other ideas in this issue that deserve their own issue (in-place behavior, before/after inserts, ...). I'll close this for now.
[discussed in person at EuroScipy, just making an issue before it's forgotten]
DataFrame.insert_at_idx is a bit at odds with the rest of the polars API:
I'd suggest it just be deprecated - if users want this kind of functionality, they can easily do something like