Open deanm0000 opened 3 months ago
Coincidentally, the pl.row_index() issue received a bump earlier today.
Actually, not a coincidence: I saw the bump and it reminded me to ask for this default.
@ritchie46 is this one ok to do or do you have any objections?
This feels counter-intuitive to me: I would expect a range function called without specified bounds to raise.
Adding pl.row_index() (with the same implementation you propose) to the API would make more sense in my opinion.
EDIT: Furthermore, pl.int_range() could not be used as a join expression, since it doesn't return an elementwise expression (see https://github.com/pola-rs/polars/issues/12420#issuecomment-2247836534).
Chiming in, I think pl.index() is better than pl.row_index(). It's simpler, and all horizontal expressions use horizontal_, so the "row" part is unnecessary.
It also seems to be a super obvious operation and very common to use, so why is there any argument about it?
@mcrumiller I, for one, am not arguing against pl.index or pl.row_index or whatever other name it could have. It's just that those were already ruled out, so this is a middle ground.
@Oreilles Lots of things are counterintuitive until you get used to them. That said, I don't really share the sense that it's counterintuitive: it exists in a context that has a length, and that length is a natural default, so why not make it the default?
Intuitiveness certainly is subjective, but using pl.int_range() to generate an index has some drawbacks that would cripple its use case compared to an actual row_index function:
int_range doesn't reflect the intent in the way row_index or with_row_index does (would with_int_range make sense? Maybe, but arguably less intuitively).
int_range defaults to Int64 and index columns generated with with_row_index are UInt32, so they cannot be merged on by default.
int_range cannot be used as a join expression... which would likely be the primary use case for using it instead of with_row_index (see https://github.com/pola-rs/polars/issues/12420#issuecomment-2247836534)
@Oreilles your arguments seem to be for having another expression to generate a numeric index rather than against having int_range have defaults when no parameters are passed.
It already defaults to starting at 0 when only 1 parameter is passed so why not have it default the length parameter to the context length when no parameters are input?
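The defaulting rule being discussed can be sketched in plain Python. This is only a model of the proposal, not polars code: resolve_bounds and int_range_model are made-up names, and context_len stands in for what pl.len() would supply inside an expression context.

```python
from typing import List, Optional, Tuple


def resolve_bounds(
    start: Optional[int] = None,
    end: Optional[int] = None,
    *,
    context_len: int,
) -> Tuple[int, int]:
    """Hypothetical helper: pick (start, end) using the defaults from the thread."""
    if start is None and end is None:
        # Proposed behaviour: no arguments means 0..context_len.
        return 0, context_len
    if end is None:
        # Behaviour described in the thread: a single argument is the end,
        # and the start defaults to 0.
        return 0, start
    if start is None:
        # Only an explicit end was given; start still defaults to 0.
        return 0, end
    return start, end


def int_range_model(
    start: Optional[int] = None,
    end: Optional[int] = None,
    *,
    context_len: int,
) -> List[int]:
    """Hypothetical stand-in that materialises the resolved range as a list."""
    lo, hi = resolve_bounds(start, end, context_len=context_len)
    return list(range(lo, hi))


print(int_range_model(context_len=3))        # [0, 1, 2]
print(int_range_model(2, context_len=3))     # [0, 1]
print(int_range_model(1, 3, context_len=5))  # [1, 2]
```

The no-argument branch is the only new one. The other branches mirror how the thread describes the current behaviour, which is why the change is argued to be backward compatible.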
Description
It seems from questions in the discord and on SO that int_range is mostly used as an index so I think it'd be a little more convenient if we could drop typing in pl.len().
So if we have
df=pl.DataFrame({'a':[2,5,8]})
then instead of
df.with_columns(z=pl.int_range(pl.len()))
we could just do
df.with_columns(z=pl.int_range())
Of course in this specific example it'd be quicker to do
df.with_row_index('z')
but in the cases where we're using it in an over or whatever it's just a small QOL improvement. Since it currently raises when there are no inputs, there shouldn't be any backward compatibility issues with this change.
I can make a PR to do this, just want to make sure it's what people want.
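For readers less familiar with the windowed case, here is a plain-Python model of the values an index computed inside an over is meant to produce: a counter that restarts at 0 for each group while rows keep their original order. The function name index_within_groups is invented for illustration and is not part of polars. This is a sketch of the semantics, not of the implementation.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List


def index_within_groups(keys: Iterable[Hashable]) -> List[int]:
    """Hypothetical model: position of each row within its own group."""
    next_index: Dict[Hashable, int] = defaultdict(int)
    result: List[int] = []
    for key in keys:
        # Emit the running count for this group, then advance it.
        result.append(next_index[key])
        next_index[key] += 1
    return result


groups = ["x", "y", "x", "x", "y"]
print(index_within_groups(groups))  # [0, 0, 1, 2, 1]
```

In this setting with_row_index is not a substitute, because it numbers the whole frame rather than each group, which is where the shorter int_range spelling would save typing.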