Make `int_range()` and `int_ranges()` work with no inputs and default to `int_range(0, pl.len())`

pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust

https://docs.pola.rs

Other

30.19k stars 1.95k forks source link

Make `int_range()` and `int_ranges()` work with no inputs and default to `int_range(0, pl.len())` #17810

Open deanm0000 opened 3 months ago

deanm0000 commented 3 months ago

Description

It seems from questions in the discord and on SO that int_range is mostly used as an index so I think it'd be a little more convenient if we could drop typing in pl.len()

So if we have df=pl.DataFrame({'a':[2,5,8]})

then instead of df.with_columns(z=pl.int_range(pl.len())) we could just do df.with_columns(z=pl.int_range())

Of course in this specific example it'd be quicker to do df.with_row_index('z') but in the cases where we're using it in an over or whatever it's just a small QOL improvement. Since it currently raises when there are no inputs, there shouldn't be any backward compatibility issues with this change.

I can make a PR to do this, just want to make sure it's what people want.

cmdlineluser commented 3 months ago

Coincidentally, the pl.row_index() issue received a bump earlier today.

https://github.com/pola-rs/polars/issues/12420

deanm0000 commented 3 months ago

Actually, not a coincidence I saw the bump and it reminded me of asking for this default.

deanm0000 commented 3 months ago

@ritchie46 is this one ok to do or do you have any objections?

Oreilles commented 3 months ago

This feels counter-intuitive to me: I would expect a range function called without specified bounds to raise. Adding pl.row_index() (with the same implementation you propose) to the API would make more sense in my opinion.

EDIT: Furthermore, pl.int_range() could not be used as a join expression since it doesn't return an elementwise expression (see https://github.com/pola-rs/polars/issues/12420#issuecomment-2247836534)

mcrumiller commented 3 months ago

Chiming in, I think pl.index() is better than pl.row_index(). It's simpler, and all horizontal expressions use horizontal_, so the "row" part is unnecessary.

It also seems to be a super obvious operation and very common to use, why is there any argument about it?

deanm0000 commented 3 months ago

@mcrumiller I, for one, am not arguing against pl.index or pl.row_index or whatever other name it could have. It's just that those were already ruled out so this is a middle ground.

@Oreilles Lots of things are counterintuitive until you get used to them. That said, I don't really share that sense of it being counterintuitive since it exists in a context that has a length that is a natural default so why not make it the default?

Oreilles commented 3 months ago

Intuitiveness certainely is subjective, but using pl.int_range() to generate an index has some drawbacks that would cripple its use case compared to an actual row_index function.

int_range doesn't reflex the intent in the way row_index or with_row_index does (would with_int_range make sense ? Maybe, but arguably less intuitively).
int_range defaults to Int64 and index columns generated with with_row_index are UInt32, so they cannot be merged on by default.
int_range cannot be used as a join expression... which would likely be the primary use case for using it instead of with_row_index (see https://github.com/pola-rs/polars/issues/12420#issuecomment-2247836534)

deanm0000 commented 1 month ago

@Oreilles your arguments seem to be for having another expression to generate a numeric index rather than against having int_range have defaults when no parameters are passed.

It already defaults to starting at 0 when only 1 parameter is passed so why not have it default the length parameter to the context length when no parameters are input?