Open mcrumiller opened 1 year ago
I would welcome an implementation of float_range / float_ranges!
@stinodego thoughts on having a single pl.range / pl.ranges that covers both int and float? It could auto-infer the type based on the input arguments, and potentially have a dtype argument to override the auto-inferred type.
Also, would you be open to supporting pl.linspace as well, to match np.linspace? It's often convenient to specify the number of steps instead of the step size. It's also easy to make a mistake with the step-based form: for example, the OP got the implementation wrong! It should be something like arange(low, high + step / 2, step), not arange(low, high, step).
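To illustrate the endpoint mistake mentioned above, here is a minimal sketch. The helper float_arange is hypothetical (it is not a Polars or NumPy function); it only assumes an arange-style half-open range whose length is ceil((high - low) / step). The values are chosen to be exactly representable in binary so rounding does not interfere.

```python
import math


def float_arange(low, high, step):
    """Hypothetical arange-style helper: half-open range [low, high)."""
    n = max(0, math.ceil((high - low) / step))
    return [low + i * step for i in range(n)]


low, high, step = 0.0, 1.0, 0.25

# Passing `high` directly drops the endpoint.
exclusive = float_arange(low, high, step)

# Padding by half a step keeps `high` without adding an extra element.
inclusive = float_arange(low, high + step / 2, step)

print(exclusive)
print(inclusive)
```

The half-step padding only works when the step divides the interval evenly; otherwise the last element still falls short of high.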
thoughts on having a single pl.range / pl.ranges that covers both int and float? It could auto-infer the type based on the input arguments, and potentially have a dtype argument to override the auto-inferred type.
This might happen in the future. For now, we have specialized ranges for each type.
Not sure about supporting an equivalent for linspace for now. You can relatively easily write your own implementation using a float_range. Maybe in the future.
After discussing some more with @stinodego, we've agreed that float_range is problematic with regard to endpoint handling and float precision, so a linspace-style function would be better suited.
So for the moment the proposed design is
pl.linspace(start, stop, samples, closed="both" | "left" | "right")
samples indicates the number of values returned, similar to np.linspace. closed is a small generalization of numpy's endpoint=True parameter, best explained by example:
pl.linspace(0, 1, 4, closed="both") -> [0.0, 0.333..., 0.666..., 1.0]
pl.linspace(0, 1, 4, closed="left") -> [0.0, 0.25, 0.5, 0.75]
pl.linspace(0, 1, 4, closed="right") -> [0.25, 0.5, 0.75, 1.0]
This function will also support inverted ranges:
pl.linspace(1, 0, 4, closed="both") -> [1.0, 0.666..., 0.333..., 0.0]
pl.linspace(1, 0, 4, closed="left") -> [1.0, 0.75, 0.5, 0.25]
pl.linspace(1, 0, 4, closed="right") -> [0.75, 0.5, 0.25, 0.0]
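The closed semantics in the examples above can be sketched in plain Python. This is only an illustration of the proposed behaviour, not the Polars implementation: the function below is a stand-in written for this thread, and the handling of samples == 1 with closed="both" (returning just the start) is an assumption.

```python
def linspace(start, stop, samples, closed="both"):
    """Sketch of the proposed semantics; always returns floats."""
    if closed not in ("both", "left", "right"):
        raise ValueError(f"unsupported closed value: {closed!r}")
    if samples < 0:
        raise ValueError("samples must be non-negative")

    span = stop - start
    if closed == "both":
        if samples == 1:
            return [float(start)]  # assumption: a single sample is the start
        # Both endpoints included: samples - 1 equal steps.
        return [start + span * i / (samples - 1) for i in range(samples)]
    if closed == "left":
        # Start included, stop excluded: `samples` equal steps, first `samples` points.
        return [start + span * i / samples for i in range(samples)]
    # closed == "right": start excluded, stop included.
    return [start + span * i / samples for i in range(1, samples + 1)]


print(linspace(0, 1, 4, closed="left"))
print(linspace(1, 0, 4, closed="right"))
```

Because the step is derived from the number of samples, the output length is always exactly samples, including for inverted ranges.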
Finally, the input dtypes allowed are all numeric types (albeit always with a float output), but also dates and times.
The only open question is the name of the function. We're not huge fans of the name linspace, as it clashes with the Polars naming policy, so we're open to brainstorming an alternative name. Options include (but are not limited to):
pl.interval_sample
pl.interval_range
pl.linear_sample
pl.linear_space
@orlp how about pl.grid? It's both simple and obvious. It also opens the door (if we want) to creating, say, an N-D grid via a struct or array:
>>> pl.grid(0, 1, 4, closed=Both) -> [0.0, 0.333..., 0.666..., 1.0]
shape: (4,)
Series: '' [f64]
[
0.0
0.333333
0.666667
1.0
]
>>> pl.grid([0, 4], [2, 0], [4, 3], closed=Both)
shape: (12,)
Series: '' [struct[2]]
[
{0.0,2}
{0.0,1}
{0.0,0}
{0.333333,2}
{0.333333,1}
{0.333333,0}
{0.666667,2}
{0.666667,1}
{0.666667,0}
{1.0,2}
{1.0,1}
{1.0,0}
]
@mcrumiller We did consider grid but didn't like its 2D implication. We're not sure if we want a 2D (or N-D) version at this time.
@orlp makes sense, and for N-D behavior I would be in favor of the name meshgrid anyway, which is what Matlab calls it and makes it a bit more obvious. I will say the implementation would be pretty easy if you utilized our existing cross-join behavior.
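As a rough illustration of the cross-join idea, the 12-row output shown earlier can be reproduced with a Cartesian product. This sketch uses itertools.product from the standard library as a stand-in for a dataframe cross join; it is not Polars code, and the axis values are taken from the hypothetical pl.grid output above.

```python
from itertools import product

# Per-axis sample points, matching the struct output shown above.
x = [i / 3 for i in range(4)]  # 0, 1/3, 2/3, 1
y = [2, 1, 0]

# A cross join pairs every x with every y, x varying slowest.
grid = list(product(x, y))

for point in grid:
    print(point)
```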
I think linspace is the best name here, as it's fairly well-known, and having the closed parameter allows the use of float range behavior with ease.
I really like the use of the closed parameter!
There are a few limitations with the proposed solution:
pl.int_range() is only for integers, and pl.linspace() is only for floats. Sometimes you want linspace-like behavior for integers and vice versa, which is why np.arange() and np.linspace() support both integers and floats.
pl.int_range() would also benefit from having a closed argument. Currently it behaves like closed='left', but it's pretty common to want closed='both', for instance.
[It's inconsistent to have] int_range(), time_range(), date_range(), and datetime_range(), but not float_range().
[There's no] linspaces() being proposed, to go with int_ranges().
closed="none" isn't supported, which is inconsistent with time_range(), date_range(), datetime_range(), is_between(), etc. (Also, shouldn't it be "neither" instead of "none"?)
samples makes it sound like random sampling. np.linspace() uses num instead of samples; you could also call it n_steps.
My proposed solution would be to cover the behavior of pl.int_range() and pl.linspace() in a single function:
pl.range(start, stop=None, step=None, *, n_steps=None, closed="both" | "left" | "right" | "neither", dtype=None | PolarsNumericType)
pl.range(n) would be equivalent to pl.range(0, n), just like in Python. step and n_steps would be mutually exclusive, similar to how n and fraction are in Expr.sample. dtype would be auto-inferred (pl.Int64 if all arguments are ints, pl.Float64 if any argument is a float) or set to any numeric dtype. You could also have pl.ranges() similar to the current pl.int_ranges(). closed="none" would be renamed to closed="neither" for all polars functions and methods that support a closed parameter.
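The argument handling described in this proposal could look roughly like the sketch below. Everything here is hypothetical: resolve_range_args is an illustrative helper, the dtype names are returned as plain strings, and defaulting step to 1 when neither step nor n_steps is given is an assumption. It only covers the Python-style single-argument form, the mutual exclusivity check, and the dtype inference rule as stated in the proposal.

```python
def resolve_range_args(start, stop=None, step=None, *, n_steps=None):
    """Hypothetical helper: normalise arguments for a unified range function."""
    # range(n) behaves like range(0, n), as in Python.
    if stop is None:
        start, stop = 0, start

    # step and n_steps cannot both be given.
    if step is not None and n_steps is not None:
        raise ValueError("step and n_steps are mutually exclusive")
    if step is None and n_steps is None:
        step = 1  # assumption: default step when nothing else is specified

    # Int64 if all value arguments are ints, Float64 if any is a float.
    values = [v for v in (start, stop, step) if v is not None]
    dtype = "Float64" if any(isinstance(v, float) for v in values) else "Int64"

    return {"start": start, "stop": stop, "step": step,
            "n_steps": n_steps, "dtype": dtype}


print(resolve_range_args(5))
print(resolve_range_args(0, 1.0, n_steps=4))
```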
@Wainberg
pl.linspace would also accept int inputs, but its outputs would always be floats (or datetimes/times/durations for those relevant types).
int_range could use the closed argument but that's for a different issue.
It may be inconsistent, but float_range is just a highly problematic function due to the rounding errors introduced by IEEE 754 floating point. E.g. float_range(0, 0.9, 0.1) would result in [0, 0.1, ..., 0.8], but float_range(0, 0.9, 0.3) would result in [0, 0.3, 0.6, 0.8999999999999999], because in floating point arithmetic 0.1 * 9 >= 0.9 but 0.3 * 3 < 0.9.
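The length instability can be reproduced with a short sketch. The function naive_float_range below is hypothetical and assumes one particular implementation strategy (emit start + i * step while the value is strictly below stop); other strategies fail on different inputs, but they have the same underlying rounding problem.

```python
def naive_float_range(start, stop, step):
    """Hypothetical half-open float range: emits start + i * step while < stop."""
    out = []
    i = 0
    while start + i * step < stop:
        out.append(start + i * step)
        i += 1
    return out


tenths = naive_float_range(0.0, 0.9, 0.1)
thirds = naive_float_range(0.0, 0.9, 0.3)

# 0.1 * 9 rounds to 0.9, so the endpoint is excluded as expected.
print(len(tenths), 0.1 * 9 >= 0.9)

# 0.3 * 3 rounds to just below 0.9, so an extra element sneaks in.
print(len(thirds), 0.3 * 3 < 0.9, thirds[-1])
```

Both calls describe a half-open range that should stop short of 0.9, yet only one of them does.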
Note that numpy itself also recommends against using arange for float steps: "When using a non-integer step, such as 0.1, it is often better to use numpy.linspace." It also has a complete warning block explaining how "The length of the output might not be numerically stable." In a columnar dataframe library where we expect columns to have equal lengths within a dataframe, that is a rather huge footgun.
pl.linspaces would be included.
[I'm not sure what] closed="none" could be for linearly spaced values (I agree w.r.t. neither vs none but not sure if it's worth changing).
[I don't like] n_steps because it's just not correct: in linspace(0, 1, 4) you take 3 steps to go from the start to the stop. And I don't think num is particularly descriptive.
I am not a huge fan of having a single function that covers both use-cases. The functions just do different things, especially with respect to their interpretation of closedness (for int_range the closedness only refers to the endpoints of the complete range, whereas for linspace it refers to how each sample should be interpreted). In general I think having arguments that are mutually exclusive with other arguments is poor design in Python. We should be removing cases where we do that, instead of adding more.
My 2 cents on this: there is the issue #7525 for adding the periods argument to the pl.date_range function. I wanted to point out that adding this argument gives you both "arange"-type and "linspace"-type behaviour. So whether this linspace is consolidated into a *_range function, or it is separate, it might also make sense to do the same with date ranges.
Note that date/datetimes have an integer representation, so the issues regarding floating points still stand.
You can draw parallels with the pandas.date_range function. In pandas.date_range you have four parameters, start, end, periods, and freq, and you must specify exactly 3, i.e. leave out one of them.
from datetime import datetime, timedelta
import polars as pl
import pandas as pd
start = datetime(2024, 1, 1)
end = datetime(2024, 1, 2)
periods = 3
freq = timedelta(hours=8)
# Combination 1: leave out `periods`
pd.date_range(start=start, end=end, freq=freq)
# Analogous to current `pl.int_range()` and `pl.datetime_range()`
pl.datetime_range(start=start, end=end, interval=freq, eager=True)
pl.int_range(start=0, end=10, step=1, eager=True)
# Combination 2: leave out `freq`
pd.date_range(start=start, end=end, periods=periods)
# Not implemented in polars, but this is np.linspace behaviour
# Combination 3: leave out `end`
pd.date_range(start=start, freq=freq, periods=periods)
# Best there is
pl.datetime_range(start=start, end=start + freq * periods, interval=freq, closed="left", eager=True)
pl.int_range(start=2, end=2 + 3 * 10, step=3, eager=True)
# Combination 4: leave out `start`
pd.date_range(end=end, freq=freq, periods=periods)
# Best there is
pl.datetime_range(start=end - freq * periods, end=end, interval=freq, closed="right", eager=True)
pl.int_range(start=31 - 3 * 5, end=31, step=3, eager=True)
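The "specify exactly three of four" rule can be sketched with only the standard library. The function date_points below is hypothetical (it is neither the pandas nor the Polars implementation) and assumes that the range includes both endpoints whenever they line up with the step.

```python
from datetime import datetime, timedelta


def date_points(start=None, end=None, periods=None, freq=None):
    """Hypothetical sketch: derive the points from any three of the four arguments."""
    if sum(v is None for v in (start, end, periods, freq)) != 1:
        raise ValueError("specify exactly three of start, end, periods, freq")

    if periods is None:  # arange-style: step through until end
        n = max(0, (end - start) // freq + 1)
        return [start + i * freq for i in range(n)]
    if freq is None:  # linspace-style: evenly spaced, both ends included
        if periods == 1:
            return [start]
        step = (end - start) / (periods - 1)
        return [start + i * step for i in range(periods)]
    if end is None:  # count forwards from start
        return [start + i * freq for i in range(periods)]
    # start is None: count backwards from end
    return [end - (periods - 1 - i) * freq for i in range(periods)]


start = datetime(2024, 1, 1)
end = datetime(2024, 1, 2)
freq = timedelta(hours=8)

print(date_points(start=start, end=end, freq=freq))
print(date_points(start=start, end=end, periods=4))
```

With these inputs (a 24 hour span, 8 hour frequency, 4 periods) all four combinations describe the same set of points.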
My proposed solution would be to cover the behavior of pl.int_range() and pl.linspace() in a single function:
Big +1 from me. This is how Julia's range function works by default and it's great.
range(start, stop, length)
range(start, stop; length, step)
range(start; length, stop, step)
range(;start, length, stop, step)
Construct a specialized array with evenly spaced elements and optimized storage (an
AbstractRange) from the arguments. Mathematically a range is uniquely determined by any three of
start, step, stop and length. Valid invocations of range are:
• Call range with any three of start, step, stop, length.
• Call range with two of start, stop, length. In this case step will be assumed to be
one. If both arguments are Integers, a UnitRange will be returned.
• Call range with one of stop or length. start and step will be assumed to be one.
[In Python we'd want start to default to 0]
Problem description
pl.arange() does not allow non-integer step sizes. This can be worked around, but having the option for non-integer endpoints and step sizes would be a nice feature. In the meantime, here is a workaround:
Not sure why np included 10.3, might be float rounding, but regardless: