pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Add np.linspace-style function #5255

Open mcrumiller opened 1 year ago

mcrumiller commented 1 year ago

Problem description

pl.arange() does not allow non-integer step sizes. This can be worked around, but having the option for non-integer endpoints and step sizes would be a nice feature. In the meantime, here is a workaround:

import polars as pl
import numpy as np
import math

low, high, step = 6.7, 10.3, 0.3

# numpy version
x_np = np.arange(low, high, step)

# polars version
def arange_float(high, low, step):
    return pl.arange(
        low=0,
        high=math.floor((high-low)/step),
        step=1,
        eager=True
    ).cast(pl.Float32)*step + low

def linspace(high, low, num_points):
    step = (high-low)/(num_points-1)
    return arange_float(high, low, step)

x_pl = arange_float(high, low, step)

print(x_np)
print(x_pl)

Not sure why np included 10.3, might be float rounding, but regardless:

[ 6.7  7.   7.3  7.6  7.9  8.2  8.5  8.8  9.1  9.4  9.7 10.  10.3]
shape: (12,)
Series: 'arange' [f32]
[
        6.7
        7.0
        7.3
        7.6
        7.9
        8.2
        8.5
        8.8
        9.1
        9.4
        9.7
        10.0
]
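
The extra 10.3 in the numpy output is consistent with float rounding. numpy documents the length of np.arange as ceil((stop - start) / step), while the workaround above uses floor. A minimal check with only the standard library, reusing the values from the snippet above:

```python
import math

low, high, step = 6.7, 10.3, 0.3

# Mathematically (high - low) / step is exactly 12, but in binary floating
# point the quotient lands slightly above 12.
ratio = (high - low) / step

# np.arange documents its output length as ceil((stop - start) / step).
numpy_style_length = math.ceil(ratio)

# The polars workaround above uses floor instead.
workaround_length = math.floor(ratio)

print(ratio > 12, numpy_style_length, workaround_length)
```

The two lengths differ by one, which matches the 13 values printed by numpy and the 12 values in the polars Series.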
stinodego commented 9 months ago

I would welcome an implementation of float_range / float_ranges!

Wainberg commented 9 months ago

@stinodego thoughts on having a single pl.range / pl.ranges that covers both int and float? It could auto-infer the type based on the input arguments, and potentially have a dtype argument to override the auto-inferred type.

Also, would you be open to supporting pl.linspace as well, to match np.linspace? It's often convenient to specify the number of steps instead of the step size. It's also easy to make a mistake - for example, the OP got the implementation wrong! It should be something like arange(low, high + step / 2, step), not arange(low, high, step).

stinodego commented 9 months ago

> thoughts on having a single pl.range / pl.ranges that covers both int and float? It could auto-infer the type based on the input arguments, and potentially have a dtype argument to override the auto-inferred type.

This might happen in the future. For now, we have specialized ranges for each type.

Not sure about supporting an equivalent for linspace for now. You can relatively easily write your own implementation using a float_range. Maybe in the future.

orlp commented 9 months ago

After discussing this some more with @stinodego, we've agreed that float_range is problematic with regard to endpoint handling and float precision, so a linspace-style function would be better suited.

So for the moment the proposed design is

pl.linspace(start, stop, samples, closed="both" | "left" | "right")

samples indicates the number of values returned, similar to np.linspace. closed is a small generalization of numpy's endpoint=True parameter, best explained by example:

pl.linspace(0, 1, 4, closed="both") -> [0.0, 0.333..., 0.666..., 1.0]
pl.linspace(0, 1, 4, closed="left") -> [0.0, 0.25, 0.5, 0.75]
pl.linspace(0, 1, 4, closed="right") -> [0.25, 0.5, 0.75, 1.0]

This function will also support inverted ranges:

pl.linspace(1, 0, 4, closed="both") -> [1.0, 0.666..., 0.333..., 0.0]
pl.linspace(1, 0, 4, closed="left") -> [1.0, 0.75, 0.5, 0.25]
pl.linspace(1, 0, 4, closed="right") -> [0.75, 0.5, 0.25, 0.0]

Finally, the input dtypes allowed are all numeric types (albeit always with a float output), but also dates and times.
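
The closed semantics in the examples above can be modelled in a few lines of plain Python. This is only a sketch: linspace_sketch is a hypothetical name, it returns a list instead of a Series, it handles numbers only (no dates or times), and it is not the Polars implementation.

```python
def linspace_sketch(start, stop, samples, closed="both"):
    """Hypothetical pure-Python model of the proposed closed semantics."""
    if samples < 1:
        raise ValueError("samples must be at least 1")
    if closed == "both":
        if samples == 1:
            return [float(start)]
        # Both endpoints are samples, so there are samples - 1 intervals.
        intervals, first = samples - 1, 0
    elif closed == "left":
        # The stop value is excluded: samples intervals, starting at index 0.
        intervals, first = samples, 0
    elif closed == "right":
        # The start value is excluded: samples intervals, starting at index 1.
        intervals, first = samples, 1
    else:
        raise ValueError(f"unsupported closed value: {closed!r}")
    span = stop - start  # negative for inverted ranges
    return [start + span * (first + i) / intervals for i in range(samples)]


print(linspace_sketch(0, 1, 4, closed="left"))
print(linspace_sketch(1, 0, 4, closed="right"))
```

Because each value is computed from its index rather than by repeated addition of a step, the output always has exactly `samples` elements.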

The only open question is the name of the function. We're not huge fans of the name linspace, as it clashes with the Polars naming policy, so we're open to brainstorming an alternative name. Options include (but are not limited to):

mcrumiller commented 9 months ago

@orlp how about pl.grid? It's both simple and obvious. It also opens up the door (if we want) to creating, say, an N-D grid via a struct or array:

Single dimension

>>> pl.grid(0, 1, 4, closed="both")
shape: (4,)
Series: '' [f64]
[
        0.0
        0.333333
        0.666667
        1.0
]

Two dimensions

>>> pl.grid([0, 4], [2, 0], [4, 3], closed="both")
shape: (12,)
Series: '' [struct[2]]
[
        {0.0,2}
        {0.0,1}
        {0.0,0}
        {0.333333,2}
        {0.333333,1}
        {0.333333,0}
        {0.666667,2}
        {0.666667,1}
        {0.666667,0}
        {1.0,2}
        {1.0,1}
        {1.0,0}
]
orlp commented 9 months ago

@mcrumiller We did consider grid but didn't like its 2D implication. We're not sure if we want a 2D (or N-D) version at this time.

mcrumiller commented 9 months ago

@orlp makes sense, and for N-D behavior I would be in favor of the name meshgrid anyway, which is what MATLAB calls it and which makes the purpose a bit more obvious. I will say the implementation would be pretty easy if you utilized our existing cross-join behavior.

I think linspace is the best name here, as it's fairly well-known, and having the closed parameter allows the use of float range behavior with ease.

Wainberg commented 9 months ago

I really like the use of the closed parameter!

There are a few limitations with the proposed solution:

  1. pl.int_range() is only for integers, and pl.linspace() is only for floats. Sometimes you want linspace-like behavior for integers and vice versa, which is why np.arange() and np.linspace() support both integers and floats.
  2. pl.int_range() would also benefit from having a closed argument. Currently it behaves like closed='left', but it's pretty common to want closed='both', for instance.
  3. It's inconsistent to have int_range(), time_range(), date_range(), and datetime_range(), but not float_range().
  4. There's no linspaces() being proposed, to go with int_ranges().
  5. closed="none" isn't supported, which is inconsistent with time_range(), date_range(), datetime_range(), is_between(), etc. (Also, shouldn't it be "neither" instead of "none"?)
  6. samples makes it sound like random sampling. np.linspace() uses num instead of samples; you could also call it n_steps.

My proposed solution would be to cover the behavior of pl.int_range() and pl.linspace() in a single function:

pl.range(start, stop=None, step=None, *, n_steps=None, closed="both" | "left" | "right" | "neither", 
         dtype=None | PolarsNumericType)

pl.range(n) would be equivalent to pl.range(0, n), just like in Python. step and n_steps would be mutually exclusive, similar to how n and fraction are in Expr.sample. dtype would be auto-inferred (pl.Int64 if all arguments are ints, pl.Float64 if any argument is a float) or set to any numeric dtype. You could also have pl.ranges() similar to the current pl.int_ranges(). closed="none" would be renamed to closed="neither" for all polars functions and methods that support a closed parameter.
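
The argument handling described in this proposal could look roughly like the sketch below. resolve_range_args is a hypothetical helper written only to illustrate the rules stated above (single-argument form, mutually exclusive step and n_steps, dtype inference); dtypes are represented as plain strings, and nothing here exists in Polars.

```python
def resolve_range_args(start, stop=None, step=None, *, n_steps=None):
    """Hypothetical normalisation of the proposed pl.range arguments."""
    if stop is None:
        # range(n) behaves like range(0, n), as in Python.
        start, stop = 0, start
    if step is not None and n_steps is not None:
        raise ValueError("step and n_steps are mutually exclusive")
    # Float64 if any supplied bound or step is a float, otherwise Int64.
    supplied = [v for v in (start, stop, step) if v is not None]
    dtype = "Float64" if any(isinstance(v, float) for v in supplied) else "Int64"
    return start, stop, step, n_steps, dtype


print(resolve_range_args(5))
print(resolve_range_args(0, 1.0, n_steps=4))
```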

orlp commented 9 months ago

@Wainberg

  1. pl.linspace would also accept int inputs, but its outputs would always be floats (or datetimes/times/durations for those relevant types).
  2. Perhaps int_range could use the closed argument but that's for a different issue.
  3. It may be inconsistent, but float_range is just a highly problematic function due to the rounding errors introduced by IEEE 754 floating point. E.g. float_range(0, 0.9, 0.1) would result in [0, 0.1, ..., 0.8], but float_range(0, 0.9, 0.3) would result in [0, 0.3, 0.6, 0.8999999999999999] because in floating point arithmetic 0.1 * 9 >= 0.9 but 0.3 * 3 < 0.9.

    Note that numpy itself also recommends against using arange with float steps: "When using a non-integer step, such as 0.1, it is often better to use numpy.linspace." It also has a complete warning block explaining how "The length of the output might not be numerically stable." In a columnar dataframe library, where we expect columns to have equal lengths within a dataframe, that is a rather huge footgun.

  4. I should have specified that, yes, pl.linspaces would be included.
  5. I don't have an interpretation of what closed="none" could be for linearly spaced values (I agree w.r.t. neither vs none but not sure if it's worth changing).
  6. Numpy also calls them samples: "Returns num evenly spaced samples". Not a huge fan of n_steps because it's just not correct: in linspace(0, 1, 4) you take 3 steps to go from the start to the stop. And I don't think num is particularly descriptive.
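
The rounding problem from point 3 can be reproduced with a naive half-open range in plain Python. naive_float_range is a hypothetical helper used only for this demonstration; it is not a proposed API.

```python
def naive_float_range(start, stop, step):
    """Half-open range built as start + i * step (hypothetical helper)."""
    values = []
    i = 0
    while start + i * step < stop:
        values.append(start + i * step)
        i += 1
    return values


tenths = naive_float_range(0, 0.9, 0.1)
thirds = naive_float_range(0, 0.9, 0.3)

# 0.1 * 9 rounds to 0.9, so the endpoint is excluded as expected.
print(len(tenths), tenths[-1])
# 0.3 * 3 rounds to just below 0.9, so an unwanted extra value appears.
print(len(thirds), thirds[-1])
```

Both calls ask for a range that stops before 0.9, yet only the first one actually excludes the endpoint, so the output length depends on rounding rather than on the arguments alone.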

I am not a huge fan of having a single function that covers both use-cases. The functions just do different things, especially with respect to their interpretation of closedness (for int_range the closedness only refers to the endpoints of the complete range, whereas for linspace it refers to how each sample should be interpreted). In general I think having arguments that are mutually exclusive with other arguments is poor design in Python. We should be removing cases where we do that, instead of adding more.

edavisau commented 7 months ago

My 2 cents on this: there is the issue #7525 for adding the periods argument to the pl.date_range function. I wanted to point out that adding this argument gives you both "arange"-type and "linspace"-type behaviour. So whether this linspace is consolidated into a *_range function, or it is separate, it might also make sense to do the same with date ranges.

Note that date/datetimes have an integer representation, so the issues regarding floating points still stand.

You can draw parallels with the pandas.date_range function. In pandas.date_range you have four parameters (start, end, periods, and freq), and you must specify exactly three of them, i.e. leave one out.

from datetime import datetime, timedelta
import polars as pl
import pandas as pd

start = datetime(2024, 1, 1)
end = datetime(2024, 1, 2)
periods = 3
freq = timedelta(hours=8)

# Combination 1: leave out `periods`
pd.date_range(start=start, end=end, freq=freq)
# Analogous to current `pl.int_range()` and `pl.datetime_range()`
pl.datetime_range(start=start, end=end, interval=freq, eager=True)
pl.int_range(start=0, end=10, step=1, eager=True)

# Combination 2: leave out `freq`
pd.date_range(start=start, end=end, periods=periods)
# Not implemented in polars, but this is np.linspace behaviour

# Combination 3: leave out `end`
pd.date_range(start=start, freq=freq, periods=periods)
# Best there is 
pl.datetime_range(start=start, end=start + freq * periods, interval=freq, closed="left", eager=True)
pl.int_range(start=2, end=2 + 3 * 10, step=3, eager=True)

# Combination 4: leave out `start`
pd.date_range(end=end, freq=freq, periods=periods)
# Best there is
pl.datetime_range(start=end - freq * periods, end=end, interval=freq, closed="right", eager=True)
pl.int_range(start=31 - 3 * 5, end=31, step=3, eager=True)
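
For Combination 2, which has no direct polars equivalent in the snippet above, the linspace-style behaviour can be sketched with the standard library alone. datetime_linspace_sketch is a hypothetical helper that returns a plain list and always includes both endpoints; it is an illustration, not a polars function.

```python
from datetime import datetime


def datetime_linspace_sketch(start, end, periods):
    """Hypothetical helper: `periods` evenly spaced datetimes, both ends included."""
    if periods < 2:
        raise ValueError("periods must be at least 2")
    span = end - start  # a timedelta
    # Each point is derived from its index, so the count is always `periods`.
    return [start + span * i / (periods - 1) for i in range(periods)]


points = datetime_linspace_sketch(datetime(2024, 1, 1), datetime(2024, 1, 2), 3)
for point in points:
    print(point)
```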
rben01 commented 2 months ago

> My proposed solution would be to cover the behavior of pl.int_range() and pl.linspace() in a single function:

Big +1 from me. This is how Julia’s range function works by default and it's great.

  range(start, stop, length)
  range(start, stop; length, step)
  range(start; length, stop, step)
  range(;start, length, stop, step)

  Construct a specialized array with evenly spaced elements and optimized storage (an
  AbstractRange) from the arguments. Mathematically a range is uniquely determined by any three of
  start, step, stop and length. Valid invocations of range are:

    •  Call range with any three of start, step, stop, length.

    •  Call range with two of start, stop, length. In this case step will be assumed to be
       one. If both arguments are Integers, a UnitRange will be returned.

    •  Call range with one of stop or length. start and step will be assumed to be one. 
       [In Python we'd want start to default to 0]
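
The "any three of start, step, stop and length" rule from the Julia docstring can be sketched in Python as follows. solve_range is a hypothetical helper that treats stop as inclusive, as Julia does, and only fills in the missing value; it does not build the range and is not a proposed Polars signature.

```python
def solve_range(*, start=None, stop=None, step=None, length=None):
    """Hypothetical helper: derive the one missing value of an inclusive range."""
    if sum(v is not None for v in (start, stop, step, length)) != 3:
        raise ValueError("specify exactly three of start, stop, step, length")
    if length is None:
        # Number of points reachable from start without passing stop.
        length = int((stop - start) // step) + 1
        # Snap stop to the last point actually reached.
        stop = start + step * (length - 1)
    elif step is None:
        if length < 2:
            raise ValueError("length must be at least 2 to derive a step")
        step = (stop - start) / (length - 1)
    elif stop is None:
        stop = start + step * (length - 1)
    else:
        start = stop - step * (length - 1)
    return start, stop, step, length


print(solve_range(start=0, stop=10, length=6))
print(solve_range(start=1, step=3, length=4))
```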