pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Edge interpolation #18095

Open nzqo opened 1 month ago

nzqo commented 1 month ago

Description

Currently, interpolation does not extend to missing values at the "edge" of the DataFrame:

import polars as pl
df = pl.DataFrame({"test": [0, None, 1, None]}).interpolate()

yields

┌──────┐
│ test │
│ ---  │
│ f64  │
╞══════╡
│ 0.0  │
│ 0.5  │
│ 1.0  │
│ null │
└──────┘

While I can use fill_null to fill the value at the edge, there are many scenarios in which I want to continue the linear trend there as well. The best example is probably regular time series data, where the timestamp should not just be repeated at the end, but rather extended.

I believe this could come either as a strategy in fill_null or as an option on the interpolation expressions.

mcrumiller commented 1 month ago

The definition for interpolate says:

Interpolate intermediate values. The interpolation method is linear.

This makes sense. How can you linearly interpolate when you only have a single point?

nzqo commented 1 month ago

The definition for interpolate says:

Interpolate intermediate values. The interpolation method is linear.

This makes sense. How can you linearly interpolate when you only have a single point?

Well, I am technically talking about extrapolation, since this is about extension past boundaries.

Having just a single data point available is an edge case, and I would suggest that in that case it should either do nothing or raise an error/warning. In the example I gave, though, it is definitely possible to extrapolate linearly.

mcrumiller commented 1 month ago

If you have a range of nulls, interpolate will take the values spanning that range and interpolate. When you have a range of nulls at the edge of your data, you have only a single point at one end, hence why you cannot interpolate.

Are you saying that polars should use the last two available non-null points to define the line that will be used in the extrapolation? This feels like we're in specific-scenario land at this point and a custom function of your own making would be best suited.
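The "last two available non-null points" rule can be sketched in plain Python, independent of polars. The helper name `extrapolate_edges` is hypothetical, and leaving the data untouched when fewer than two points are known is one possible choice, not something polars does:

```python
from typing import Optional


def extrapolate_edges(values: list[Optional[float]]) -> list[Optional[float]]:
    """Fill leading/trailing None values with the line through the two nearest known points."""
    known = [i for i, v in enumerate(values) if v is not None]
    if len(known) < 2:
        # A single point (or none) does not define a line: leave the data untouched.
        return list(values)

    def line_through(i0: int, i1: int):
        # Line through (i0, values[i0]) and (i1, values[i1]), evaluated at index i.
        slope = (values[i1] - values[i0]) / (i1 - i0)
        return lambda i: values[i1] + slope * (i - i1)

    head = line_through(known[0], known[1])    # used before the first known point
    tail = line_through(known[-2], known[-1])  # used after the last known point

    out = list(values)
    for i in range(known[0]):
        out[i] = head(i)
    for i in range(known[-1] + 1, len(values)):
        out[i] = tail(i)
    return out  # interior None values are left for ordinary interpolation


print(extrapolate_edges([0, None, 1, None]))
```

For the example from the issue, the two known points sit at indices 0 and 2, so the slope is 0.5 per row and the trailing null becomes 1.5.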

nzqo commented 1 month ago

Are you saying that polars should use the last two available non-null points to define the line that will be used in the extrapolation?

That would be one way, yes. I'd honestly be surprised if that was such an outlandish scenario. If you consider any time series that has missing values at the end, you'd run into this issue of not being able to fill those nulls without leaving the native API. However, you are right in that it probably shouldn't be part of "linear interpolation".

In pandas, I would just use a spline interpolation, which actually extends past the edges of the data points, or an extrapolation. The former isn't available in polars yet, while for the latter I am not sure how, or whether, I would be able to implement it with the current API. Thoughts on these two options?