scipy / scipy

SciPy library main repository

ENH: Savitzky-Golay filter for non-uniformly spaced data #19812

Open maximweb opened 8 months ago

maximweb commented 8 months ago

Is your feature request related to a problem? Please describe.

I have used scipy's Savitzky-Golay filter multiple times in the past. Now I have a task where I would like to apply the filter to non-uniformly spaced independent data.

In my application, the independent data are not entirely randomly spaced but originate from lab data where the measurement interval is successively increased: e.g. a 0.1 s interval for 1 hour, followed by a 1 s interval for the rest of the day, followed by a 10 s interval for some days, etc.

Describe the solution you'd like.

I am by far not a signal processing nor mathematics expert, but I am still wondering whether it would be possible (or desirable) to extend scipy.signal.savgol_filter to non-uniformly spaced data, including its derivatives. This could be triggered by setting the delta parameter not to a single value but to an array of deltas with the same size as the data.
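A hypothetical call under this proposal might look like the following sketch; the array-valued delta in the commented lines is the suggestion, not an existing feature:

```python
import numpy as np
from scipy.signal import savgol_filter

rng = np.random.default_rng(0)

# Current API: uniformly spaced data, scalar delta
x = np.arange(0.0, 10.0, 0.1)
y = np.sin(x) + 0.1 * rng.normal(size=x.size)
dy = savgol_filter(y, window_length=11, polyorder=2, deriv=1, delta=0.1)

# Proposed extension (hypothetical, NOT supported today): array-valued delta
# deltas = np.diff(x_nonuniform)                    # one spacing per sample gap
# dy = savgol_filter(y, 11, 2, deriv=1, delta=deltas)
```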

I found an implementation on Stack Exchange and a GitHub repo based on the very same proposal.

As far as I understand, the current implementation is limited to uniformly spaced independent data (using the delta and deriv parameters) because this reduces the filter to a single set of coefficients and a very efficient convolution operation. There are also concerns that a significant increase in runtime caused by changing a single parameter (delta) could be at least surprising for users with large datasets. Yet I thought it might still be worth discussing whether some form of semi-fast implementation is worth looking into.
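To illustrate the point about a single coefficient set: for uniform spacing, the whole filter reduces to one convolution. A minimal check using the existing API:

```python
import numpy as np
from scipy.signal import savgol_coeffs, savgol_filter

rng = np.random.default_rng(0)
y = np.sin(np.linspace(0, 4 * np.pi, 200)) + 0.1 * rng.normal(size=200)

window, order = 11, 3
coeffs = savgol_coeffs(window, order)            # one coefficient set for all windows
smoothed = np.convolve(y, coeffs, mode="same")   # a single convolution pass

# Interior points match savgol_filter exactly; only the edges differ,
# because the default mode="interp" fits the edge windows separately.
reference = savgol_filter(y, window, order)
half = window // 2
assert np.allclose(smoothed[half:-half], reference[half:-half])
```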

So here are some ideas to speed things up. At least for my desired application, the independent variable is not entirely random, so a sort of piecewise savgol filter could be employed. Considering my example from above, it would not be necessary to recalculate the savgol_coeffs for each window, as the delta varies only occasionally. This would result in a reduced set of deltas for which the coefficients would have to be calculated, e.g. for a window length of 5:

[0.1, 0.1, 0.1, 0.1, 0.1],
[0.1, 0.1, 0.1, 0.1, 1.0],
[0.1, 0.1, 0.1, 1.0, 1.0],
[0.1, 0.1, 1.0, 1.0, 1.0],
[0.1, 1.0, 1.0, 1.0, 1.0],
[1.0, 1.0, 1.0, 1.0, 1.0],
...

Obviously, how many sets of coefficients are needed depends strongly on the level of non-uniformity of delta.
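A quick way to check how well this caching idea would work on a given signal is to count the distinct spacing patterns; a minimal sketch, where the rounding tolerance is an assumption:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def unique_delta_windows(x, window_length):
    """Count the distinct spacing patterns among all filter windows.

    Feasibility check for the caching idea above: if the number of unique
    patterns is small, coefficient sets could be computed once and reused.
    """
    deltas = np.diff(x)                                  # spacing between consecutive samples
    windows = sliding_window_view(deltas, window_length - 1)
    # Round to suppress floating-point jitter before grouping
    patterns, inverse = np.unique(np.round(windows, 12), axis=0, return_inverse=True)
    return patterns, inverse

# Example: 0.1 s sampling followed by 1.0 s sampling
x = np.concatenate([np.arange(0, 10, 0.1), np.arange(10, 100, 1.0)])
patterns, inverse = unique_delta_windows(x, window_length=5)
print(len(patterns))  # prints 5: a handful of unique patterns for this two-rate signal
```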

I also found a reference to a paper pointing towards something called polynomial "similarity transformations" in the same context. Unfortunately, I neither have access to the paper nor fully understand the implications of this method.

Any thoughts on this?

Describe alternatives you've considered.

No response

Additional context (e.g. screenshots, GIFs)

No response

rkern commented 8 months ago

This is the kind of thing that I'd like to see developed in a separate package first, then adopted into scipy once completed (if it still makes sense to do so with the data gathered from real use) rather than developed inside of scipy.

maximweb commented 8 months ago

Doing some research, I found a paper which I can access:

Gorry, P. A. (1991). "General least-squares smoothing and differentiation of nonuniformly spaced data by the convolution method." Analytical Chemistry 63(5): 534-536. [1]

I am in the process of implementing it. Preliminary results are promising. Odd-sized windows work, but even-sized windows are still a little off. Not sure how to fix that yet.

A quick outlook (just an arbitrary timeit comparison):

scipy: 0.03243189997738227 # original savgol_filter
polyfit: 6.405896900017979 # fitting all polynomials individually
gorry: 0.1077934000059031 # based on linked paper
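For context, these numbers came from an ad-hoc timeit run; a comparable harness might look like the sketch below, where savgol_gorry is a placeholder for the not-yet-published implementation:

```python
import timeit
import numpy as np
from scipy.signal import savgol_filter

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 10.0, 1_000))       # 1k nonuniform sample points
y = np.sin(x) + 0.1 * rng.normal(size=x.size)

print("scipy:", timeit.timeit(lambda: savgol_filter(y, 11, 2), number=100))
# Hypothetical, not a published function:
# print("gorry:", timeit.timeit(lambda: savgol_gorry(x, y, 11, 2), number=100))
```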

I will publish the code as soon as I find the time to clean it up a little.

[1] https://pubs.acs.org/doi/abs/10.1021/ac00005a031

josef-pkt commented 8 months ago

It looks good if gorry is only 3 times slower.

AFAICS, the Gorry article is for unequally spaced points. An alternative for piecewise-constant spacing would be to stitch together SG on subsamples. The main work is then smoothing the connecting points, which will require separate filters for the window-length observations.
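A minimal sketch of this stitching idea, assuming the segment boundaries are known in advance; savgol_piecewise is an illustrative helper, and the boundary windows are deliberately left to a separate nonuniform treatment:

```python
import numpy as np
from scipy.signal import savgol_filter

def savgol_piecewise(y, segment_starts, deltas, window_length, polyorder, deriv=0):
    """Apply a standard SG filter per uniformly spaced segment (illustrative).

    segment_starts: indices where a new sampling interval begins, e.g. [0, 100]
    deltas: the constant spacing within each segment, one value per segment
    Assumes each segment is at least window_length long. The window_length
    points around each boundary would still need a dedicated nonuniform
    filter; here they simply keep the per-segment result.
    """
    out = np.empty_like(y, dtype=float)
    bounds = list(segment_starts) + [len(y)]
    for start, stop, delta in zip(bounds[:-1], bounds[1:], deltas):
        out[start:stop] = savgol_filter(y[start:stop], window_length,
                                        polyorder, deriv=deriv, delta=delta)
    return out
```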

Aside: I have a stalled PR for statsmodels that includes binning to move from unequally spaced points to equally spaced points. That is a similar idea to what is used for fast 1-dim kernel density estimation (using FFT convolution on binned, equally spaced data). https://github.com/statsmodels/statsmodels/pull/3492

I don't remember any details (nor the references for the binning approach). This would apply less well to piecewise-constant spacing where spacings might differ by a larger amount.
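For reference, the binning idea might look roughly like this sketch; mean-per-bin aggregation onto a uniform grid, with the bin width choice and the handling of empty bins left open:

```python
import numpy as np
from scipy.signal import savgol_filter

def savgol_binned(x, y, bin_width, window_length, polyorder):
    """Average samples into uniform bins, then apply the standard SG filter.

    Illustrative only: empty bins yield NaN here and would need interpolation.
    """
    edges = np.arange(x.min(), x.max() + bin_width, bin_width)
    idx = np.digitize(x, edges) - 1                  # bin index per sample
    counts = np.bincount(idx, minlength=len(edges))
    sums = np.bincount(idx, weights=y, minlength=len(edges))
    with np.errstate(invalid="ignore"):
        y_binned = sums / counts                     # NaN where a bin is empty
    return edges, savgol_filter(y_binned, window_length, polyorder, delta=bin_width)
```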

maximweb commented 8 months ago

A question: is there any recommended way to test performance? Are there maybe even some standardized test signals available to use?

> It looks good if gorry is only 3 times slower.

Turns out this is not always the case. That was for a fairly small signal (1k points and polyorder=2). For larger signals it doesn't scale as well. Increasing the polyorder also reduces performance, as the Gorry approach recursively generates the coefficients in a loop over the polyorder.

For now, I have focused on getting it to work. After dealing with multiple edge cases (even windows, edge points), it now gives me results identical to scipy's savgol_filter. While I am not a professional programmer, even I still see some room for improvement in my prototype. That's why I haven't uploaded it yet.

> AFAICS, the Gorry article is for unequally spaced points. An alternative for piecewise-constant spacing would be to stitch together SG on subsamples. The main work is then smoothing the connecting points, which will require separate filters for the window-length observations.

That's what I was thinking initially, but my application is not that performance-hungry. Hence, I'd rather develop a more universal filter than deal only with my niche case and regret it the next time I have a truly nonuniform signal. In the long run, however, a combination using piecewise classic savgol on equally spaced sections and Gorry for the stitching sections might be the way to go.

maximweb commented 8 months ago

So here is finally a draft of what I was able to come up with. I am not entirely convinced by it, but at least I learned a lot in the process.

Some remarks regarding this implementation:

Performance

Regarding performance, I did only some very basic tests. The basis was a sine signal with added artificial noise. Based on this, I created three different signals:

  1. with uniformly spaced $x$ -> x-uniform, solid lines
  2. with three uniformly spaced segments of $x$ -> x-partial, dashed lines
  3. with random noise added to all $x$ -> x-nonuniform, dotted lines

These I filtered with three savgol versions:

  1. scipy.signal.savgol_filter (blue)
  2. A reference implementation, which basically loops np.polynomial.polynomial.polyfit over all windows -> polyfit (green; see the sketch after this list)
  3. The linked implementation of the Gorry algorithm -> nonuniform (red)
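The per-window polyfit reference in item 2 might look roughly like this; it is my reconstruction, not the exact benchmark code, with evaluation at each window's center point and edge points left as NaN:

```python
import numpy as np
from numpy.polynomial import polynomial as P

def savgol_polyfit_reference(x, y, window_length, polyorder, deriv=0):
    """Fit a polynomial in every window and evaluate it at the window center.

    Illustrative reference implementation: slow but independent of spacing.
    """
    half = window_length // 2
    out = np.full_like(y, np.nan, dtype=float)
    for i in range(half, len(y) - half):
        xw = x[i - half:i + half + 1]
        yw = y[i - half:i + half + 1]
        coeffs = P.polyfit(xw - x[i], yw, polyorder)   # center x for conditioning
        # deriv-th derivative of the fitted polynomial, evaluated at the center
        out[i] = P.polyval(0.0, P.polyder(coeffs, deriv)) if deriv else coeffs[0]
    return out
```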

Signal length

(figure: runtime vs. signal length for the three implementations and three signals)

Regarding signal length, the performance lies in between the scipy version and the polyfit reference, though it does not scale as well as the scipy version. Calculating only the uniquely spaced windows pays off at larger signal lengths, since the uniformly spaced segments stay the same while the ratio of unique transition windows to the total signal length decreases.

Window length

To be fully transparent: for a signal with $N_{points} = 10^5$, my implementation scales poorly with window length, even exceeding the runtime of the polyfit reference. I am not sure whether this is down to my implementation or the algorithm itself...

(figure: runtime vs. window length for $N_{points} = 10^5$)

Conclusion

After seeing the performance tests, I think it is not yet suited for scipy, at least in its current form. If you have any recommendations regarding my hobbyist-style implementation, I am open to suggestions.