scipy / scipy

SciPy library main repository

ENH: Savitzky-Golay filter for non-uniformly spaced data #19812

Open maximweb opened 8 months ago

maximweb commented 8 months ago

Is your feature request related to a problem? Please describe.

I have used scipy's Savitzky-Golay filter multiple times in the past. Now I have a task where I would like to apply the filter to non-uniformly spaced independent data.

In my application, the independent data are not entirely randomly spaced but originate from lab data where the measurement interval is successively increased: e.g. a 0.1 s interval for 1 hour, followed by a 1 s interval for the rest of the day, followed by a 10 s interval for some days, etc.

Describe the solution you'd like.

I am by far not a signal processing nor mathematics expert, but I am still wondering whether it would be possible (or desirable) to extend scipy.signal.savgol_filter to non-uniformly spaced data, including its derivatives. This could be triggered by setting the delta parameter not to a single value but to an array of deltas with the same size as the data.
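A hypothetical call under this proposal might look like the following sketch; the array-valued delta in the commented lines is the suggestion, not an existing feature:

```python
import numpy as np
from scipy.signal import savgol_filter

rng = np.random.default_rng(0)

# Current API: uniformly spaced data, scalar delta
x = np.arange(0.0, 10.0, 0.1)
y = np.sin(x) + 0.1 * rng.normal(size=x.size)
dy = savgol_filter(y, window_length=11, polyorder=2, deriv=1, delta=0.1)

# Proposed extension (hypothetical, NOT supported today): array-valued delta
# deltas = np.diff(x_nonuniform)                    # one spacing per sample gap
# dy = savgol_filter(y, 11, 2, deriv=1, delta=deltas)
```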

I found an implementation on Stack Exchange and a GitHub repo based on the very same proposal.

As far as I understand, the current implementation is limited to uniformly spaced independent data (using the delta and deriv parameters) because this reduces the filter to a single set of coefficients and a very efficient convolution operation. There are also concerns that a significant increase in runtime caused by changing a single parameter (delta) could be at least surprising for users with large datasets. Yet I thought it might still be worth discussing whether some form of semi-fast implementation is worth looking into.
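To illustrate the point about a single coefficient set: for uniform spacing, the whole filter reduces to one convolution. A minimal check using the existing API:

```python
import numpy as np
from scipy.signal import savgol_coeffs, savgol_filter

rng = np.random.default_rng(0)
y = np.sin(np.linspace(0, 4 * np.pi, 200)) + 0.1 * rng.normal(size=200)

window, order = 11, 3
coeffs = savgol_coeffs(window, order)            # one coefficient set for all windows
smoothed = np.convolve(y, coeffs, mode="same")   # a single convolution pass

# Interior points match savgol_filter exactly; only the edges differ,
# because the default mode="interp" fits the edge windows separately.
reference = savgol_filter(y, window, order)
half = window // 2
assert np.allclose(smoothed[half:-half], reference[half:-half])
```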

So here are some ideas to speed things up. At least for my desired application, the independent variable is not entirely random, so a sort of piecewise savgol filter could be employed. Considering my example from above, it would not be necessary to recalculate the savgol_coeffs for each window, as the delta varies only occasionally. This would result in a reduced set of deltas for which the coefficients would have to be calculated, e.g. for a window length of 5:

[0.1, 0.1, 0.1, 0.1, 0.1],
[0.1, 0.1, 0.1, 0.1, 1.0],
[0.1, 0.1, 0.1, 1.0, 1.0],
[0.1, 0.1, 1.0, 1.0, 1.0],
[0.1, 1.0, 1.0, 1.0, 1.0],
[1.0, 1.0, 1.0, 1.0, 1.0],
...

Obviously, how many sets of coefficients are needed depends strongly on the level of non-uniformity of delta.
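A quick way to check how well this caching idea would work on a given signal is to count the distinct spacing patterns; a minimal sketch, where the rounding tolerance is an assumption:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def unique_delta_windows(x, window_length):
    """Count the distinct spacing patterns among all filter windows.

    Feasibility check for the caching idea above: if the number of unique
    patterns is small, coefficient sets could be computed once and reused.
    """
    deltas = np.diff(x)                                  # spacing between consecutive samples
    windows = sliding_window_view(deltas, window_length - 1)
    # Round to suppress floating-point jitter before grouping
    patterns, inverse = np.unique(np.round(windows, 12), axis=0, return_inverse=True)
    return patterns, inverse

# Example: 0.1 s sampling followed by 1.0 s sampling
x = np.concatenate([np.arange(0, 10, 0.1), np.arange(10, 100, 1.0)])
patterns, inverse = unique_delta_windows(x, window_length=5)
print(len(patterns))  # prints 5: a handful of unique patterns for this two-rate signal
```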

I also found a reference to a paper pointing towards something called polynomial "similarity transformations" in the same context. Unfortunately, I neither have access to the paper nor fully understand the implications of this method.

Any thoughts on this?

Describe alternatives you've considered.

No response

Additional context (e.g. screenshots, GIFs)

No response

rkern commented 8 months ago

This is the kind of thing that I'd like to see developed in a separate package first, then adopted into scipy once completed (if it still makes sense to do so with the data gathered from real use) rather than developed inside of scipy.

maximweb commented 8 months ago

Doing some research, I found a paper which I can access:

Gorry, P. A. (1991). "General least-squares smoothing and differentiation of nonuniformly spaced data by the convolution method." Analytical Chemistry 63(5): 534-536. [1]

I am in the process of implementing it. Preliminary results are promising. Odd-sized windows work, but even-sized windows are still a little off. Not sure how to fix that yet.

A quick outlook (just an arbitrary timeit comparison):

scipy: 0.03243189997738227 # original savgol_filter
polyfit: 6.405896900017979 # fitting all polynomials individually
gorry: 0.1077934000059031 # based on linked paper
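For context, these numbers came from an ad-hoc timeit run; a comparable harness might look like the sketch below, where savgol_gorry is a placeholder for the not-yet-published implementation:

```python
import timeit
import numpy as np
from scipy.signal import savgol_filter

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 10.0, 1_000))       # 1k nonuniform sample points
y = np.sin(x) + 0.1 * rng.normal(size=x.size)

print("scipy:", timeit.timeit(lambda: savgol_filter(y, 11, 2), number=100))
# Hypothetical, not a published function:
# print("gorry:", timeit.timeit(lambda: savgol_gorry(x, y, 11, 2), number=100))
```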

I will publish the code as soon as I find the time to clean it up a little.

[1] https://pubs.acs.org/doi/abs/10.1021/ac00005a031

josef-pkt commented 8 months ago

It looks good if gorry is only 3 times slower.

AFAICS, the Gorry article is for unequally spaced points. An alternative for piecewise-constant spacing would be to stitch together SG on subsamples. The main work is then smoothing the connecting points, which will require separate filters for the window-length observations.
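A minimal sketch of this stitching idea, assuming the segment boundaries are known in advance; savgol_piecewise is an illustrative helper, and the boundary windows are deliberately left to a separate nonuniform treatment:

```python
import numpy as np
from scipy.signal import savgol_filter

def savgol_piecewise(y, segment_starts, deltas, window_length, polyorder, deriv=0):
    """Apply a standard SG filter per uniformly spaced segment (illustrative).

    segment_starts: indices where a new sampling interval begins, e.g. [0, 100]
    deltas: the constant spacing within each segment, one value per segment
    Assumes each segment is at least window_length long. The window_length
    points around each boundary would still need a dedicated nonuniform
    filter; here they simply keep the per-segment result.
    """
    out = np.empty_like(y, dtype=float)
    bounds = list(segment_starts) + [len(y)]
    for start, stop, delta in zip(bounds[:-1], bounds[1:], deltas):
        out[start:stop] = savgol_filter(y[start:stop], window_length,
                                        polyorder, deriv=deriv, delta=delta)
    return out
```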

Aside: I have a stalled PR for statsmodels that includes binning to move from unequally spaced points to equally spaced points. That is a similar idea to what is used for fast 1-dim kernel density estimation (using FFT convolution on binned, equally spaced data). https://github.com/statsmodels/statsmodels/pull/3492

I don't remember any details (nor the references for the binning approach). This would apply less well to piecewise-constant spacing where spacings might differ by a larger amount.
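For reference, the binning idea might look roughly like this sketch; mean-per-bin aggregation onto a uniform grid, with the bin width choice and the handling of empty bins left open:

```python
import numpy as np
from scipy.signal import savgol_filter

def savgol_binned(x, y, bin_width, window_length, polyorder):
    """Average samples into uniform bins, then apply the standard SG filter.

    Illustrative only: empty bins yield NaN here and would need interpolation.
    """
    edges = np.arange(x.min(), x.max() + bin_width, bin_width)
    idx = np.digitize(x, edges) - 1                  # bin index per sample
    counts = np.bincount(idx, minlength=len(edges))
    sums = np.bincount(idx, weights=y, minlength=len(edges))
    with np.errstate(invalid="ignore"):
        y_binned = sums / counts                     # NaN where a bin is empty
    return edges, savgol_filter(y_binned, window_length, polyorder, delta=bin_width)
```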

maximweb commented 8 months ago

A question: is there any recommended way to test performance? Are there maybe even some standardized test signals available to use?

> It looks good if gorry is only 3 times slower.

Turns out this is not always the case. That was for a fairly small signal (1k points and polyorder=2). For larger signals it doesn't scale as well. Increasing the polyorder also reduces performance, as the Gorry approach recursively generates the coefficients in a loop over the polyorder.

For now, I have focused on getting it to work. After dealing with multiple edge cases (even windows, edge points), it now gives me results identical to scipy's savgol_filter. While I am not a professional programmer, even I still see some room for improvement in my prototype. That's why I haven't uploaded it yet.

> AFAICS, the Gorry article is for unequally spaced points. An alternative for piecewise-constant spacing would be to stitch together SG on subsamples. The main work is then smoothing the connecting points, which will require separate filters for the window-length observations.

That's what I was thinking initially, but my application is not that performance-hungry. Hence, I'd rather develop a more universal filter than deal only with my niche case and regret it the next time I have a truly nonuniform signal. In the long run, however, a combination using piecewise classic savgol on equally spaced sections and Gorry for the stitching sections might be the way to go.

maximweb commented 8 months ago

So here is finally a draft of what I was able to come up with. I am not entirely convinced by it, but at least I learned a lot in the process.

Some remarks regarding this implementation:

Performance

Regarding performance, I did only some very basic tests. The basis was a sine signal with added artificial noise. Based on this, I created three different signals:

  1. with uniformly spaced $x$ -> x-uniform, solid lines
  2. with three uniformly spaced segments of $x$ -> x-partial, dashed lines
  3. with random noise added to all $x$ -> x-nonuniform, dotted lines

These I filtered with three savgol versions:

  1. scipy.signal.savgol_filter (blue)
  2. A reference implementation, which basically loops np.polynomial.polynomial.polyfit over all windows -> polyfit (green; see the sketch after this list)
  3. The linked implementation of the Gorry algorithm -> nonuniform (red)
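The per-window polyfit reference in item 2 might look roughly like this; it is my reconstruction, not the exact benchmark code, with evaluation at each window's center point and edge points left as NaN:

```python
import numpy as np
from numpy.polynomial import polynomial as P

def savgol_polyfit_reference(x, y, window_length, polyorder, deriv=0):
    """Fit a polynomial in every window and evaluate it at the window center.

    Illustrative reference implementation: slow but independent of spacing.
    """
    half = window_length // 2
    out = np.full_like(y, np.nan, dtype=float)
    for i in range(half, len(y) - half):
        xw = x[i - half:i + half + 1]
        yw = y[i - half:i + half + 1]
        coeffs = P.polyfit(xw - x[i], yw, polyorder)   # center x for conditioning
        # deriv-th derivative of the fitted polynomial, evaluated at the center
        out[i] = P.polyval(0.0, P.polyder(coeffs, deriv)) if deriv else coeffs[0]
    return out
```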

Signal length

(figure: runtime vs. signal length for the three implementations and three signals)

Regarding signal length, the performance lies in between the scipy version and the polyfit reference, though it does not scale as well as the scipy version. Calculating only the uniquely spaced windows pays off at larger signal lengths, since the uniformly spaced segments stay the same while the ratio of unique transition windows to the total signal length decreases.

Window length

To be fully transparent: for a signal with $N_{points} = 10^5$, my implementation scales poorly with window length, even exceeding the runtime of the polyfit reference. I am not sure whether this is down to my implementation or the algorithm itself...

(figure: runtime vs. window length for $N_{points} = 10^5$)

Conclusion

After seeing the performance tests, I think it is not yet suited for scipy, at least in its current form. If you have any recommendations regarding my hobbyist-style implementation, I am open to suggestions.