maximweb opened this issue 8 months ago
This is the kind of thing that I'd like to see developed in a separate package first, then adopted into scipy once completed (if it still makes sense to do so with the data gathered from real use) rather than developed inside of scipy.
Doing some research, I found a paper which I can access:
Gorry, P. A. (1991). "General least-squares smoothing and differentiation of nonuniformly spaced data by the convolution method." Analytical Chemistry 63(5): 534-536. [1]
I am in the process of implementing it. Preliminary results are promising. Odd-sized windows work, but even-sized windows are still a little off. Not sure how to fix that yet.
A quick outlook (just an arbitrary timeit comparison):
scipy: 0.03243189997738227 # original savgol_filter
polyfit: 6.405896900017979 # fitting all polynomials individually
gorry: 0.1077934000059031 # based on linked paper
I will publish the code as soon as I find the time to clean it up a little.
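Until then, here is a minimal sketch of what the brute-force polyfit reference amounts to (the function name and the edge handling are illustrative, not my actual benchmark code):

```python
import numpy as np

def savgol_nonuniform_polyfit(x, y, window_length, polyorder, deriv=0):
    """Brute-force reference: fit a polynomial in every window with
    np.polynomial.polynomial.polyfit and evaluate its deriv-th derivative
    at the window centre. Roughly O(N * window_length * polyorder^2)."""
    half = window_length // 2
    y_out = np.empty_like(y, dtype=float)
    for i in range(len(x)):
        lo = max(0, i - half)
        hi = min(len(x), i + half + 1)   # edge windows simply shrink here
        # Centre the abscissa on x[i] to keep the fit well conditioned.
        c = np.polynomial.polynomial.polyfit(x[lo:hi] - x[i], y[lo:hi], polyorder)
        c_d = np.polynomial.polynomial.polyder(c, deriv)
        y_out[i] = np.polynomial.polynomial.polyval(0.0, c_d)
    return y_out
```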
It looks good if gorry is only 3 times slower.
AFAICS, the Gorry article is for unequally spaced points. An alternative for piecewise constant spacing would be to stitch together SG on subsamples. The main work is then to smooth the connecting points, which will require separate filters for the window-length observations around each junction.
Aside: I have a stalled PR for statsmodels that includes binning to move from unequally spaced points to equally spaced points. That is a similar idea to what is used for fast 1-dim kernel density estimation (using fft convolution on binned, equally spaced data). https://github.com/statsmodels/statsmodels/pull/3492
I don't remember any details (nor the references for the binning approach). This would apply less to piecewise constant spacing where spacings might differ by a larger amount.
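To make the binning idea concrete, a rough sketch (this is not the statsmodels PR code; the helper name, bin handling and the interpolation of empty bins are all made up for illustration):

```python
import numpy as np
from scipy.signal import savgol_filter

def savgol_binned(x, y, n_bins, window_length, polyorder, deriv=0):
    """Sketch of the binning idea: average y onto an equally spaced grid,
    then run the ordinary uniform Savitzky-Golay filter on the binned
    series. Empty bins are filled by linear interpolation."""
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    centres = 0.5 * (edges[:-1] + edges[1:])
    idx = np.clip(np.digitize(x, edges) - 1, 0, n_bins - 1)
    sums = np.bincount(idx, weights=y, minlength=n_bins)
    counts = np.bincount(idx, minlength=n_bins)
    y_binned = np.full(n_bins, np.nan)
    np.divide(sums, counts, out=y_binned, where=counts > 0)
    # Fill empty bins so the uniform filter sees a gap-free series.
    empty = counts == 0
    y_binned[empty] = np.interp(centres[empty], centres[~empty], y_binned[~empty])
    delta = centres[1] - centres[0]
    return centres, savgol_filter(y_binned, window_length, polyorder,
                                  deriv=deriv, delta=delta)
```

The obvious trade-off is the choice of bin width: very unequal spacings either waste bins in the coarse region or average away detail in the dense one.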
A question: Is there any recommended way to test the performance? Maybe even some standardized test signals available to use?
> It looks good if gorry is only 3 times slower.
Turns out this is not always the case. This was for a fairly small signal (1k points and polyorder=2). For larger signals it doesn't scale as well. Increasing the polyorder also reduces performance, as the Gorry approach generates the coefficients recursively in a loop over polyorder.
For now, I have focused on getting it to work. After dealing with multiple edge cases (even windows, edge points), it now gives me results identical to scipy's savgol_filter. While I am not a professional programmer, even I still see some room for improvement in my prototype. That's why I haven't uploaded it yet.
> AFAICS, the Gorry article is for unequally spaced points. An alternative for piecewise constant spacing would be to stitch together SG on subsamples. The main work is then to smooth the connecting points, which will require separate filters for the window-length observations around each junction.
That's what I was thinking initially, but my application is not that performance-critical. Hence, I'd rather develop a more universal filter than deal with my niche case and regret it the next time I have a truly nonuniform signal. In the long run, however, a combination using piecewise classic savgol on equally spaced sections and Gorry for the stitching sections might be the way to go.
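For the equally spaced sections, that piecewise part could look roughly like this (my own illustration, not a finished implementation; the stitching of the transition windows is deliberately left out):

```python
import numpy as np
from scipy.signal import savgol_filter

def savgol_piecewise(x, y, window_length, polyorder, rtol=1e-8):
    """Apply the uniform Savitzky-Golay filter separately on each maximal
    run of (nearly) constant spacing. Samples in segments shorter than
    window_length, and the points around each junction, would still need
    a nonuniform treatment (e.g. Gorry) and are left untouched here."""
    dx = np.diff(x)
    # Positions where the spacing changes -> segment boundaries.
    breaks = np.flatnonzero(~np.isclose(dx[1:], dx[:-1], rtol=rtol)) + 1
    bounds = np.concatenate(([0], breaks + 1, [len(x)]))
    y_out = np.array(y, dtype=float, copy=True)
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        if hi - lo >= window_length:
            y_out[lo:hi] = savgol_filter(y[lo:hi], window_length, polyorder,
                                         delta=x[lo + 1] - x[lo])
    return y_out
```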
So here is finally a draft of what I was able to come up with. I am not entirely convinced, but at least I did learn a lot during the process.
Some remarks regarding this implementation:

- The coefficients are built up over the polynomial orders np.arange(0, polyorder+1).
- I center the windows to 0 and run np.unique on them, so that I only have to calculate parameters for uniquely spaced windows.
- In an earlier version, I did not use np.unique windows and did not center them to 0, but calculated every window regardless of possible repetition. There, I encountered weird spikes in the results when setting a large polyorder. I am not entirely sure where they were coming from. The algorithm requires the calculation of powers of x for each window in the form

$$\left( \begin{matrix} x_0^0 & \dots & x_{windowsize}^0 \\ \vdots & & \vdots \\ x_0^{2*polyorder+1} & \dots & x_{windowsize}^{2*polyorder+1} \end{matrix} \right)$$

so I can only suspect that I ran into some kind of float precision loss during the calculation of $x_i^{2*polyorder+1}$. If it is in fact related to precision loss, this should also occur for signals with very large $\Delta x$. Unfortunately, I was not able to reproduce this.
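To illustrate the suspicion: those high powers make the matrix extremely ill-conditioned when the window values are large and not centered at 0. A tiny self-contained check (the numbers are arbitrary, only meant to show the effect):

```python
import numpy as np

window = np.linspace(100.0, 101.0, 7)   # a window far away from the origin
polyorder = 5
powers = np.arange(2 * polyorder + 2)   # exponents 0 .. 2*polyorder+1

V_raw = window[None, :] ** powers[:, None]                    # not centered
V_centered = (window - window.mean())[None, :] ** powers[:, None]

print(np.linalg.cond(V_raw))       # astronomically large -> precision loss
print(np.linalg.cond(V_centered))  # orders of magnitude smaller
```

So centering each window before raising it to those powers is not just a deduplication trick, it also keeps the problem numerically sane.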
Regarding performance, I only did some very basic tests. The basis was a sine signal with added artificial noise. Based on this, I created three different signals:

- x-uniform (solid lines)
- x-partial (dashed lines)
- x-nonuniform (dotted lines)

These I filtered with three savgol versions:

- scipy.signal.savgol_filter (blue)
- np.polynomial.polynomial.polyfit over all windows -> polyfit (green)
- my nonuniform implementation (red)

Regarding signal length, the performance lies in between the scipy version and the polyfit reference. It does not scale as well as the scipy version. Calculating only uniquely spaced windows pays off for larger signal lengths, as the uniformly spaced segments remain the same, whereas the ratio of unique transition windows relative to the total signal length decreases.
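The exact benchmark script is not included here, but the three test signals could be constructed roughly like this (sizes, ranges and noise level are placeholders, not the values behind the plots):

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_sine(x, noise=0.1):
    """Sine test signal with additive Gaussian noise."""
    return np.sin(x) + noise * rng.standard_normal(x.shape)

# x-uniform: equally spaced abscissa.
x_uniform = np.linspace(0.0, 10.0, 1000)

# x-partial: piecewise-constant spacing (dense start, coarser tail).
x_partial = np.concatenate([np.arange(0.0, 5.0, 0.005),
                            np.arange(5.0, 10.0, 0.02)])

# x-nonuniform: fully irregular spacing from sorted random samples.
x_nonuniform = np.sort(rng.uniform(0.0, 10.0, 1000))

signals = {name: (x, noisy_sine(x))
           for name, x in [("x-uniform", x_uniform),
                           ("x-partial", x_partial),
                           ("x-nonuniform", x_nonuniform)]}
```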
To be fully transparent, for a signal with $N_{points} = 10^5$ my implementation scales poorly, even exceeding the reference polyfit in runtime. Not sure whether this is due to my implementation or the algorithm itself...
The scaling with polyorder is intrinsic to the Gorry algorithm, as it requires recursive calculation of the weight factors. After seeing the performance tests, I think it is not yet suited for scipy, at least in its current form. If you have any recommendations regarding my hobbyist-style implementation, I am open to suggestions.
Is your feature request related to a problem? Please describe.
I've been using scipy's Savitzky-Golay filter multiple times in the past. Now I have a task where I would like to apply the filter to non-uniformly spaced independent data.

In my application, the independent data are not entirely randomly spaced, but originate from lab data where the measurement interval is successively increased, e.g. a 0.1 s interval for 1 hour, followed by a 1 s interval for the rest of the day, followed by a 10 s interval for some days, etc.
Describe the solution you'd like.
I am by far not a signal processing nor mathematics expert, but I am still wondering whether it would be possible (or desired) to extend scipy.signal.savgol_filter to the derivatives of non-uniformly spaced data. This could be triggered by setting the delta parameter not to a single value, but to an array of deltas having the same size as the data.

I found an implementation on stackexchange and a github repo based on the very same proposal.
As far as I understand, the current implementation is limited to uniformly spaced independent data (using the delta and deriv parameters) because this reduces to a single set of coefficients and a very efficient convolution. There are also concerns that a significant increase in runtime caused by changing a single parameter (delta) could be surprising for users with large datasets. Yet I thought it might still be worth discussing whether some form of semi-fast implementation is worth looking into.

So here are some ideas to speed things up: At least for my desired application, the independent variable is not entirely random. Hence, a sort-of piecewise savgol filter could be employed. Considering my example from above, it would not be necessary to recalculate the savgol_coeffs for each window, as the delta varies only occasionally. This would result in a reduced set of deltas for which coefficients have to be calculated, e.g. for a window length of 5 (see the sketch below). Obviously, how many sets of coefficients are needed depends strongly on the level of non-uniformity of delta.

I also found a reference to a paper pointing towards something called polynomial "similarity transformations" in the same context. Unfortunately, I neither have access to the paper, nor do I fully understand the implications of this method.
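To sketch that window-length-5 example (all helper names here are made up, and the weights are computed with a plain least-squares fit rather than savgol_coeffs, which only accepts a scalar delta): with piecewise-constant spacing, only a handful of distinct window spacing patterns ever occur, so only a few coefficient sets are needed.

```python
import numpy as np
from math import factorial

def unique_window_patterns(x, window_length=5):
    """Group all (interior) sliding windows by their spacing pattern.
    For piecewise-constant spacing this set stays small: one pattern per
    uniform run plus a few transition patterns per spacing change."""
    half = window_length // 2
    patterns = {}
    for i in range(half, len(x) - half):
        # Centre the window on the evaluation point so identical spacings
        # map to identical keys; round away float noise before hashing.
        key = tuple(np.round(x[i - half:i + half + 1] - x[i], 12))
        patterns.setdefault(key, []).append(i)
    return patterns

def coeffs_for_pattern(offsets, polyorder, deriv=0):
    """Least-squares weights for one spacing pattern: fit a polynomial to
    the window and return the row evaluating the deriv-th derivative at
    the centre (offset 0). Same idea as savgol_coeffs, but valid for
    arbitrary offsets."""
    A = np.vander(np.asarray(offsets), polyorder + 1, increasing=True)
    return factorial(deriv) * np.linalg.pinv(A)[deriv]

# Example: 0.1 s spacing followed by 1 s spacing.
x = np.concatenate([np.arange(0.0, 5.0, 0.1), np.arange(5.0, 20.0, 1.0)])
patterns = unique_window_patterns(x, window_length=5)
print(len(x), "samples, but only", len(patterns), "distinct window patterns")
weights = {key: coeffs_for_pattern(key, polyorder=2) for key in patterns}
```

For this example only the two uniform patterns plus a few transition windows around the change show up, independent of the total signal length, which is exactly why the piecewise idea should scale well.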
Any thoughts on this?
Describe alternatives you've considered.
No response
Additional context (e.g. screenshots, GIFs)
No response