Improve `WhittakerSmooth`, `AirPLS`, and `ArPLS` performance - sparse matrix operations

Description

AirPLS (Adaptive Iteratively Reweighted Penalized Least Squares) and ArPLS (Asymmetrically Reweighted Penalized Least Squares) are powerful algorithms for removing complex non-linear baselines from spectral signals. However, their computational cost can be significant, especially when processing large numbers of spectra. Currently, we use the csc_matrix representation from scipy.sparse to optimize performance, but further improvements are needed.

Improving Attempts

To improve the performance, I have tried just-in-time compilation of some key functions using numba. However, numba does not support the csc_matrix type, and I cannot JIT compile the code. To overcome this issue, I thought of looking for a numba compatible representation of sparse matrices, but could not find one. Therefore, I have created my own, together with some functions to make basic algebra operations with them code to Gist. Unfortunately, this did not improve the performance over the current implementation.

Hacktoberfest Challenge

We invite open source developers to contribute to our project during Hacktoberfest. The goal is to improve the performance of both algorithms

Here are some ideas to work on:

Find a more efficient way to JIT compile the code using tools like numba.
Investigate parallel or distributed computing techniques to speed up the processing of multiple spectra.

How to Contribute

Here is the contributing guidelines

Contact

We can have the the conversation in the Issue or the Discussion

Resources

Here are some relevant resources and references for understanding the theory and implementation of the AirPLS and ArPLS algorithms:

Paper on AirPLS: Z.-M. Zhang, S. Chen, and Y.-Z. Liang, Baseline correction using adaptive iteratively reweighted penalized least squares. Analyst 135 (5), 1138-1146 (2010).
Paper on ArPLS: Sung-June Baek, Aaron Park, Young-Jin Ahn, Jaebum Choo Baseline correction using asymmetrically reweighted penalized least squares smoothing

Hello,

Suspected Main Problem

As far as I can judge, the problem also affects the WhittakerSmooth-class because all of them rely on the sparse matrix representation of the involved weighting and penalty matrices. This is way more performant than using the pure dense NumPy-Arrays, yet dense NumPy-Arrays are part of the solution here. With scipy.sparse most of the functions (especially the spsolve) do not distinguish between general sparse matrices that can have nonzeros at any position, e.g.,

A = [[ a    0    0    0    b]
     [ 0    c    0    d    0]
     [ 0    e    f    0    g]
     [ 0    0    0    h    0]
     [ 0    0    0    i    j]]

and the highly structured sparse format encountered in all the described functionalities (I'll call them Whittaker-Like) which is banded (only a few central diagonals are non-zero) and symmetric, e.g.,

B = [[ a    b    0    0    0]
     [ b    c    d    0    0]
     [ 0    d    e    f    0]
     [ 0    0    f    g    h]
     [ 0    0    0    h    i]]

Usually sparse solvers rely on LU-decomposition, but as far as I'm aware, the LU-decomposition of a sparse matrix like A is not necessarily sparse as well but dense, so the gain is not as much as it could be even when using a high-performance sparse solver like spsolve from pypardiso.

Proposed solution

In contrast to this, B has an LU-decomposition that is (almost) as sparse as B itself with (almost) the same banded structure, so the performance can be increased quite a lot. The LAPACK-wrapper solve_banded from scipy.linalg exploits this together with a special banded storage as dense Array which makes the excecution way faster. Actually, one can go down even further. Since for Whittaker-Like problems, the matrix to invert is I + lam * D.T @ D where I is the identity matrix and D.T @ D is the square of the finite difference matrix D (which is of order difference), the problem simplifies. D.T @ D has difference zero eigenvalues and if one adds I to it, one basically lifts the main diagonal of D.T @ D. So, all the resulting eigenvalues are lifted to be > 0 (at least mathematically, but not numerically in all cases). This allows to rely on Cholesky decomposition rather than LU-decomposition which is again faster because it only computes a lower triangular matrix L instead of a lower and an upper triangular matrix L and U (but the banded structure (almost) remains for both). Again, SciPy offers the LAPACK-wrapper solveh_banded for doing so. Typically what I do is I first attempt to use solveh_banded and upon a numpy.linalg.LinalgError which indicates that the Cholesky decomposition was not possible, I fall back to the LU-decomposition via solve_banded which is more stable due to the fact that it also uses partial pivoting (row swapping) to ensure stability.

a_banded = ... # matrix to invert in banded storage with `n_upp` super- and `n_upp` sub-diagonals above or below the main diagonal
b = ... # right hand side vector
try:
    # Attempt banded Cholesky decomposition
    x = solveh_banded(
        a_banded[0 : n_upp + 1, ::], # taking only the superdiagonals and the main diagonal since symmetric
        b,
        lower = False, # only the superdiagonals and main diagonal were taken
    )

except np.linalg.LinalgError:
    # Fall back to banded LU decomposition
    x = solve_banded(
        l_and_u = (n_upp, n_upp), # number of super- and subdiagonals
        a_banded,
        b,
    )

Further tweaks

What I just described is as far as one can get for the general case of arbitrary difference order of D. However, there are special cases that allow for further enhancement:

difference = 1: this results in a tridiagonal I + lam * D.T @ D which can be solved even faster. This is already included in scipy.linalg.solve_banded for LU-decomposition, but unfortunately, there is no equivalent for scipy.linalg.solveh_banded for Cholesky-factorisation
difference = 2: this results in a pentadiagonal I + lam * D.T @ D for whose solution the Python package pentapy is highly optimised. As an added bonus differences = 2 is used in many baseline correction algorithms, so I'd say this will be the best speedup for the Air- and ArPLS.

pentapy also gives an overview over the performances of a variety of banded solvers which indicates that there is still a lot of performance to gain when moving away from SciPy's spsolve: SparseBandedSolversPerformance

Besides, there is another added bonus. All these functions can solve (I + lam * D.T @ D)Z = Y for Z also when Y has multiple columns, i.e., spectra to smooth. So, the same expensive factorisation - after computed only once - can be used for solving each column in Y for the corresponding column in Z which is way less expensive since the solving is only a forward and a backward substitution. On top of that, the loops then also happen in low-level languages and not in Python anymore. Consequently, the Python loops in all the Whittaker-Like functions of chemotools could simply be dropped for performance.

Hi @IruNikZe thanks a lot for the fantastic explanation and it looks super promising! I will turn this issue into a branch and you can add your changes there!!

@IruNikZe I have assigned you to that issue, and this is the branch you can add your changes too :smile: :smile:

44-improve-airpls-and-arpls-performance-sparse-matrix-operations

Hi-hi, It's been a while, so I wanted to give a quick update. From testing I figured out two things:

1) Numerical instability of Whittaker smoother

Problem For difference orders m > 4 and large signal n > 1000 solving the linear system (W + lamba * D^T @ D)z = Wy becomes instable. Unfortunately, these sizes are in the order of magnitude of the typical size of spectra and m = 6 is required for choosing lambda based on the signal in a spatially adaptive fashion (which is the next step I would want to tackle in a later issue). The problem affects both the banded Cholesky decomposition as well as the more robust (but more expensive) banded pivoted LU-decomposition. pentapy fails as well because it uses the same matrix to solve the system.

Cause The problem comes from squaring the linear system before solving because (W + lamba * D^T @ D) = (sqrt(W)^T @ sqrt(W) + sqrt(lambda) * sqrt(lambda) * D^T @ D, so

the weights sqrt(W) are being squared which boosts any imbalanced weight distributions even further and also causes more imbalances in normal weights (compare 1 and 1.5 which are close, but squared 1 and 2.25 are more apart, and here the order of magnitude was the same, so it can get way more worse)
sqrt(lambda) which can be arbitrarily small or large is also squared and this affects
the squaring of D which is the most critical in this context. Since D does not have full row rank (less rows than columns in this case), D^T @ D has m zero eigenvalues, so solving a system where lambda is so large that basically D^T @ D is the only numerically present summands causes a breakdown of the system.

All in all, this blows up the condition number of A = (W + lamba * D^T @ D) since it is squared. High condition numbers indicate that the system is ill-conditioned and slightest perturbations have strong effects on the final solution. With the limited precision of float64, this causes build-up of rounding errors until the system - even though solvable from a mathematical point of view - cannot be solved numerically (see this and related Wikipedia articles).

Partial solution While it's not possible to properly cure the case when lambda is so large that the system becomes too ill-conditioned to be solved (at least this problem should not and cannot be tackled from within the solver), all the other cases can be handled way by avoiding the squaring just described. This requires some insight in how the solver works, but basically A is decomposed in two uniquely defined triangular matrices L and L^T that have the same entries, but are transposed, i.e., A = L @ L^T. Triangular matrices are easy to solve, but we dont need uniquely defined matrices because A = S @ S^T can be solved in the exact same way as for L as long as S is lower triangular. One such decomposition can be derived when writing A = B^T @ B where B is a full matrix with no special structure. From the above, it is obvious that B = [[sqrt(W)], [sqrt(lambda) * D]] (if you multiply it out one by one, you get A). Now, B needs to be triangularized, which can be achieved by QR-decomposition B = Q @ R. Here, Q is an orthonormal matrix (Q^T = Q^(-1); Q^T @ Q = I), and R is upper triangular. With this A = B^T @ B = R^T @ Q^T @ Q @ R = R^T @ I @ R = R^T @ R, i.e., we get a triangular decomposition, but A was never formed by squaring which leaves the condition number relatively low (it is not squared like before), i.e., the system can be solved with a strongly improved numerical accuaracy. Since Q cancels out, only R needs to be formed. To keep the computations accurate, I resorted to a floating point exact sparse Givens rotation as well as fused-multiply-add operations (FMA). For this I went down to Cython (1) that I combined with SciPy's cython_lapack for solving the triangular system after the QR-decomposition. I'll go for a draft pull request first for this because we need to think about packaging in more detail then and probably first need to set up the packaging in the CI pipeline (if we wanna go down this road at all). If not, I also can re-write it in NumPy which will be slower and less accurate since it has no FMA, or Numba which also does not easily offer an FMA.

(1) and also a Rust implementation which would probably make the package more scalable as functionality grows and new external functionalities are required which can be kind of a nightmare in Cython.

Added bonus Since the QR-decomposition was some effort to code and also optimize, I went a step further and extended the system of the Whittaker smoother to solve (K^T @ W_1 @ K + lambda * D^T @ W_2 @ D)z = K^T @ W_1 @ y where W_2 will be need later for spatial adaptiveness. The added bonus is the kernel matrix K which also needs to be banded. Assume your spectrum was blurred by, e.g., a Lorentzian peak (like the natural linewidth), this can deconvolved out and z will not become a smooth version of y, but a smooth version with the peak width reduced since the Lorentzian is "missing". Of course other point spread functions can also be deconvolved out with this. The original smoothing solver just has K = I.

2) Performance of baseline algorithms

Problem The baseline algorithms can take some time for converging.

Cause The initialisation of baseline algorithms is key here. Right now, they are initialised with a constant vector, but with a more elaborate initialisation, the convergence can be sped up dramatcially.

Solution Typically, one would divide the signal in like 20 windows, find the minima of these windows and interpolate them. This estimate is often very very close to the baseline (given that lambda is well chosen). Alongside this, I would also give the baseline two modes:

the baseline is below the peaks (e.g., absorbance spectroscopy)
the baseline is above the peaks (e.g., transmittance spectroscopy)

Implications I would have to adapt the tests for ArPLS and AirPLS then because the different initialisation will change the results. Thus, the tests will fail because they just reference to the result of the very initial implementation (at least that's my guess).

Any feedback is highly welcome, but I guess when the draft pull request is submitted, things will get clearer 😁

Hello, Since I took over from my account @IruNikZe, I want to give an update here. So far I

unified the implementations of WhittakerSmooth, ArPls, and AirPls
replaced the sparse solvers and even all sparse matrices which come with an incredible performance penalty due to the initialisation overhead (magnitudes longer than what an optimized solver takes). Now, everything is computed in compact dense storage which was a bit more involved in terms of logic but significantly faster
added weights and an automated lambda selection to the WhittakerSmooth. The automated selection is designed modular. Currently, it relies on the log marginal likelihood, but something like (Generalized) Cross-Validation would be a nice addition here, so I left it more in a plug-and-play fashion from an implementation point of view
tested all new functionalities (1 test for the automated lambda yet to be implemented)
parallelized the tests to make them run in a fraction of the original time

None of the changes broke the API. However, it is extended quite a bit for the WhittakerSmooth where lam now allows fixed and automatically fitted smoothing.

Besides that there are some TODOs where we need to improve the numerical stability for large signals, but I will open two separate issues for this: one for the implementation itself and one for extending the packaging by Cython-builds because we will need access to some LAPACK functions that don't have fancy high level wrappers in SciPy and need to be accessed via Cython, namely dgbcon and dpbcon. These functions will allow us to run the Whittaker solver even faster because they offer a sanity check for the results of the current and a faster method. Such a check is currently missing and made me resort to scipy.linalg.solve_banded (current method mentioned) which can be too conservative and thus slow in some cases, but also too instable in other scenarios. So right now, we have a good level of safety which can still be improved by enabling more performance (scipy.linalg.solveh_banded (faster method mentioned)) and also an even more safe fallback.

Now, I still need to

merge the current main into the respective branch
finalise the docstrings and comments (@paucablop I might need some helps regarding the package docs)
change the baseline initialisation in ArPls and AirPls because the first iterations from a zero baseline can be far away from the true baseline which can be estimated better

Since the Pull Request will be quite massive already, I would for now not touch the baseline initialisation (which will require new classes) and open a Pull Request with what is there this week (as a draft due to the package docs).

@paucablop Should we leave the branch open and then do another Pull Request when the new baseline initialisation is finished to make sure everything is updated by the branch linked to this issue?

I wish a great start into the week 👋 😸

I'd already like to add this tiny teaser for the automated smoothing as a result of the tests. I tried to mimic two peaks with Poisson-noise sitting on a baseline. I drew 10 repetitions of random Poisson noise and used the new weighting functionality to show that regions with high noise (lower weights) are smoothed stronger and vice versa 😸

WhittakerLambdaAutomatedSmoothing

Even though the smoothing itself is repeated quite often to find the optimum lambda (like 40 times or so), this still runs in less than 5 milliseconds on my relatively old machine 🏃

@paucablop In case you already want to see the new functionalities of WhittakerSmooth, you can try it out in this notebook:

whittaker_show.txt

After you pulled the current version of the issue/feature branch, all you need to do is:

replace the .txt-file-ending of the notebook by .ipynb
start the notebook in an environment that has jupyter installed (Python >= 3.9)
select that environment as the kernel
replace the inital path by the path where chemotools is installed on your machine (keep the \ on Windows in mind)

Then, you can go through the notebook. It has short documentations as Markdown cells whenever required. It also provides you timings and actually I need to correct what I said. The automated smoothing takes 10 ms on my machine 🙃 I tuned to optimiser (first brute force than finish by a targetting optimiser) to make sure it does not miss the global optimum, but I guess this can still be improved in terms of number of evaluations, so we did not reach the limit here 😬

Note: the docstring of WhittakerSmooth does not yet cover the automated lambda fitting. You can define a search space and an optimisation criterion by setting lam = (lower_bound, upper_bound, criterion), i.e., a tuple where the first two elements define the search space and the third element says which criterion should be optimized. Currently, the only criterion is "logml" for the log marginal likelihood. It is very powerful (according to the literature it seems more favourable than cross-validation) and also used by sklearns GaussianProcessRegressor.

I timed the implementation against this Rust-based package. I found that:

on a single smooth level, we are slower (450 µs in the notebook vs. 60 µs)
however, for automatic lambda-fitting, we are faster (10 ms µs in the notebook vs 30 ms)

I was not really able to reproduce our results though because the weights seem to be defined in a different manner (from 0 to 1 rather than variance-based).

So, I guess we are on a good way here. In the future, this could be further improved by moving the main logic to Cython or also Rust. Yet, we are sklearn-compatible and we use partially pivoted $LU$-decomposition, which should be more numerically stable than the $LDL^{T}$-decomposition used by Rust-package.

Hi @MothNik

I went through the branch, and it looks really good, very nice job, and also very well documented :smile: Super exciting to start testing it out. Also thanks a lot for enabling a nice testing environment. I really like the auto lambda functionality :star_struck:! I have started testing it out on my end (I could not wait :stuck_out_tongue_closed_eyes: ) and will keep doing so during this week (I have so many use cases I want to test the functions :yum:).

I think that it would be nice to start the PR process when you are ready. I could give you a hand with the documentation (eventually, I would like to change the documentation site, but I think for now we keep it as it is).

I usually like to keep the branches short, but since this would be a very related development, I think it is a good idea to leave the branch open and do another PR later on.

I am very much looking forward to the PR, and once again, thank you very much for your brilliant contribution!

Hi @paucablop 👋

You said you had many use cases, but the automatic smoothing requries weights and I was not sure whether you can get weights easily. Usually one uses the reciprocal of the standard deviations from replicate measurements, but I don't know if you have them at hand. This made me think 🧠 🏋️ that more generally, probably not all users can provide them and limiting the automated smoothing to this scenario seems a bit "discriminating" to me 👀

Therefore, I added a utility function to estimate the standard deviation in a data-driven fashion from the signal alone. I adapted the so-called DER SNR-metric, but spiced it up a little 🌶️ to not only enable global but also local noise estimation for signals where the noise level is not constant 🔉🔊🔉. It's called estimate_noise_stddev which can now be imported from both chemotools.utils and chemotools.smooth. Besides, the computation is straightforward and relies on some really good NumPy- and SciPy-functions and is thus blazingly fast ⚡

The results for the test signal look promising: Noise Estimation All_Noise_Level_Estimation Automated smoothing with these estimates All_Noise_Level_Smoothing

The function has a docstring that reminds more of a SciPy-function, but I tried to keep the API as open as possible because it's the first implementation of such a feature. It was quite a hazzle to test it because it's hard to track the algorithm step by step. I resorted to Calc (Open Excel; the sheet can be found here) where I did all the calculations semi-automatically for multiple settings step by step and saved them for the use as reference fixtures 😅 But now, it can finally be presented to the world 🏟️

However, you can get really far with the defaults already. I demonstrated it in Section 7 of the new, improved playground notebook

whittaker_show_v2_noise_estim.txt

The setup is the same as before

After you pulled the current version of the issue/feature branch, all you need to do is:

replace the .txt-file-ending of the notebook by .ipynb

start the notebook in an environment that has jupyter installed (Python >= 3.9)

select that environment as the kernel

replace the inital path by the path where chemotools is installed on your machine (keep the \ on Windows in mind)

I will finish the final timings of the improvements end of this week, but otherwise I'm ready for the Pull Request.

Best regards

@paucablop After I've seen #56, I renamed the window_size of the function to window_length as for SciPy's savgol_filter. Here is an updated notebook that works with the current branch version:

whittaker_show_v3_noise_estim.txt

Edit Actually, I'm not sure if window_size wasn't better? The function relies on scipy.ndimage.median_filter which has size as the respective argument.

@paucablop Do you have any strong opinion on this? I can easily revert the last commit I did.

@MothNik Just saw the PR, very exciting!!

I have been checking the three functions with some of use-cases, and it is working really good, it is significantly faster, on the data I used to benchmark these two versions, I could see around an order of magnitude improvement 🚀🚀, and numerical stability (spectra of around 3500 datapoints) 🦾🦾. Very good job!!!

I indeed have the opportunity to estimate the weights from the inversed standard deviation of several measurements, but I know it is not always the case, so I am supper happy you suggested a method to do so, I did not know this method before, but after reading the attached paper you included, it seems like a very smart idea, and brilliant addition to the methods 💡💡Also, having the function estimate_noise_stddev accessible from utils makes it quite nice to have, I think there can also for other interesting use cases for it, next week I would like to spend some more time playing with it. Thanks for being so diligent on the documentation and referencing of the functions! 🤓🤓

Regarding the windows_size, windows_length, I have no strong opinions, I think that it makes sense calling it windows_size as you also point out it is also used within scipy!

I will start jump to the PR right away!!! 😄😄 Thanks again for the great job @MothNik !!

@paucablop You're welcome! 😸 I will quickly rename it back to window_size then 💪

@paucablop Renamed and tested 👍

paucablop / chemotools