scipy / scipy

SciPy library main repository
https://scipy.org
BSD 3-Clause "New" or "Revised" License
13.08k stars 5.19k forks source link

ENH: optimize `LbfgsInvHessProduct.todense()`, 10x speed improvement for size 1000 #18105

Open Gattocrucco opened 1 year ago

Gattocrucco commented 1 year ago

Is your feature request related to a problem? Please describe.

Using optimize.minimize(..., method='l-bfgs-b').inv_hess.todense() (code) I noticed its implementation does $O(n^3)$ operations when it could be $O(n^2)$. I made a few tries at optimizing it. The methods I considered are:

1) modifying LbfgsInvHessProduct.todense to avoid matrix by matrix multiplications 2) implementing LbfgsInvHessProduct._matmat and doing obj.matmat(eye(*obj.shape)) 3) using the stock implementation of LbfgsInvHessProduct._matmat

Code and benchmark:

from scipy import optimize
import numpy as np

sk, yk = np.random.randn(2, 10, 1000)
invh = optimize.LbfgsInvHessProduct(sk, yk)

def todense(self):
    """ Rewrite LbfgsInvHessProduct.todense avoiding matrix @ matrix """
    s, y, n_corrs, rho = self.sk, self.yk, self.n_corrs, self.rho
    Hk = np.eye(*self.shape, dtype=self.dtype)

    for i in range(n_corrs):
        # H  <--  (I - rho sy') H (I - rho ys') + rho ss' =
        #       = H - rho Hys' - rho sy'H + rho^2 sy'Hys' + rho ss' =
        #       = H - rho Hys' - rho sy'H + rho(1 + rho y'Hy) ss'
        rhoHys = -rho[i] * np.outer(Hk @ y[i], s[i])
        Hk += rho[i] * (1 + rho[i] * (y[i] @ Hk @ y[i])) * np.outer(s[i], s[i])
        Hk += rhoHys
        Hk += rhoHys.T

    return Hk

def _matmat(self, X):
    """ Rewrite LbfgsInvHessProduct._matvec broadcasting on columns of X """
    s, y, n_corrs, rho = self.sk, self.yk, self.n_corrs, self.rho
    Q = np.array(X, dtype=self.dtype, copy=True)
    assert Q.ndim == 2

    alpha = np.empty((n_corrs, Q.shape[1]))

    for i in range(n_corrs-1, -1, -1):
        alpha[i] = rho[i] * np.dot(s[i], Q)
        Q = Q - alpha[i]*y[i][:, np.newaxis]

    R = Q
    for i in range(n_corrs):
        beta = rho[i] * np.dot(y[i], R)
        R = R + s[i][:, np.newaxis] * (alpha[i] - beta)

    return R

inv1 = invh.todense()
inv2 = todense(invh)
inv3 = _matmat(invh, np.eye(*invh.shape))
np.testing.assert_allclose(inv1, inv2, atol=0, rtol=1e-5)
np.testing.assert_allclose(inv1, inv3, atol=0, rtol=1e-5)

%timeit -r1 invh.todense()
%timeit -r1 todense(invh)
%timeit -r1 _matmat(invh, np.eye(*invh.shape))
%timeit -r1 invh.matmat(np.eye(*invh.shape))

Benchmark result with n = 1000:

583 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
109 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 10 loops each)
48.4 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 10 loops each)
52.5 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 10 loops each)

With n = 10:

76.5 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 10,000 loops each)
77.2 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 10,000 loops each)
56.1 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 10,000 loops each)
316 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1,000 loops each)

With n = 10000:

3min 10s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
8.08 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
3.99 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
3.82 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

Conclusions: the direct rewrite of todense is slower than using the matvec operation, and the existing matmat is slower than the reimplementation for small sizes because it involves a python loop (code).

I can do a PR for this if interested.

Describe the solution you'd like.

No response

Describe alternatives you've considered.

No response

Additional context (e.g. screenshots, GIFs)

No response

dschmitz89 commented 1 year ago

Sorry for the long response time. I don't understand the details but a speedup is definitely welcome!

Gattocrucco commented 1 year ago

Ok, I'll open a pull request then

ogencoglu commented 8 months ago

Can this be closed now @Gattocrucco ?

Gattocrucco commented 8 months ago

Duh this slipped my from radar.

No I haven't made the PR, and I see the code on main is still the old unoptimized one.