Vectorizing computation to simultaneously solve several minimization problems

duchesnay commented 7 years ago

I realized that we could make an impressive gain of computation time by vectorizing the computation to simultaneously solve several minimization problems. Instead of having on l1, l2 tv scalars values we could provide vectors of length k to simultaneously solve k minimization problems. If we provide l1 = [0.1, 0.9], l2=[0.9, 0.1], tv = [0.2, 0.8], we would solve two problems: one with l1, l2, tv = 0.1, 0.9, 0.2 and one with l1, l2, tv = 0.9, 0.1, 0.8. We would simultaneously estimate k beta vectors and model.predict(X) will then provide k prediction. This would be the only consequence on the global API.

First, this is useful for multinomial classification problems.

Second, such framework should considerably speed up the computation over a grid of parameters. The rationale is the following, most of the computation is spent in the np.dot(X, beta). My hypothesis was that the time-consuming procedure is to load X data int the L3 cache of the processor. I realized that by observing a considerable slowdown of the computation when using several cores of the same processor. Using the 8 cores of the same processor slow down the minimization by a factor of 5! It is a very poor scaling! So if the same X data is used for several problems (beta is p x k array instead of being a p x 1) we could simultaneously solve several problems with a minimum overhead. Test it with the following code:

import time
import matplotlib.pylab as plt

n_features = int(550**2)
n_samples = 200

X = np.random.randn(n_samples, n_features)
beta = np.random.randn(n_features, 1)
y = np.dot(X, beta)

sizes = np.arange(1, 101)
elapsed = np.zeros(sizes.shape[0])

for s in sizes:
    beta = np.random.randn(X.shape[1], s)
    t_ = time.clock()

    # regression loss pattern
    grad = np.dot(X.T, np.dot(X, beta) - y)
    elapsed[s-1] = time.clock() - t_

# plt.plot(sizes, elapsed)
#plt.plot([1, sizes[-1]], [elapsed[0], elapsed[0]*sizes[-1]])

plt.plot(sizes, elapsed / elapsed[0])
plt.xlabel("beta.shape[1] (in X * beta)")
plt.ylabel("Ratio of CPU time: time(X beta) / time(X beta.shape[1]==1) ")

I observed an acceleration factor of 10 (anaconda with MKL single thread). Solving 50 problems beta.shape = [p, 50] is only 5 times slower than solving a single problem.

Most of the code should smoothly move to a vectorized version since beta is a [p x 1] vector and can be easly extend to a [p x k]. See an example of FISTA:

z = betanew + ((i - 2.0) / (i + 1.0)) * (betanew - betaold)
step = function.step(z)

betaold = betanew
betanew = function.prox(z - step * function.grad(z),
                                    step,
                                    eps=1.0 / (float(i) ** (4.0 + consts.FLOAT_EPSILON)),
                                    max_iter=self.max_iter)

Most of change will ocure when testing for convergences: gap_mu + mu * gM < self.eps will become np.min(gap_mu + mu * gM) < self.eps

I suggest to start with FISTA and elasticnet, and evaluated the acceleration.

tomlof commented 7 years ago

This is cool! I agree that likely the required changes will be minimal, while the gain will be maximal.

Three comments:

I guess the convergence criterion should be np.max(gap_mu + mu * gM) < self.eps, so that the algorithm stops when all have converged, instead of when the first beta vector is converged.
I think the parameters for l1, l2 and tv should be allowed to be either ints or lists of ints, so that the current syntax (with ints) still works.
Perhaps we could allow the cartesian product of the lists? So that with l1=[0.1, 0.9] and tv=[0.1, 1.0], instead of computing for l1,tv=0.1,0.1 and l1,tv=0.9,1.0, we compute for l1,tv=0.1,0.1, l1,tv=0.1,1.0, l1,tv=0.9,0.1 and l1,tv=0.9,1.0. This could be an option.

duchesnay commented 7 years ago

Experiment en 302,500 features (image 550x550) has demonstrated an acceleration factor of 4.5 Solving 40 simultaneously problems is only 8.5 times slower.

See: classif_550x550_dataset_enet_1000ite_acceleration.pdf

neurospin / pylearn-parsimony

Vectorizing computation to simultaneously solve several minimization problems #19