duchesnay opened this issue 7 years ago
This is a great idea! I think we should use it!
I have some comments: I think we should add a parameter (e.g., a bool `use_cache`) to the constructor of the estimator and of the loss function that will use this, so that it can be turned off by those who know what it means to turn it off.
We can add a TODO comment near the sha1 computation so that we can think about clever ways to speed up the hash computation. For instance, it might be unnecessarily slow to recompute the sha1 hash of `X` every time if we know that only the hash of `beta` has changed.
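For illustration only, here is a minimal sketch of how a `use_cache` flag and the sha1-keyed check could fit together. `CachedLoss`, `_sha1` and `_compute_f` are hypothetical names, not the actual parsimony API; the TODO marks the spot where rehashing `X` on every call would be wasteful, since only `beta` changes between iterations.

```python
import hashlib

import numpy as np


def _sha1(arr):
    """Hash the raw bytes of an array (slow for a large X)."""
    return hashlib.sha1(np.ascontiguousarray(arr)).hexdigest()


class CachedLoss(object):
    """Hypothetical loss with a one-entry cache keyed by sha1 hashes."""

    def __init__(self, X, y, use_cache=True):
        self.X = X
        self.y = y
        self.use_cache = use_cache
        # TODO: X is hashed once here; rehashing it at every call would be
        # unnecessarily slow since only beta changes between iterations.
        self._X_sha1 = _sha1(X)
        self._last_key = None
        self._last_value = None

    def _compute_f(self, beta):
        residual = np.dot(self.X, beta) - self.y
        return 0.5 * np.dot(residual.T, residual)

    def f(self, beta):
        if not self.use_cache:  # opt-out for those who know what it means
            return self._compute_f(beta)
        key = (self._X_sha1, _sha1(beta))
        if key != self._last_key:
            self._last_key = key
            self._last_value = self._compute_f(beta)
        return self._last_value
```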
There is also another really simple solution to this problem: in `functions.properties.Gradient`, there is a non-abstract function `f_grad` that can be used instead of calling `grad` and `f` separately. This means that `np.dot(X, beta)` can be computed only once in `f_grad`, instead of twice as when calling first `grad` and then `f`. This requires a minor change in the algorithms, though: check once whether `f_grad` is implemented and, if it is, use it whenever we need to compute both `f` and `grad` simultaneously. This is trivial, but requires some extra logic in every algorithm.
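A sketch of that idea, assuming a `Gradient`-like base class with a non-abstract default `f_grad`; the `SquaredLoss` class and the `has_own_f_grad` helper are made up for illustration, and the real `functions.properties.Gradient` interface may differ in details.

```python
import numpy as np


class Gradient(object):
    """Base class sketch: f_grad has a non-abstract default that simply
    calls f and grad separately."""

    def f(self, beta):
        raise NotImplementedError

    def grad(self, beta):
        raise NotImplementedError

    def f_grad(self, beta):
        # Default: two passes, so np.dot(X, beta) is computed twice.
        return self.f(beta), self.grad(beta)


class SquaredLoss(Gradient):
    def __init__(self, X, y):
        self.X, self.y = X, y

    def f(self, beta):
        r = np.dot(self.X, beta) - self.y
        return 0.5 * np.dot(r.T, r)

    def grad(self, beta):
        return np.dot(self.X.T, np.dot(self.X, beta) - self.y)

    def f_grad(self, beta):
        # Override: np.dot(X, beta) is computed once for both values.
        r = np.dot(self.X, beta) - self.y
        return 0.5 * np.dot(r.T, r), np.dot(self.X.T, r)


def has_own_f_grad(function):
    """Algorithm side: check once whether f_grad is overridden, then use
    it whenever both f and grad are needed in the same iteration."""
    return type(function).f_grad is not Gradient.f_grad
```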
+1 for `use_cache`.
The `checkargs` argument indicates which arguments should be checked (with a sha1 hash), to avoid unnecessarily slow checks of the other arguments, such as `X`.
Using `f_grad` instead of `f` + `grad` is interesting. However, we still have `np.dot(X, beta)` in the gap... Moreover, this will impact the logic in every algorithm...
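A hedged sketch of what a `checkargs`-driven memoization could look like; the `memoize_last` decorator and its keyword-only calling convention are assumptions for illustration, not the decorator actually used here.

```python
import functools
import hashlib

import numpy as np


def memoize_last(checkargs):
    """Cache the last result, invalidated only when the sha1 of the
    arguments named in checkargs changes; other arguments (such as a
    large X held by the object) are deliberately never hashed."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(self, **kwargs):  # arguments passed by keyword in this sketch
            key = tuple(hashlib.sha1(np.ascontiguousarray(kwargs[a])).hexdigest()
                        for a in checkargs)
            attr = "_memo_" + func.__name__
            memo = getattr(self, attr, None)
            if memo is None or memo[0] != key:
                memo = (key, func(self, **kwargs))
                setattr(self, attr, memo)
            return memo[1]
        return wrapper
    return decorator


class Loss(object):
    def __init__(self, X, y):
        self.X, self.y = X, y

    @memoize_last(checkargs=["beta"])  # only beta is hashed, never X
    def f(self, beta):
        r = np.dot(self.X, beta) - self.y
        return 0.5 * np.dot(r.T, r)
```

In this sketch the cached method must be called with keyword arguments, e.g. `loss.f(beta=beta)`, so that only `beta` is hashed and the large `X` is never touched.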
80% of the computation time is spent in the `np.dot(X, beta)` of the loss, and it is done twice within the same iteration: once in the gradient and once in the gap. We could cache this result. Memory-cache solutions such as joblib exist (the `Memory` class), but they provide complex and unnecessary features (storing on disk, long-term memoizing) that would slow down performance. We only need a "short-term" memory, i.e. a one-step-back cache. Here are my proposed specifications and the pro arguments:

- only the result of `np.dot(X, beta)` needs to be cached;
- the cache will be instantiated by the estimators (`LinearRegressionL1L2TV`, etc.) and the cache instance will be given to the loss;
- then, only the few lines where `np.dot(X, beta)` is computed will be touched.

Below is an example of implementation.