rmclarke / SeriesOfHessianVectorProducts

This repository contains code for the paper Series of Hessian-Vector Products for Saddle-Free Newton Optimisation of Neural Networks, submitted to TMLR.

Algorithm 1 implementation #1

Closed iamgroot42 closed 4 months ago

iamgroot42 commented 6 months ago

Hello,

First off, thanks a lot for such well-documented and neat code. I am extending this to run in PyTorch and ran into no hiccups at any point, thanks to the well-written code and clearly explained algorithms in the paper!

I implemented and tested this iHVP method on a simple 2-layer MLP and am comparing it with the exact iHVP to see how close the result is, but it seems to be quite far off, even when run for many (1000) iterations.

For reference, here is my implementation in PyTorch:

import numpy as np
import torch as ch


def compute_epsilon_acceleration(
    source_sequence,
    num_applications: int = 1,
):

    def inverse(vector):
        # Samelson inverse
        return vector / vector.dot(vector)

    epsilon = {}
    for m, source_m in enumerate(source_sequence):
        epsilon[m, 0] = source_m
        epsilon[m + 1, -1] = 0

    s = 1
    m = (len(source_sequence) - 1) - 2 * num_applications
    initial_m = m
    while m < len(source_sequence) - 1:
        while m >= initial_m:
            # Sablonniere modifier
            inverse_scaling = np.floor(s / 2) + 1

            epsilon[m, s] = epsilon[m + 1, s - 2] + inverse_scaling * inverse(
                epsilon[m + 1, s - 1] - epsilon[m, s - 1]
            )
            epsilon.pop((m + 1, s - 2))
            m -= 1
            s += 1
        m += 1
        s -= 1
        epsilon.pop((m, s - 1))
        m = initial_m + s
        s = 1

    return epsilon[initial_m, 2 * num_applications]

@ch.no_grad()
def hso_hvp(vec,
            hvp_module,
            acceleration_order: int,
            initial_scale_factor: float = 100,
            num_update_steps: int = 20,):

    # Detach and clone input
    vector_cache = vec.detach().clone()
    update_sum = vec.detach().clone()

    cached_update_sums = []
    if acceleration_order > 0 and num_update_steps == 2 * acceleration_order + 1:
        # Clone so that later in-place updates to update_sum do not
        # retroactively modify this cached entry
        cached_update_sums.append(update_sum.clone())

    # Do HessianSeries calculation
    for update_step in range(1, num_update_steps):
        # The cache computed here is H^2 applied to the current vector_cache;
        # when update_step == 1 this is just H^2 g for the original vector
        hessian2_vector_cache = hvp_module.hvp(hvp_module.hvp(vector_cache))
        # So on the first step we can also use it to set the scale factor V
        if update_step == 1:
            scale_factor = ch.norm(hessian2_vector_cache, p=2) / ch.norm(vec, p=2)
            scale_factor = max(scale_factor.item(), initial_scale_factor)

        vector_cache = (vector_cache
                        - (1 / scale_factor) * hessian2_vector_cache)
        # Ratio of successive coefficients in the binomial series, (2k-1)/(2k)
        scaling_factor = (2 * update_step - 1) / (2 * update_step)
        vector_cache *= scaling_factor
        update_sum += vector_cache

        if acceleration_order > 0 and update_step >= (num_update_steps - 2 * acceleration_order - 1):
            cached_update_sums.append(update_sum.clone())

    update_sum /= np.sqrt(scale_factor)

    # Perform series acceleration (Shanks acceleration)
    if acceleration_order > 0:
        accelerated_sum = compute_epsilon_acceleration(
            cached_update_sums, num_applications=acceleration_order
        )
        accelerated_sum /= np.sqrt(scale_factor)
        return accelerated_sum

    return update_sum

Here, hvp_module.hvp is a utility function that returns the HVP (and is verifiably correct; I tested it independently against the exact Hessian as well). I also tried replacing the actual Hessian with an artificially constructed positive semi-definite matrix, but even then the computed answer is off by a lot.

On that note, the implementation of Algorithm 2 correctly uses epsilon[m+1, s-2] https://github.com/rmclarke/SeriesOfHessianVectorProducts/blob/21db435dd7d6f439b5c5174fe72d9916db597e75/util.py#L131 but the corresponding index in the paper's Algorithm 2 looks incorrect.
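
For completeness, this is roughly the harness I'm using for the comparison. HVPModule and the tiny model below are just my own test scaffolding (the names are illustrative, not from this repo), but they match the hvp_module interface that hso_hvp expects:

import torch as ch

class HVPModule:
    # Stand-in for my wrapper: exact HVPs of a scalar loss w.r.t. the
    # flattened parameters, computed via double backward
    def __init__(self, model, loss_fn, data):
        self.model = model
        self.loss_fn = loss_fn
        self.data = data
        self.params = [p for p in model.parameters() if p.requires_grad]

    def hvp(self, vec):
        x, y = self.data
        # Re-enable grad because hso_hvp runs under no_grad
        with ch.enable_grad():
            loss = self.loss_fn(self.model(x), y)
            grads = ch.autograd.grad(loss, self.params, create_graph=True)
            flat_grad = ch.cat([g.reshape(-1) for g in grads])
            hvps = ch.autograd.grad(flat_grad, self.params, grad_outputs=vec)
        return ch.cat([h.reshape(-1) for h in hvps]).detach()

# Tiny MLP so the exact Hessian is cheap to materialise
model = ch.nn.Sequential(ch.nn.Linear(5, 8), ch.nn.Tanh(), ch.nn.Linear(8, 1))
x, y = ch.randn(32, 5), ch.randn(32, 1)
hvp_module = HVPModule(model, ch.nn.MSELoss(), (x, y))

num_params = sum(p.numel() for p in model.parameters())
H = ch.stack([hvp_module.hvp(col) for col in ch.eye(num_params)])

vec = ch.randn(num_params)
evals, evecs = ch.linalg.eigh(H)
# Exact |H|^{-1} v (itself very sensitive when eigenvalues are tiny)
exact = evecs @ ((evecs.T @ vec) / evals.abs())
approx = hso_hvp(vec, hvp_module, acceleration_order=2)
print(ch.norm(exact - approx) / ch.norm(exact))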

iamgroot42 commented 6 months ago

From what I can conclude, there isn't anything wrong with the implementation itself; the problem is just how the Hessian's eigenvalues are distributed. In most cases I tried, the condition number of the squared Hessian is really large, which basically means many eigenvalues of (I - H^2/V) stay really close to 1 (and, with limited precision, end up indistinguishable from 1). This isn't fixed by choosing an arbitrarily large V or a "tight" V (computed from the exact Hessian), since the identity term still contributes a 1 in the directions where the corresponding eigenvalues are exceedingly small.
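
To make that concrete with a purely scalar sketch (just my own illustration): in the eigenbasis, the k-th term of the series is c_k * (1 - lambda^2/V)^k with c_k = prod_j (2j-1)/(2j), so whenever lambda^2 << V those terms decay extremely slowly and the truncated sum, divided by sqrt(V), badly underestimates the target 1/|lambda|:

import numpy as np

def truncated_series(lam, V, num_steps):
    # Scalar analogue of the hso_hvp update loop: approximates 1/|lam|
    term, total = 1.0, 1.0
    for k in range(1, num_steps):
        term *= (1 - lam ** 2 / V) * (2 * k - 1) / (2 * k)
        total += term
    return total / np.sqrt(V)

V = 100.0
for lam in (5.0, 1.0, 0.1, 0.01):
    print(f"lambda={lam:5.2f}  "
          f"series={truncated_series(lam, V, 1000):8.3f}  "
          f"exact={1 / abs(lam):8.3f}")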

rmclarke commented 5 months ago

Hi - thanks for getting in touch, and sorry we didn't get back to you sooner!

Thanks for flagging the typo in Algorithm 2 in our paper - we'll get that fixed.

We also noticed a substantial gap between the power-series approach and the true inverse-Hessian product, which matches our empirical results in Figure 10, where this convergence is quite slow even in a lower-dimensional problem. The ill-conditioning you describe seems plausible, since we're aware of results showing that very small eigenvalues become significantly more prevalent in higher dimensions. If memory serves, we briefly investigated a Levenberg-Marquardt-like damping coefficient to mitigate this problem, but couldn't immediately derive a series incorporating it.
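
(Just to illustrate what we were hoping damping would buy us, at the level of explicit eigenvalues only; delta below is a hypothetical damping constant, and this is not something we derived a Hessian-vector-product series for:)

import numpy as np

# Purely illustrative: LM-style damping applied directly to a made-up
# spectrum containing some tiny eigenvalues
eigenvalues = np.array([3.0, 1.0, 0.1, 1e-4, -1e-5, -0.5])
delta = 1e-2  # hypothetical damping constant

for lam in eigenvalues:
    undamped = 1.0 / abs(lam)          # blows up as |lambda| -> 0
    damped = 1.0 / (abs(lam) + delta)  # capped at 1/delta
    print(f"lambda={lam:10.1e}  1/|lambda|={undamped:10.2e}  damped={damped:10.2e}")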

Ultimately, our strategy was to design a matrix transformation which predictably changes the Hessian's eigenvalues without having to explicitly decompose the Hessian, then design a (possibly approximate) implementation of that transformation which avoids any explicit representation of the Hessian. So we hope there are better ways of doing that than we've managed here, which would then give better behaviour! But as you say, small eigenvalues are a bit of a problem in the current approach.
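
To illustrate that second part: anything we can write as a truncated power series in H^2 can be applied matrix-free, since (H^2)^k v only needs repeated Hessian-vector products; the series in your implementation is one instance of this pattern, arranged so that each extra term reuses the previous one and costs just two more HVPs. A generic sketch of the idea, with an illustrative helper name:

import torch as ch

def apply_h2_polynomial(hvp, coeffs, vec):
    # Evaluate sum_k coeffs[k] * (H^2)^k v using only Hessian-vector
    # products, never forming H explicitly
    result = coeffs[0] * vec
    power = vec
    for c in coeffs[1:]:
        power = hvp(hvp(power))  # (H^2)^k v from (H^2)^{k-1} v
        result = result + c * power
    return result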

iamgroot42 commented 4 months ago

Thanks for the clarification and your detailed response!