scikit-learn-contrib / lightning

Large-scale linear classification, regression and ranking in Python
https://contrib.scikit-learn.org/lightning/

FIX: correct step-size for strongly convex functions #67

Closed fabianp closed 8 years ago

fabianp commented 8 years ago

This is the same fix as PR #53. I'm re-opening it because I think it is in fact correct.

Our formulation of SAG(A) is

    minimize_w  1 / n_samples * \sum_i loss(w^T x_i, y_i)
                + alpha * 0.5 * ||w||^2_2

To put everything into the sum, we need to multiply alpha by n, yielding

    minimize_w  1 / n_samples * (\sum_i loss(w^T x_i, y_i)
                + n_samples * alpha * 0.5 * ||w||^2_2)

and so the Lipschitz constant should be L + n * alpha. This is consistent with https://hal.inria.fr/hal-00860051/document, page 21 ("Section 4.5 ...").
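This can be checked numerically for the squared loss, where the per-sample Hessian is available in closed form. The snippet below is a toy verification, not lightning code: with the regularizer folded into each f_i, the Hessian of f_i is x_i x_i^T + n_samples * alpha * I, whose largest eigenvalue is ||x_i||^2 + n_samples * alpha.

```python
import numpy as np

# Toy check (squared loss): the Hessian of
#   f_i(w) = 0.5 * (w @ x - y)**2 + n_samples * alpha * 0.5 * (w @ w)
# is x x^T + n_samples * alpha * I, so its largest eigenvalue -- the
# gradient Lipschitz constant of f_i -- is ||x||^2 + n_samples * alpha.
x = np.array([3.0, 4.0])        # ||x||^2 = 25
n_samples, alpha = 10, 0.1      # n_samples * alpha = 1
hessian = np.outer(x, x) + n_samples * alpha * np.eye(2)
L_i = np.linalg.eigvalsh(hessian).max()
print(L_i)  # 26.0, i.e. ||x||^2 + n_samples * alpha
```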

fabianp commented 8 years ago

@mblondel it would be good to have this in the release, as I believe it does fix a bug.

mblondel commented 8 years ago

Personally, I get the following formula:

    inv_stepsize = 1 / n_samples * lipschitz_constant * L + alpha
    stepsize = 1.0 / inv_stepsize

where lipschitz_constant is 1 for the squared loss, 0.25 for the log loss and 2 for the squared hinge. Trying it for specific values of n_samples, L and alpha, it seems to lead to larger step sizes.

Another remark: the branch for the log loss is confusing because it doesn't follow the same pattern as the other losses. It would be better to define the Lipschitz constant in the conditions and use a common formula at the end. Something like this:

    if loss == "log":
        lipschitz_constant = 0.25
    elif loss == "squared":
        lipschitz_constant = 1.0
    [...]
    step_size = 1.0 / (... lipschitz_constant ...)

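A runnable version of that pattern might look as follows. This is only a sketch: the function name and signature are illustrative, not lightning's actual API; the per-loss constants are the ones quoted above (1, 0.25, 2), and the denominator uses the L + n * alpha correction this PR argues for.

```python
def get_step_size(loss, L, n_samples, alpha):
    """Sketch of the suggested refactor: define the per-loss Lipschitz
    constant in the branches, then apply one common formula at the end.

    L is the maximum squared row norm of X (get_max_squared_sum(X))."""
    if loss == "log":
        lipschitz_constant = 0.25
    elif loss == "squared":
        lipschitz_constant = 1.0
    elif loss == "squared_hinge":
        lipschitz_constant = 2.0
    else:
        raise ValueError("unknown loss: %r" % loss)
    # Common formula, with the regularizer folded into each f_i.
    return 1.0 / (lipschitz_constant * L + n_samples * alpha)
```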
fabianp commented 8 years ago

@mblondel what is L in your formula? Note that the Lipschitz constant should be the maximum of the Lipschitz constants of all the f_i (objective function = (1/n) Sum_i f_i). It cannot go to zero as the number of samples increases (assuming alpha = 0), as it does in your formula.
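The objection can be seen numerically. The function below just transcribes the formula quoted above; with alpha = 0 the inverse step size vanishes as n_samples grows, so the step size itself diverges:

```python
# Transcription of the proposed formula: with alpha = 0, inv_stepsize
# shrinks like 1/n_samples, so it cannot be a valid upper bound on the
# Lipschitz constants of the f_i, which stay bounded away from zero.
def inv_stepsize(n_samples, L, lipschitz_constant=1.0, alpha=0.0):
    return 1.0 / n_samples * lipschitz_constant * L + alpha

for n in (10, 1000, 100000):
    print(n, inv_stepsize(n, L=25.0))  # 2.5, 0.025, 0.00025
```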

mblondel commented 8 years ago

L = get_max_squared_sum(X)

fabianp commented 8 years ago

OK. Anyway, I think your inv_stepsize should be multiplied by n_samples, since (as I said) the inv_stepsize should be the max Lipschitz constant over the f_i.

mblondel commented 8 years ago

I am a bit confused. Maybe L in my formula should be the sum of squared norms instead of the max. Anyway, I don't have enough time to check carefully, so we can go with your formula. In the future, it might be interesting to add the derivation to the docs.

One thing to remember is that there are several ways to derive Lipschitz constants, only some are tighter than others :).
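For instance, on toy data (random matrix, illustrative only), both the max and the sum of squared row norms give valid bounds, but the sum is roughly n_samples times looser, which translates into a much smaller step size:

```python
import numpy as np

# Two valid Lipschitz-style bounds from the same data: the max squared
# row norm (tight, per-sample) vs. the sum (valid but much looser).
X = np.random.RandomState(0).randn(100, 5)
squared_norms = np.sum(X ** 2, axis=1)
print(squared_norms.max())  # tight bound
print(squared_norms.sum())  # looser bound, ~n_samples times larger
```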

It would still be interesting to address my second remark.

fabianp commented 8 years ago

I force-pushed a version with the reworked logic here ... I agree to take the fall for this if it is again wrong :-)