scikit-learn-contrib / lightning

Large-scale linear classification, regression and ranking in Python
https://contrib.scikit-learn.org/lightning/
1.73k stars 214 forks source link

[WIP] Adding prox capability to SAGA. #38

Closed zermelozf closed 9 years ago

zermelozf commented 9 years ago

Continuing #37 after discussing with @fabianp. Added prox capability in _sag_fit of file lightning/impl/sag_fast.pyx where @fabianp left room for it.

The proximity operator is currently specified when a classifier/regressor is built with the prox keyword (type ProxFunction mimicking LossFunction in lightning/impl/sgd_fast.pyx). Not sure this is the best way to specify it by default...

Notes prox implementation breaks sparse updates and the code is excruciatingly slow on sklearn.datasets.fetch_20newsgroups_vectorized (cf. this gist)

fabianp commented 9 years ago

Thanks @zermelozf!. I'll take a closer look into this tomorrow.

mblondel commented 9 years ago

If possible, I would like to support elastic-net, since it's also supported in SDCA and Adagrad.

fabianp commented 9 years ago

@mblondel for our usercase it is important to support other non-smooth penalties besides L1, such as group lasso. Ideally it should be possible to pass our own instance of the penalty. Hence I would suggest an API that is more similar to CDClassifier than to SDCAClassifier, which only supports elastic-net penaties.

More precisely, I would like to support objectives of the form

        minimize_w  1 / n_samples * \sum_i loss(w^T x_i, y_i)
                    + alpha * 0.5 * ||w||^2_2 + beta * penalty(w)

which is the same as SAG with with the added term beta * penalty(w).

Hence from an API point of view I think we should add the following keywords to specify the regularization:

Also, right now we are defining our prox by a custom class cdef class ProxFunction but we could reuse the ones in lightning/lightning/impl/penalty.py if you are fine moving those definitions to Cython.

What do you think?

mblondel commented 9 years ago

Yep, I'm OK with your proposed approach. For the lazy updates, it might be possible to add "lazy_update" and "finalize" methods to the prox classes. For now, I would keep the classes in sag_fast.pyx and we can think of code factoring later.

zermelozf commented 9 years ago

@fabianp @mblondel, I just updated the code to be consistant with the proposed API. If the API is OK, I'll now add a bit more testing.

Note I also moved _get_loss() and _get_penalty() to _BaseSAG to avoid repetition in lightning/impl/sag_fast.pyx. Tell me if this should be reverted for semantic reasons.

fabianp commented 9 years ago

I would leave penalty to mean the non-smooth penalty term (so None by default). This way alpha corresponds always to L2 regularization and beta to the non-smooth term, as in the formulation above, and so that it coincides with SAG for penalty=None.

Note also that "l1/l2" stands for group lasso and not for elastic net (elastic net would be in this case just nonzero alpha and penalty='l1').

mblondel commented 9 years ago

It would be nice to have a naive pure Python implementation of SAGA with prox in test_saga.py. That is implement SAGA in the most straightforward way without any optimization trick (lazy update, w_scale, etc). Then compare it with the Cython implementation on a small sparse dataset (20 samples). I always use this implementation technique in my latest development and this sometimes help uncover bugs.

zermelozf commented 9 years ago

@mblondel, yes that's a good idea. I'll add it for the tests.

fabianp commented 9 years ago

you rock @zermelozf!

zermelozf commented 9 years ago

I added PySAGAClassifier to lightning/impl/tests/test_sag.py and played a bit adding tests. Four of the tests I added fail. I'm investigating this.

Interestingly test_saga_coef_equality() seems to be working. Meaning that without regularization (i.e. alpha = 0) PySAGClassifier and SAGClassifier are equivalent.

@fabianp In line 161 (sag_fast.pyx) eta_avg used. Shouldn't this be eta instead for SAGA? I tried changing this but then the classifier performs much worse... I must not understand what happens correctly.

mblondel commented 9 years ago

@zermelozf For L2 regularization, do not use the proximity operator, rather incorporate the regularizer directly in the gradient since this is the way used by the Cython implementation. The two ways are equivalent asymptotically but not step by step.

fabianp commented 9 years ago

@zermelozf yes, I think too it should be eta. Your Python code looks good to me (besides the l2 regularization that Mathieu commented).

I've tried to debug the failure in test_saga_coef_equality but couldn't find the source of the problem. I suspect it is coming from some error in the second call to _lagged_updates (the one that is saga specific) ... the crazy thing is that on the 3 first steps the updates are the same and then for some reason they start diverging ...

BTW in the tests you should be comparing pysaga.coef vs (saga.coef * saga.coefscale) , although that does not fix the test failure ...

zermelozf commented 9 years ago

Hooray! With a small modification of the update in SAG and the direct L2^2 update the Python and Cython codes are now equivalent!

I still have one test not passing for L1 regularization that makes me crazy but it shouldn't be a big deal.

The sad thing is that most of the optimisation tricks are broken with L1 reg. (Until we adapt them that is :).

fabianp commented 9 years ago

YEAH

fabianp commented 9 years ago

It should be possible to use identities relating prox(scale * x) to prox(x). See e.g. 2.2 from Proximal Algorithms .

zermelozf commented 9 years ago

Next step is to do sparse/jit updates for L1 and L1+L2^2. This is not going to be easy...

fabianp commented 9 years ago

My ideas was to delegate the scaling to the Penalty obj so that the code is still correct when f does not verify the homogeneous property. More precisely, Penalty.projection would take the w_scale as argument and deal with it as it can. Homogeneous functions would use the formula you outline but others could implement their own strategy.

zermelozf commented 9 years ago

I added just in time update for the L1 penalty. The API may not be perfect but the tests pass.

@fabianp, I also added w_scale as an argument to the prox. The signature of the function is a bit weird now since it's not used anymore but it may be useful for other proximity operators as you suggested.

fabianp commented 9 years ago

Yes, I had not thought that w_scale can be obtained from the scale_cum, which makes it redundant. Maybe its not needed after all (it would be nice however to document what scale_cum is ...).

Something is still not quite right since the test test_l1_regularized_saga does not pass when I modify beta to be something bigger than 0.1 (e.g. beta=1.)

zermelozf commented 9 years ago

Yes, you're right. I used t instead of t+1 in the lagged update. It should work now.

fabianp commented 9 years ago

Oh, one thing that we haven't checked is the stopping criterion (the violation constant). To me it looks like this would only be valid for non-composite functions (i.e. for SAG but not for general SAGA).

fabianp commented 9 years ago

As I see it, what is currently computed in violation is the squared norm of the gradient of the smooth part of the objective function. I think that for SAGA with a general prox, this should be replaced it with the squared norm of x - prox_{stepsize * beta}(x - stepsize * g_sum) (i.e. the squared norm of the gradient map). What do you think @mblondel ?

mblondel commented 9 years ago

Indeed, this is one possibility.

zermelozf commented 9 years ago

I've added a test to see if w is close to a fixed point of the prox. I left the norm of the gradient in case there is not prox provided. In both cases I reuse the violation_ratio. @fabianp, does it seem correct to you?

fabianp commented 9 years ago

Also, I would do the distinction of cases if penalty is not None: at the top, i.e. before entering the iteration for j in xrange(n_features): . This should make it more readable and you can preserve the two-liners that are currently used for the case in which penalty=None.

Also, to preserve the memory footprint for SAG, you should allocate the array w_violation_ and last_penalty_ only when penalty is not None.

zermelozf commented 9 years ago

Yes, I've updated the code with the step size and scaling correction. It seems to work now on the fetch_20newsgroups_vectorized in just a bit more time than SAG.

zermelozf commented 9 years ago

FYI, this gist comparing sparse and dense updates on the newsgroup dataset ((1000, 130107) sparse matrix) gives the following results:

SAGAClassifier

Sparse Dense
time (sec.) 1.49 393.5
score 0.232 0.236

SAGClassifier

Sparse Dense
time (sec.) 0.39 81.9
score 0.216 0.212
fabianp commented 9 years ago

Hey,

the stopping criteria now looks good to me. Thanks for the benchmarks, the comparison importance of sparse lagged updates is clear. The main difference between SAG and SAGA is because of the evaluation of the prox, setting prox=None makes it go from 1.5s to 0.66s (getting very close to SAG which is 0.44 for me). I think that the default should be prox=None since anyway beta=0. by default , (hence the prox is just slowing down the algorithm).

zermelozf commented 9 years ago

I modified the SAG constructor and put prox=None by default for SAGA. The whole (squashed) commit is in #40.

mblondel commented 9 years ago

The fact that the scores differ in the dense and sparse cases is not a good sign. They should be exactly the same. Could you set random_state=0 in SAGClassifier and SAGAClassifier and retry? By the way, lightning has a benchmarks/ folder. You can add your script there if you like.

zermelozf commented 9 years ago

@mblondel, indeed re-running with random_state=0 leads to the exact same scores in the sparse and dense cases.

As suggested by @fabianp I'll try to do a proper benchmark with distance to optimum per iteration/time soon and add it to the benchmark/ folder.