Closed zermelozf closed 8 years ago
Thanks @zermelozf! I'll take a closer look at this tomorrow.
If possible, I would like to support elastic-net, since it's also supported in SDCA and Adagrad.
@mblondel for our use case it is important to support other non-smooth penalties besides L1, such as group lasso. Ideally it should be possible to pass our own instance of the penalty. Hence I would suggest an API that is more similar to `CDClassifier` than to `SDCAClassifier`, which only supports elastic-net penalties.
More precisely, I would like to support objectives of the form

```
minimize_w  (1 / n_samples) * \sum_i loss(w^T x_i, y_i)
            + alpha * 0.5 * ||w||_2^2 + beta * penalty(w)
```

which is the same as SAG with the added term `beta * penalty(w)`.
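To make the formulation concrete, here is a direct NumPy evaluation of that composite objective (an illustrative sketch: squared loss and an L1 penalty are assumed as placeholders for `loss` and `penalty`):

```python
import numpy as np

def objective(w, X, y, alpha, beta):
    # Smooth part: average loss over samples plus the squared L2 term,
    # exactly as in the SAG objective; here loss is illustratively squared loss.
    loss = 0.5 * np.mean((X @ w - y) ** 2)
    ridge = 0.5 * alpha * np.dot(w, w)
    # Non-smooth part: beta * penalty(w), illustratively the L1 norm.
    penalty = beta * np.abs(w).sum()
    return loss + ridge + penalty
```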
Hence from an API point of view I think we should add the following keywords to specify the regularization:
- `penalty = {'l1', 'l1/l2', penalty object}`: the non-smooth penalty term (as done in `CDClassifier`; this should replace the `prox` keyword that is currently coded). Ideally it could also support passing a penalty object, to allow user-defined penalties. From an efficiency point of view we might need to distinguish between the L1 penalty (supporting lagged updates) and a generic user-defined penalty.
- `beta = float`: amount of regularization for the non-smooth term.
- `alpha = float`: amount of squared L2 regularization (as in SAG).

Also, right now we are defining our prox by a custom class (`cdef class ProxFunction`), but we could reuse the ones in `lightning/lightning/impl/penalty.py` if you are fine with moving those definitions to Cython.
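A user-defined penalty passed through such an API could look roughly like this sketch (the `projection`/`regularization` method names and signatures are assumptions for illustration, not the actual lightning interface):

```python
import numpy as np

class L1Penalty:
    """Illustrative user-defined non-smooth penalty (hypothetical interface:
    a `projection` method applying the proximity operator, plus a method
    returning the penalty value itself)."""

    def projection(self, w, stepsize, beta):
        # Soft-thresholding: the prox of beta * ||.||_1 with the given stepsize.
        return np.sign(w) * np.maximum(np.abs(w) - stepsize * beta, 0.0)

    def regularization(self, w):
        # Value of the penalty, useful e.g. for objective monitoring.
        return np.abs(w).sum()
```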
What do you think?
Yep, I'm OK with your proposed approach. For the lazy updates, it might be possible to add "lazy_update" and "finalize" methods to the prox classes. For now, I would keep the classes in sag_fast.pyx and we can think of code factoring later.
@fabianp @mblondel, I just updated the code to be consistent with the proposed API. If the API is OK, I'll now add a bit more testing.
Note: I also moved `_get_loss()` and `_get_penalty()` to `_BaseSAG` to avoid repetition in `lightning/impl/sag_fast.pyx`. Tell me if this should be reverted for semantic reasons.
I would leave `penalty` to mean the non-smooth penalty term (so `None` by default). This way `alpha` always corresponds to L2 regularization and `beta` to the non-smooth term, as in the formulation above, so that it coincides with SAG for `penalty=None`.
Note also that "l1/l2" stands for group lasso and not for elastic net (elastic net would be in this case just nonzero alpha and penalty='l1').
It would be nice to have a naive pure Python implementation of SAGA with prox in `test_saga.py`. That is, implement SAGA in the most straightforward way, without any optimization trick (lazy updates, `w_scale`, etc.). Then compare it with the Cython implementation on a small sparse dataset (20 samples). I always use this implementation technique in my latest developments and it sometimes helps uncover bugs.
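The naive reference implementation suggested here could be sketched as follows (illustrative only: squared loss, L1 prox, no lazy updates or `w_scale` tricks; the function name and signature are hypothetical):

```python
import numpy as np

def naive_saga(X, y, alpha, beta, stepsize, n_iter, rng):
    """Straightforward SAGA with an L1 prox: no optimization tricks."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    memory = np.zeros((n_samples, n_features))  # stored per-sample gradients
    g_avg = memory.mean(axis=0)
    for _ in range(n_iter):
        i = rng.randint(n_samples)
        # Gradient of the i-th smooth loss term (squared loss here); the
        # alpha * w ridge term goes directly into the gradient, not the prox.
        g_i = (X[i] @ w - y[i]) * X[i]
        w = w - stepsize * (g_i - memory[i] + g_avg + alpha * w)
        # Prox step for the non-smooth beta * ||w||_1 term (soft-thresholding).
        w = np.sign(w) * np.maximum(np.abs(w) - stepsize * beta, 0.0)
        # Update the gradient table and its running average.
        g_avg += (g_i - memory[i]) / n_samples
        memory[i] = g_i
    return w
```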
@mblondel, yes that's a good idea. I'll add it for the tests.
you rock @zermelozf!
I added `PySAGAClassifier` to `lightning/impl/tests/test_sag.py` and played around a bit adding tests. Four of the tests I added fail. I'm investigating this.
Interestingly, `test_saga_coef_equality()` seems to be working, meaning that without regularization (i.e. `alpha = 0`) `PySAGClassifier` and `SAGClassifier` are equivalent.
@fabianp In line 161 (`sag_fast.pyx`) `eta_avg` is used. Shouldn't this be `eta` instead for SAGA? I tried changing this but then the classifier performs much worse... I must not be understanding what happens correctly.
@zermelozf For L2 regularization, do not use the proximity operator; rather, incorporate the regularizer directly in the gradient, since this is what the Cython implementation does. The two ways are equivalent asymptotically but not step by step.
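The distinction can be seen on a single update step (a small sketch; the prox of the squared L2 term `0.5 * alpha * ||w||^2` with step `stepsize` is the shrinkage `v / (1 + stepsize * alpha)`):

```python
import numpy as np

w = np.array([1.0, -2.0])
g = np.array([0.5, 0.5])   # gradient of the loss part at w (made-up values)
alpha, stepsize = 0.1, 0.2

# L2 folded into the gradient, as in the Cython implementation:
w_grad = w - stepsize * (g + alpha * w)

# L2 handled via its proximity operator (shrinkage after the gradient step):
w_prox = (w - stepsize * g) / (1.0 + stepsize * alpha)

# The two updates agree only to first order in the step size,
# so the iterates differ step by step even though both schemes converge.
```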
@zermelozf yes, I also think it should be `eta`. Your Python code looks good to me (besides the L2 regularization that Mathieu commented on).
I've tried to debug the failure in `test_saga_coef_equality` but couldn't find the source of the problem. I suspect it is coming from some error in the second call to `_lagged_updates` (the one that is SAGA-specific)... the crazy thing is that on the first 3 steps the updates are the same, and then for some reason they start diverging...
BTW in the tests you should be comparing `pysaga.coef_` vs `saga.coef_ * saga.coef_scale_`, although that does not fix the test failure...
Hooray! With a small modification of the update in SAG and the direct L2^2 update the Python and Cython codes are now equivalent!
I still have one test not passing for L1 regularization that makes me crazy but it shouldn't be a big deal.
The sad thing is that most of the optimisation tricks are broken with L1 reg. (Until we adapt them that is :).
YEAH
It should be possible to use identities relating `prox(scale * x)` to `prox(x)`. See e.g. section 2.2 of *Proximal Algorithms*.
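For a positively homogeneous penalty such as the L1 norm, the relevant identity is `prox_{tau f}(c * x) = c * prox_{(tau / c) f}(x)` for `c > 0`, which can be checked directly with soft-thresholding (a sketch; variable names are illustrative):

```python
import numpy as np

def soft_threshold(x, tau):
    # Proximity operator of tau * ||.||_1
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

x = np.array([3.0, -0.2, 1.5])
c, tau = 0.5, 0.4

lhs = soft_threshold(c * x, tau)      # prox applied to the scaled point
rhs = c * soft_threshold(x, tau / c)  # scale pulled out of the prox
```

This is what makes it possible to keep the iterate stored as `w_scale * w` and still apply the prox lazily.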
Next step is to do sparse/jit updates for L1 and L1+L2^2. This is not going to be easy...
My idea was to delegate the scaling to the penalty object so that the code is still correct when `f` does not satisfy the homogeneity property. More precisely, `Penalty.projection` would take the `w_scale` as an argument and deal with it as it can. Homogeneous functions would use the formula you outlined, but others could implement their own strategy.
I added a just-in-time update for the L1 penalty. The API may not be perfect, but the tests pass.
@fabianp, I also added `w_scale` as an argument to the prox. The signature of the function is a bit weird now, since it's not used anymore, but it may be useful for other proximity operators as you suggested.
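For reference, the updates that a just-in-time scheme has to reproduce for a coordinate that was skipped `k` times can be written naively like this (an O(k) sketch of the semantics only; the lagged implementation computes the same result in closed form, and the function name is hypothetical):

```python
import numpy as np

def catch_up(w_j, k, stepsize, beta, g_avg_j):
    """Apply the k prox-gradient updates that coordinate j missed while its
    feature was zero in the sampled rows (naive version of a lazy update)."""
    for _ in range(k):
        w_j = w_j - stepsize * g_avg_j  # missed average-gradient step
        # missed prox step (soft-thresholding for the L1 penalty)
        w_j = np.sign(w_j) * max(abs(w_j) - stepsize * beta, 0.0)
    return w_j
```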
Yes, I had not thought of the fact that `w_scale` can be obtained from `scale_cum`, which makes it redundant. Maybe it's not needed after all (it would be nice, however, to document what `scale_cum` is...).
Something is still not quite right, since the test `test_l1_regularized_saga` does not pass when I set `beta` to something bigger than 0.1 (e.g. `beta=1.`).
Yes, you're right. I used `t` instead of `t+1` in the lagged update. It should work now.
Oh, one thing that we haven't checked is the stopping criterion (the `violation` constant). To me it looks like this would only be valid for non-composite objectives (i.e. for SAG but not for general SAGA).
As I see it, what is currently computed in `violation` is the squared norm of the gradient of the smooth part of the objective function. I think that for SAGA with a general prox, this should be replaced with the squared norm of `x - prox_{stepsize * beta}(x - stepsize * g_sum)` (i.e. the squared norm of the gradient mapping). What do you think @mblondel?
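A sketch of the proposed criterion, assuming an L1 prox (the `soft_threshold` helper and the argument names here are illustrative, not the actual lightning code):

```python
import numpy as np

def soft_threshold(x, tau):
    # Proximity operator of tau * ||.||_1
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def violation(w, g_sum, stepsize, beta):
    """Squared norm of w - prox_{stepsize * beta}(w - stepsize * g_sum),
    i.e. of the gradient mapping: it is zero exactly at a fixed point of
    the prox-gradient step, and reduces to the (scaled) squared gradient
    norm when beta = 0."""
    w_new = soft_threshold(w - stepsize * g_sum, stepsize * beta)
    return np.sum((w - w_new) ** 2)
```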
Indeed, this is one possibility.
I've added a test to see if `w` is close to a fixed point of the prox. I left the norm of the gradient for the case where no prox is provided. In both cases I reuse the `violation_ratio`. @fabianp, does it seem correct to you?
Also, I would do the case distinction `if penalty is not None:` at the top, i.e. before entering the loop `for j in xrange(n_features):`. This should make it more readable, and you can preserve the two-liners that are currently used for the `penalty=None` case.
Also, to preserve the memory footprint for SAG, you should allocate the arrays `w_violation_` and `last_penalty_` only when penalty is not None.
Yes, I've updated the code with the step size and scaling correction. It seems to work now on `fetch_20newsgroups_vectorized`, in just a bit more time than SAG.
FYI, this gist comparing sparse and dense updates on the newsgroups dataset (a `(1000, 130107)` sparse matrix) gives the following results:

`SAGAClassifier`

| | Sparse | Dense |
|---|---|---|
| time (sec.) | 1.49 | 393.5 |
| score | 0.232 | 0.236 |

`SAGClassifier`

| | Sparse | Dense |
|---|---|---|
| time (sec.) | 0.39 | 81.9 |
| score | 0.216 | 0.212 |
Hey,
the stopping criterion now looks good to me. Thanks for the benchmarks; the importance of the sparse lagged updates is clear from the comparison. The main timing difference between SAG and SAGA comes from the evaluation of the prox: setting `prox=None` makes it go from 1.5s to 0.66s (very close to SAG, which is 0.44s for me). I think the default should be `prox=None`, since `beta=0.` by default anyway (hence the prox is just slowing down the algorithm).
I modified the SAG constructor and put `prox=None` by default for SAGA. The whole (squashed) commit is in #40.
The fact that the scores differ in the dense and sparse cases is not a good sign. They should be exactly the same. Could you set `random_state=0` in `SAGClassifier` and `SAGAClassifier` and retry? By the way, lightning has a `benchmarks/` folder. You can add your script there if you like.
@mblondel, indeed re-running with `random_state=0` leads to exactly the same scores in the sparse and dense cases.
As suggested by @fabianp, I'll try to do a proper benchmark with distance to optimum per iteration/time soon and add it to the `benchmarks/` folder.
Continuing #37 after discussing with @fabianp. Added prox capability in `_sag_fit` of file `lightning/impl/sag_fast.pyx`, where @fabianp left room for it. The proximity operator is currently specified when a classifier/regressor is built, via the `prox` keyword (of type `ProxFunction`, mimicking `LossFunction` in `lightning/impl/sgd_fast.pyx`). Not sure this is the best way to specify it by default...

Notes: the prox implementation breaks sparse updates, and the code is excruciatingly slow on `sklearn.datasets.fetch_20newsgroups_vectorized` (cf. this gist).