Closed: AnchorBlues closed this issue 4 years ago
@AnchorBlues it's been a while since I dug into the math. Could you share a screenshot where it shows that the learning rate matters in the link you shared? Let me also solicit opinions from @pavanramkumar and @titipata.
I think what is happening is that scikit-learn uses an automatically determined learning rate for 'gaussian' that comes from the maximum eigenvalue of X^T X. Can you verify: with learning_rate=0.2, how does our convergence plot look? I suspect the convergence is too slow.
@jasmainak
Could you share a screenshot where it shows that the learning rate matters in the link you shared?
The following site may be easier to understand.
http://www.stat.cmu.edu/~ryantibs/convexopt-S15/scribes/08-prox-grad-scribed.pdf
The relevant section is 8.1.3 (Iterative soft-thresholding algorithm (ISTA)). There, t is the learning rate and lambda is the L1 regularization parameter. As formulas (8.11) and (8.12) show, the Lasso parameters are optimized by taking a gradient step and then applying the soft-thresholding operator to the updated parameters, where the threshold of the soft-thresholding operator is the product of the learning rate and the L1 regularization parameter, t * lambda. The same method applies to the Elastic-Net.

Can you verify that with learning_rate=0.2, how does our convergence plot look?
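To make the update concrete, here is a minimal NumPy sketch of one ISTA step for the Lasso, following formulas (8.11) and (8.12) of the linked notes. The function names `soft_threshold` and `ista_lasso_step` are mine for illustration, not pyglmnet's; the key point is that the soft-thresholding threshold is `learning_rate * reg_lambda`.

```python
import numpy as np

def soft_threshold(z, threshold):
    # Elementwise soft-thresholding operator S_threshold(z):
    # shrinks each entry toward zero by `threshold`, clipping at zero.
    return np.sign(z) * np.maximum(np.abs(z) - threshold, 0.0)

def ista_lasso_step(beta, X, y, learning_rate, reg_lambda):
    # One ISTA step for the Lasso: a gradient step on the squared loss,
    # then soft-thresholding with threshold = learning_rate * reg_lambda
    # (i.e. t * lambda in the notation of the linked notes).
    grad = X.T @ (X @ beta - y) / len(y)
    return soft_threshold(beta - learning_rate * grad,
                          learning_rate * reg_lambda)
```

Iterating `ista_lasso_step` to a fixed point solves the Lasso problem; note that changing `learning_rate` changes the threshold passed to `soft_threshold`, which is exactly the factor at issue in this thread.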
I updated the version of pyglmnet to 1.2.dev0 to use the method plot_convergence (the trained results were not changed) and plotted the convergence for the case learning_rate=0.2.
Converged in 443 iterations.
hmm ... indeed you are right. Have you verified that the fix solves the comparison problem with scikit-learn? Would be great if you could make a pull request! Thank you so much.
@jasmainak
Have you verified that the fix solves the comparison problem with scikit-learn ?
Yes, I have verified it.
Would be great if you can make a pull request!
I made a pull request.
Please confirm.
Thanks @AnchorBlues for the careful verification and fix!
closed by #384
In the source code, the proximal operator is defined as follows:
The second argument is the threshold of the proximal operator.
As the linked reference shows, the threshold should be the product of the learning rate and the L1 regularization parameter.
However, this method is called with only the L1 regularization parameter (reg_lambda * alpha) as its second argument, as follows. I think reg_lambda * alpha must be replaced with learning_rate * reg_lambda * alpha. Otherwise, a model with L1 regularization will not be trained correctly when learning_rate is not 1. In fact, the trained result of GLM can be quite different from that of sklearn when the learning rate of GLM is NOT 1.
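The effect of the missing factor can be demonstrated with a small NumPy sketch (this is illustrative code, not pyglmnet's implementation; `fit_lasso_ista` and the `fix_threshold` flag are hypothetical names). With the corrected threshold `learning_rate * reg_lambda`, the ISTA fixed point does not depend on the learning rate; with the buggy threshold `reg_lambda` alone, the fixed-point condition becomes `grad + (reg_lambda / learning_rate) * sign(beta) = 0`, i.e. the regularization is effectively rescaled by 1 / learning_rate.

```python
import numpy as np

def soft_threshold(z, threshold):
    # Elementwise soft-thresholding operator.
    return np.sign(z) * np.maximum(np.abs(z) - threshold, 0.0)

def fit_lasso_ista(X, y, reg_lambda, learning_rate, fix_threshold, n_iter=5000):
    # ISTA for the Lasso. When fix_threshold is True, the proximal
    # threshold is learning_rate * reg_lambda (correct); otherwise it is
    # reg_lambda alone, mimicking the bug discussed in this issue.
    beta = np.zeros(X.shape[1])
    thresh = learning_rate * reg_lambda if fix_threshold else reg_lambda
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / len(y)
        beta = soft_threshold(beta - learning_rate * grad, thresh)
    return beta

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = X @ np.array([2.0, -1.0, 0.0, 0.0, 0.5])

# Correct threshold: two different learning rates reach (nearly) the
# same solution. Buggy threshold: the solution depends on learning_rate.
b_a = fit_lasso_ista(X, y, 0.1, learning_rate=0.5, fix_threshold=True)
b_b = fit_lasso_ista(X, y, 0.1, learning_rate=0.1, fix_threshold=True)
b_bug = fit_lasso_ista(X, y, 0.1, learning_rate=0.1, fix_threshold=False)
print(np.max(np.abs(b_a - b_b)))    # close to zero
print(np.max(np.abs(b_a - b_bug)))  # substantially larger
```

This mirrors the observation in the thread: the model only appears correct at learning_rate=1, where the two thresholds coincide.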