Open morrisalp opened 2 years ago
Hi @cmarmo / @Micky774 can I work on this issue?
Sure! I think improving the existing documentation to either explain or at least present the heuristic is probably the best way to proceed.
Thanks!
The default value for t_0 is not explicitly defined, but it is highlighted in https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/linear_model/_sgd_fast.pyx#L526:
if learning_rate == OPTIMAL:
    typw = np.sqrt(1.0 / np.sqrt(alpha))
    # computing eta0, the initial learning rate
    initial_eta0 = typw / max(1.0, loss.dloss(-typw, 1.0))
    # initialize t such that eta at first sample equals eta0
    optimal_init = 1.0 / (initial_eta0 * alpha)
As the documentation states for alpha (yes, there is a grammatical error within the docs):
alpha : ... Also used to compute the learning rate when set to learning_rate is set to 'optimal'. Values must be in the range [0.0, inf). default=0.0001
Later on in the code:
t = 1 by default, so t_0 would be 0. I'm not extremely familiar with the plain SGD formula, but with some rearranging, the formula within the code looks like the plain SGD formula described at https://leon.bottou.org/projects/sgd
This is reference [7] on the doc page.
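To double-check that reading against the snippet above, here is a minimal sketch; the indexing of t and the placeholder value for eta0 are my own assumptions, not something taken from the codebase:

import numpy as np

# Assumption: the 'optimal' schedule is eta = 1 / (alpha * (t + t_0)), as the
# SGDClassifier docs state, with t counted from 0 at the first sample and
# t_0 = optimal_init from the snippet above.
alpha = 1e-4
initial_eta0 = 10.0                            # placeholder eta0 for illustration
optimal_init = 1.0 / (initial_eta0 * alpha)    # t_0 = 1000.0

eta_first = 1.0 / (alpha * (optimal_init + 0))
assert np.isclose(eta_first, initial_eta0)     # eta at the first sample equals eta0

# for large t the schedule behaves like 1 / (alpha * t), the asymptotically
# decreasing rate described on Bottou's page for a regularized objective
t = 100_000
print(1.0 / (alpha * (optimal_init + t)), 1.0 / (alpha * t))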
Am I missing something?
I think the intention is essentially to include and explain the first block of code in the documentation. That is,
if learning_rate == OPTIMAL:
    typw = np.sqrt(1.0 / np.sqrt(alpha))
    # computing eta0, the initial learning rate
    initial_eta0 = typw / max(1.0, loss.dloss(-typw, 1.0))
    # initialize t such that eta at first sample equals eta0
    optimal_init = 1.0 / (initial_eta0 * alpha)
should be documented in the User Guide, since t_0 := optimal_init.
Thanks for clarifying @Micky774
Doing a bit of a dive into the existing references, I was able to confirm the formula that is being used. However, the hyperparameter alpha is not defined in the paper nor in any of its formulas. alpha would be 0.0001 by default, and there isn't anything in the SGD module that makes me think it is calculated in a different way, unless you are aware of something? Nonetheless, it is clearly a regularising term and it allows scheduling of learning rates that are asymptotically decreasing.
From what I can gather, the initial optimal learning rate would be given by:
\eta_{(0)} = \frac{\alpha^{-0.25}}{\max\left(1.0,\ \frac{\partial L(-\alpha^{-0.25})}{\partial w}\right)}
where alpha and the selected loss function are hyperparameters chosen by the user via the arguments alpha and loss, respectively.
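As a quick sanity check of that formula, here is a small worked example, assuming the default alpha = 0.0001 and the hinge loss; the hinge_dloss helper is only an illustrative stand-in for loss.dloss, not the actual implementation:

import numpy as np

alpha = 1e-4  # default value of the alpha argument

def hinge_dloss(p, y):
    # derivative of max(0, 1 - p*y) with respect to the prediction p
    return -y if p * y < 1.0 else 0.0

typw = np.sqrt(1.0 / np.sqrt(alpha))                       # alpha ** -0.25 = 10.0
initial_eta0 = typw / max(1.0, hinge_dloss(-typw, 1.0))    # max(1.0, -1.0) = 1.0, so eta0 = 10.0
optimal_init = 1.0 / (initial_eta0 * alpha)                # t_0 = 1000.0
print(typw, initial_eta0, optimal_init)

For the hinge loss the max(1.0, ...) term evaluates to 1.0, so eta0 is just alpha ** -0.25 and t_0 works out to 1000 with the default alpha.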
Is this the level at which the User Guide should be, or should I try to find the reference where the above formula was taken from?
if learning_rate == OPTIMAL:
    typw = np.sqrt(1.0 / np.sqrt(alpha))
    # computing eta0, the initial learning rate
    initial_eta0 = typw / max(1.0, loss.dloss(-typw, 1.0))
    # initialize t such that eta at first sample equals eta0
    optimal_init = 1.0 / (initial_eta0 * alpha)
Could anyone explain why it uses alpha (the regularization parameter) to set t_0 like this? Thanks.
Here's an overview of one commonly used approach described in the paper (a small sketch of these steps follows the list):
Set a default learning rate range: Define a reasonable range for the initial learning rate, such as [0.01, 0.1].
Conduct a coarse grid search: Start by selecting a few learning rate values within the defined range, for example, [0.01, 0.03, 0.1]. Train the model using each learning rate for a small number of iterations or epochs (e.g., 10-20) and record the corresponding performance metrics (e.g., loss, accuracy). -> This could be alpha
Analyze the performance: Examine the model's performance at each learning rate and observe how the performance changes. Look for learning rates that result in fast convergence without causing instability or poor performance.
Refine the search: Based on the results of the coarse grid search, narrow down the range of learning rates to a smaller interval that shows promising performance. For example, if [0.01, 0.03, 0.1] yielded good results, refine the range to [0.01, 0.02, 0.03].
Perform a finer grid search: Repeat the training process with the refined learning rate range, using a higher number of iterations or epochs. This helps to obtain a more accurate assessment of the model's performance at each learning rate.
Choose the optimal learning rate: Analyze the results of the finer grid search and select the learning rate that leads to the best overall performance. This could be the learning rate that achieves the fastest convergence or the highest accuracy, depending on the specific goals of your task.
It's important to note that the process of finding the initial learning rate can be iterative and may require experimentation and adjustment based on the characteristics of the dataset and model. The goal is to find a learning rate that enables stable and efficient convergence during training, leading to good generalization and performance on unseen data.
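Here is a minimal sketch of what the coarse-to-fine search could look like with scikit-learn's own API; the dataset, parameter grids, and iteration counts are illustrative choices on my part, not anything prescribed by a reference:

from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=2000, random_state=0)

# coarse search over the initial learning rate with a constant schedule
coarse = GridSearchCV(
    SGDClassifier(learning_rate="constant", max_iter=20, tol=None, random_state=0),
    param_grid={"eta0": [0.01, 0.03, 0.1]},
    cv=3,
)
coarse.fit(X, y)
print(coarse.best_params_, coarse.best_score_)

# refine around the best coarse value with more iterations
fine = GridSearchCV(
    SGDClassifier(learning_rate="constant", max_iter=200, tol=None, random_state=0),
    param_grid={"eta0": [0.01, 0.02, 0.03]},
    cv=3,
)
fine.fit(X, y)
print(fine.best_params_, fine.best_score_)

This uses learning_rate='constant' so that eta0 is actually the rate being tuned, rather than just the starting point of a decaying schedule.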
We can add some of these points to the documentation. Thoughts, @adrinjalali?
Thanks to @ogrisel's help, it seems this is coming from this paper, particularly section 5.2:
Describe the issue linked to the documentation
The documentation for SGDClassifier says that with learning_rate='optimal' the initial learning rate t0 is chosen with a heuristic proposed by Léon Bottou that "can be found in _init_t in BaseSGD". However, this method does not currently exist in BaseSGD.
Suggest a potential alternative/fix
The calculation of t0 should be documented, either by adding the formula, by pointing to the correct method in the code, and/or by specifying exactly which paper of Bottou's is being referred to.
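For what it's worth, one way the User Guide entry could spell this out, reading the quantities directly off the code block quoted above (the notation L'(p, y) for loss.dloss is my own shorthand):

\eta^{(t)} = \frac{1}{\alpha (t_0 + t)}, \qquad
t_0 = \frac{1}{\alpha \eta_0}, \qquad
\eta_0 = \frac{w_{typ}}{\max\left(1,\ L'(-w_{typ},\ 1)\right)}, \qquad
w_{typ} = \alpha^{-1/4}

where L'(p, y) is the derivative of the chosen loss with respect to the prediction p (loss.dloss in the code), and the first equation is the eta = 1 / (alpha * (t + t0)) schedule already stated in the SGDClassifier docstring.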