scikit-learn / scikit-learn

scikit-learn: machine learning in Python
https://scikit-learn.org
BSD 3-Clause "New" or "Revised" License

SGDClassifier initial lr undocumented #24172

Open morrisalp opened 2 years ago

morrisalp commented 2 years ago

Describe the issue linked to the documentation

The documentation for SGDClassifier says that with learning_rate='optimal', the value t0 (which determines the initial learning rate) is chosen by a heuristic proposed by Léon Bottou that "can be found in _init_t in BaseSGD".

However, this method does not currently exist in BaseSGD.

Suggest a potential alternative/fix

The calculation of t0 should be documented by adding the formula, pointing to the correct method in the code, and/or specifying exactly which paper by Bottou is being referred to.

ashishthanki commented 1 year ago

Hi @cmarmo / @Micky774 can i work on this issue?

Micky774 commented 1 year ago

Hi @cmarmo / @Micky774 can i work on this issue?

Sure! I think improving the existing documentation to either explain or at least present the heuristic is probably the best way to proceed.

ashishthanki commented 1 year ago

Thanks!

The default value of t_0 is not explicitly documented, but it is computed at https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/linear_model/_sgd_fast.pyx#L526:

    if learning_rate == OPTIMAL:
        typw = np.sqrt(1.0 / np.sqrt(alpha))
        # computing eta0, the initial learning rate
        initial_eta0 = typw / max(1.0, loss.dloss(-typw, 1.0))
        # initialize t such that eta at first sample equals eta0
        optimal_init = 1.0 / (initial_eta0 * alpha)
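
For reference, here is a minimal standalone sketch of the same computation outside of Cython, assuming the default hinge loss (whose dloss(p, y) returns -y when p * y <= 1 and 0 otherwise, as in _sgd_fast.pyx) and the default alpha=0.0001; the hinge_dloss helper below is only illustrative, not part of scikit-learn's public API:

    import numpy as np

    def hinge_dloss(p, y):
        # Illustrative stand-in for Hinge.dloss in _sgd_fast.pyx:
        # derivative of the hinge loss w.r.t. the prediction p.
        return -y if p * y <= 1.0 else 0.0

    alpha = 1e-4                                              # SGDClassifier default
    typw = np.sqrt(1.0 / np.sqrt(alpha))                      # alpha ** -0.25 -> 10.0
    initial_eta0 = typw / max(1.0, hinge_dloss(-typw, 1.0))   # -> 10.0
    optimal_init = 1.0 / (initial_eta0 * alpha)               # t_0 -> 1000.0
    print(initial_eta0, optimal_init)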

As the documentation states for alpha (and yes, there is a grammatical error in the docs):

alpha : ... Also used to compute the learning rate when set to learning_rate is set to 'optimal'. Values must be in the range [0.0, inf). default=0.0001

Later on in the code:

https://github.com/scikit-learn/scikit-learn/blob/14130f44eb6cba8a2fb2eff8383be8909783cad0/sklearn/linear_model/_sgd_fast.pyx#L553

t=1 by default, so t_0 would be 0. I'm not extremely familiar with the plain SGD formula, but after some rearranging of the formula in the code it looks like the plain SGD formula from https://leon.bottou.org/projects/sgd, which is reference [7] on the doc page.

Am I missing something?

Micky774 commented 1 year ago

I think the intention is to essentially include and explain the first block of code in the documentation. That is,

    if learning_rate == OPTIMAL:
        typw = np.sqrt(1.0 / np.sqrt(alpha))
        # computing eta0, the initial learning rate
        initial_eta0 = typw / max(1.0, loss.dloss(-typw, 1.0))
        # initialize t such that eta at first sample equals eta0
        optimal_init = 1.0 / (initial_eta0 * alpha)

should be documented in the User Guide, since t_0 := optimal_init.

ashishthanki commented 1 year ago

Thanks for clarifying @Micky774

Doing a bit of a dive into the existing references, I was able to confirm the formula that is being used. However, the hyperparameter alpha is not defined in the paper nor in any of its formulas. alpha defaults to 0.0001, and there isn't anything in the SGD module that makes me think it is computed in a different way, unless you are aware of something? Nonetheless, it clearly is a regularising term and allows scheduling of learning rates that decrease asymptotically.

From what I can gather, the initial optimal learning rate would be given by:

\eta_{0} = \frac{\alpha^{-0.25}}{\max\left(1.0,\ \frac{\partial L}{\partial p}\left(-\alpha^{-0.25},\ 1\right)\right)}

where the derivative of the loss L is taken with respect to the prediction p (this is what loss.dloss computes), and alpha and the loss function are hyperparameters chosen by the user via the arguments alpha and loss, respectively.
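
To connect this back to the learning rate schedule itself, here is a small sketch of how t_0 feeds into the decay, assuming the hinge-loss case where the max(...) term equals 1, and the schedule eta = 1.0 / (alpha * (t + t0)) stated in the SGDClassifier docstring for learning_rate='optimal':

    alpha = 1e-4                # SGDClassifier default
    # For the hinge loss, max(1.0, dloss(-typw, 1.0)) == 1.0, so eta0 == alpha ** -0.25.
    eta0 = alpha ** -0.25       # 10.0
    t0 = 1.0 / (eta0 * alpha)   # optimal_init == 1000.0

    # eta(t) = 1.0 / (alpha * (t + t0)); at t = 0 this recovers eta0.
    for t in (0, 1, 10, 100, 1000, 10000):
        print(t, 1.0 / (alpha * (t + t0)))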

Is this the level at which the User Guide should be, or should I try to find the reference where the above formula was taken from?

xuefeng-xu commented 1 year ago

    if learning_rate == OPTIMAL:
        typw = np.sqrt(1.0 / np.sqrt(alpha))
        # computing eta0, the initial learning rate
        initial_eta0 = typw / max(1.0, loss.dloss(-typw, 1.0))
        # initialize t such that eta at first sample equals eta0
        optimal_init = 1.0 / (initial_eta0 * alpha)

Could anyone explain why it uses alpha, the regularization parameter, to set t_0 like this? Thanks.

rand0wn commented 1 year ago

Here's an overview of one commonly used approach for choosing the initial learning rate:

Set a default learning rate range: Define a reasonable range for the initial learning rate, such as [0.01, 0.1].

Conduct a coarse grid search: Start by selecting a few learning rate values within the defined range, for example, [0.01, 0.03, 0.1]. Train the model using each learning rate for a small number of iterations or epochs (e.g., 10-20) and record the corresponding performance metrics (e.g., loss, accuracy). This could correspond to alpha.

Analyze the performance: Examine the model's performance at each learning rate and observe how the performance changes. Look for learning rates that result in fast convergence without causing instability or poor performance.

Refine the search: Based on the results of the coarse grid search, narrow down the range of learning rates to a smaller interval that shows promising performance. For example, if [0.01, 0.03, 0.1] yielded good results, refine the range to [0.01, 0.02, 0.03].

Perform a finer grid search: Repeat the training process with the refined learning rate range, using a higher number of iterations or epochs. This helps to obtain a more accurate assessment of the model's performance at each learning rate.

Choose the optimal learning rate: Analyze the results of the finer grid search and select the learning rate that leads to the best overall performance. This could be the learning rate that achieves the fastest convergence or the highest accuracy, depending on the specific goals of your task.

It's important to note that the process of finding the initial learning rate can be iterative and may require experimentation and adjustment based on the characteristics of the dataset and model. The goal is to find a learning rate that enables stable and efficient convergence during training, leading to good generalization and performance on unseen data.
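
For anyone who wants to pick the initial learning rate empirically rather than rely on the heuristic, a rough sketch of such a coarse-to-fine search with scikit-learn itself could look like the following (the dataset, grid values, and the learning_rate='constant' setting are illustrative assumptions, not how learning_rate='optimal' behaves):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import SGDClassifier
    from sklearn.model_selection import GridSearchCV

    X, y = make_classification(n_samples=1000, random_state=0)

    # Coarse search over the initial learning rate (tuning eta0 requires a
    # schedule other than 'optimal', e.g. 'constant' or 'invscaling').
    coarse = GridSearchCV(
        SGDClassifier(learning_rate="constant", max_iter=20, random_state=0),
        param_grid={"eta0": [0.01, 0.03, 0.1]},
        cv=3,
    ).fit(X, y)

    # Finer search around the best coarse value, with more iterations.
    best = coarse.best_params_["eta0"]
    fine = GridSearchCV(
        SGDClassifier(learning_rate="constant", max_iter=100, random_state=0),
        param_grid={"eta0": [best / 2, best, best * 2]},
        cv=3,
    ).fit(X, y)
    print(fine.best_params_)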

We could add some of these points to the documentation. Thoughts, @adrinjalali?

adrinjalali commented 1 year ago

Thanks to @ogrisel's help, it seems this is coming from this paper, particularly this part of section 5.2:

[screenshot of section 5.2 from the referenced paper]