Closed: ogrisel closed this issue 2 years ago.
/cc @lorentzenchr as it's related to the refactoring in #21808.
We should also document how sample weights are handled in the loss (in a subsection to avoid making the top level equations too complex).
Overall 👍
Concerning the loss, #21808 uses the formulation with y in {0, 1} (or even y in the interval [0, 1]), which is also easier to generalize to the multinomial case. Which formulation is used is, however, solver dependent (we do not need to mention this).
> Concerning the loss, #21808 uses the formulation with y in {0, 1} (or even y in the interval [0, 1]), which is also easier to generalize to the multinomial case.
I am fine with either convention for y as long as the encoding of y is made explicit just before or after the math equation. I believe the expression with y in {-1, 1} is a bit more compact, but that's a detail. Maybe the y in {0, 1} expression is more intuitive because it directly matches the typical encoding of y we use in scikit-learn for binary classification problems (both variants are sketched below).
> Which formulation is used is, however, solver dependent (we do not need to mention this).
I agree.
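For reference, a minimal sketch of the two equivalent binary log-loss formulations being discussed (my notation, not a quote from #21808):

```latex
% With y_i \in \{0, 1\} and \hat{p}_i = \sigma(x_i^\top w + c):
\ell_i = -\, y_i \log \hat{p}_i - (1 - y_i) \log (1 - \hat{p}_i)

% Equivalently, with t_i = 2 y_i - 1 \in \{-1, 1\}:
\ell_i = \log\bigl(1 + \exp\bigl(-t_i \, (x_i^\top w + c)\bigr)\bigr)
```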
Adding to the list of issues, I would like to mention that, reading the documentation, it seems that the logistic loss is not scaled by the number of samples. Is this an error in the documentation?
Apparently, others are wondering (https://stats.stackexchange.com/questions/540624/tuning-penalty-strength-in-scikit-learn-logistic-regression?rq=1)
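For reference, a minimal sketch of how one could check this behaviour (the dataset and tolerance below are illustrative, not from the original report): if the data-fit term is an unscaled sum, duplicating every sample should be equivalent to doubling C on the original data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, random_state=0)

# Fit once on the duplicated dataset with C=1, once on the original with C=2.
clf_duplicated = LogisticRegression(C=1.0).fit(
    np.vstack([X, X]), np.concatenate([y, y])
)
clf_double_C = LogisticRegression(C=2.0).fit(X, y)

# If the documented objective 1/2 ||w||^2 + C * sum_i loss_i is what is solved,
# the two fits should agree up to solver tolerance.
print(np.allclose(clf_duplicated.coef_, clf_double_C.coef_, atol=1e-3))
```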
Working on this now in #22382 if anyone would like to give some feedback over there!
Describe the issue linked to the documentation
The current description at https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression is a bit confusing.
Suggest a potential alternative/fix
I think we should first present the 2 equations for binary logistic regression and multinomial logistic regression with l2 regularization, and give the encoding of `y_i` right below each equation, so as to put the binary and the multiclass case on an equal footing (a rough sketch follows below).

The possibility to swap the l2 regularization for l1 or elastic-net regularization should be moved to a dedicated subsection that would then give the formula for those regularized loss functions, but only for the binary case for the sake of conciseness.
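A rough sketch of what such a presentation could look like (my notation; the exact scaling conventions would be settled in the PR):

```latex
% Binary case, \ell_2 penalty, with y_i \in \{0, 1\}:
\min_{w, c} \; \frac{1}{2} w^\top w
  + C \sum_{i=1}^{n} \Bigl[ -y_i \log \hat{p}_i - (1 - y_i) \log (1 - \hat{p}_i) \Bigr],
\qquad \hat{p}_i = \frac{1}{1 + \exp\bigl(-(x_i^\top w + c)\bigr)}

% Multinomial case, \ell_2 penalty, with y_i \in \{1, \dots, K\}:
\min_{W, b} \; \frac{1}{2} \lVert W \rVert_F^2
  + C \sum_{i=1}^{n} \sum_{k=1}^{K} - [y_i = k] \, \log \hat{p}_{i,k},
\qquad \hat{p}_{i,k} = \frac{\exp(x_i^\top w_k + b_k)}{\sum_{j=1}^{K} \exp(x_i^\top w_j + b_j)}
```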
I think we should also have a subsection to make it explicit that, for the multinomial case, scikit-learn's implementation over-parametrizes the model since the `coef_` array has shape `(n_classes, n_features)`, while it would alternatively be possible to use an `(n_classes - 1, n_features)` parametrization, as is often done in the literature (see the short code sketch below). We could justify the choice of the over-parametrized formulation by the wish to preserve the symmetrical inductive bias w.r.t. the classes, which is especially important because of the penalization term.

We might also have another subsection that gives the mathematical description of the prediction function, both for the binary case (with the logistic sigmoid) and the multinomial case (with the softmax function), and explains how those prediction functions stem from the modeling choice to parameterize log-ratios of conditional class probabilities with linear combinations of the input features. And finally, how to recover the loss functions by taking the negative log-likelihood of those prediction functions, which yields the MLE (or the MAP estimate of a Bayesian formulation when adding the penalty term).
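As an illustration of the over-parametrization and of the softmax prediction function (a minimal sketch on the iris dataset, not part of the proposed docs text):

```python
import numpy as np
from scipy.special import softmax
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(multi_class="multinomial", max_iter=1000).fit(X, y)

# Over-parametrized form: one coefficient row per class, i.e. (3, 4) on iris,
# rather than the (n_classes - 1, n_features) parametrization found in textbooks.
print(clf.coef_.shape)

# The predicted probabilities are the softmax of the per-class linear scores.
proba_manual = softmax(clf.decision_function(X), axis=1)
print(np.allclose(proba_manual, clf.predict_proba(X)))  # expected: True
```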