Closed: ogrisel closed this issue 2 years ago.
/cc @lorentzenchr as it's related to the refactoring in #21808.
We should also document how sample weights are handled in the loss (in a subsection to avoid making the top level equations too complex).
Overall 👍
Concerning the loss, #21808 uses the formulation with y in {0, 1} (or even y in the interval [0, 1]), which is also easier to generalize to the multinomial case. Which formulation is used is, however, solver dependent (we do not need to mention this).
> Concerning the loss, #21808 uses the formulation with y in {0, 1} (or even y in the interval [0, 1]), which is also easier to generalize to the multinomial case.
I am fine with either convention for y as long as the encoding of y is made explicit just before or after the math equation. I believe the expression with y in {-1, 1} is a bit more compact, but that's a detail. Maybe the y in {0, 1} expression is more intuitive because it directly matches the typical encoding of y we use in scikit-learn for binary classification problems (both variants are sketched below).
> Which formulation is used is, however, solver dependent (we do not need to mention this).
I agree.
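For reference, a minimal sketch of the two equivalent binary log-loss formulations being discussed (my notation, not a quote from #21808):

```latex
% With y_i \in \{0, 1\} and \hat{p}_i = \sigma(x_i^\top w + c):
\ell_i = -\, y_i \log \hat{p}_i - (1 - y_i) \log (1 - \hat{p}_i)

% Equivalently, with t_i = 2 y_i - 1 \in \{-1, 1\}:
\ell_i = \log\bigl(1 + \exp\bigl(-t_i \, (x_i^\top w + c)\bigr)\bigr)
```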
Adding to the list of issues, I would like to mention that, reading the documentation, it seems that the logistic loss is not scaled by the number of samples. Is this an error in the documentation?
Apparently, others are wondering (https://stats.stackexchange.com/questions/540624/tuning-penalty-strength-in-scikit-learn-logistic-regression?rq=1)
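For reference, a minimal sketch of how one could check this behaviour (the dataset and tolerance below are illustrative, not from the original report): if the data-fit term is an unscaled sum, duplicating every sample should be equivalent to doubling C on the original data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, random_state=0)

# Fit once on the duplicated dataset with C=1, once on the original with C=2.
clf_duplicated = LogisticRegression(C=1.0).fit(
    np.vstack([X, X]), np.concatenate([y, y])
)
clf_double_C = LogisticRegression(C=2.0).fit(X, y)

# If the documented objective 1/2 ||w||^2 + C * sum_i loss_i is what is solved,
# the two fits should agree up to solver tolerance.
print(np.allclose(clf_duplicated.coef_, clf_double_C.coef_, atol=1e-3))
```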
Working on this now in #22382 if anyone would like to give some feedback over there!
Describe the issue linked to the documentation
The current description at https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression is a bit confusing.
Suggest a potential alternative/fix
I think we should first present the 2 equations for binary logistic regression and multinomial logistic regression with l2 regularization, and give the encoding of `y_i` right below each equation, so as to put the binary and the multiclass case on an equal footing (a rough sketch follows below).

The possibility to swap the l2 regularization for l1 or elastic-net regularization should be moved to a dedicated subsection that would then give the formula for those regularized loss functions, but only for the binary case for the sake of conciseness.
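A rough sketch of what such a presentation could look like (my notation; the exact scaling conventions would be settled in the PR):

```latex
% Binary case, \ell_2 penalty, with y_i \in \{0, 1\}:
\min_{w, c} \; \frac{1}{2} w^\top w
  + C \sum_{i=1}^{n} \Bigl[ -y_i \log \hat{p}_i - (1 - y_i) \log (1 - \hat{p}_i) \Bigr],
\qquad \hat{p}_i = \frac{1}{1 + \exp\bigl(-(x_i^\top w + c)\bigr)}

% Multinomial case, \ell_2 penalty, with y_i \in \{1, \dots, K\}:
\min_{W, b} \; \frac{1}{2} \lVert W \rVert_F^2
  + C \sum_{i=1}^{n} \sum_{k=1}^{K} - [y_i = k] \, \log \hat{p}_{i,k},
\qquad \hat{p}_{i,k} = \frac{\exp(x_i^\top w_k + b_k)}{\sum_{j=1}^{K} \exp(x_i^\top w_j + b_j)}
```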
I think we should also have a subsection to make it explicit that, for the multinomial case, scikit-learn's implementation over-parametrizes the model since the `coef_` array has shape `(n_classes, n_features)`, while it would alternatively be possible to use an `(n_classes - 1, n_features)` parametrization, as is often done in the literature (see the short code sketch below). We could justify the choice of the over-parametrized formulation by the wish to preserve the symmetrical inductive bias w.r.t. the classes, which is especially important because of the penalization term.

We might also have another subsection that gives the mathematical description of the prediction function, both for the binary case (with the logistic sigmoid) and the multinomial case (with the softmax function), and explains how those prediction functions stem from the modeling choice to parameterize log-ratios of conditional class probabilities with linear combinations of the input features. And finally, how to recover the loss functions by taking the negative log-likelihood of those prediction functions, which yields the MLE (or the MAP estimate of a Bayesian formulation when adding the penalty term).
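As an illustration of the over-parametrization and of the softmax prediction function (a minimal sketch on the iris dataset, not part of the proposed docs text):

```python
import numpy as np
from scipy.special import softmax
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(multi_class="multinomial", max_iter=1000).fit(X, y)

# Over-parametrized form: one coefficient row per class, i.e. (3, 4) on iris,
# rather than the (n_classes - 1, n_features) parametrization found in textbooks.
print(clf.coef_.shape)

# The predicted probabilities are the softmax of the per-class linear scores.
proba_manual = softmax(clf.decision_function(X), axis=1)
print(np.allclose(proba_manual, clf.predict_proba(X)))  # expected: True
```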