Source of randomness in logistic regression

I would like to suggest elaborating a little on the randomness in logistic regression as compared to linear regression. This is mentioned on page 45, lines -3 to -1 of the Draft (April 30, 2021). I think the sentence "the randomness in classification is statistically modeled by the class probability construction 𝑝(𝑦 = 𝑚 | x) instead of an additive noise 𝜀" may not be enough for a reader who is encountering logistic regression for the first time. It would help to mention at least that, given x, the class label follows a Bernoulli(g(x)) distribution; the source of randomness then becomes clearer. You do mention this on page 54: "In binary logistic regression the output distribution 𝑝(𝑦 | x; 𝜽) is a Bernoulli distribution", but that is too far from the discussion of randomness in logistic regression.
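
To make the contrast concrete, the two models could be written side by side, roughly as follows. This is only a sketch of what I have in mind: it assumes the labels are coded as 0/1 and that g is the logistic function applied to 𝜽ᵀx, which may not match the notation you would prefer in the book.

```latex
\begin{align*}
\text{Linear regression:}\quad
  & y = \boldsymbol{\theta}^{\top}\mathbf{x} + \varepsilon, \\
\text{Binary logistic regression:}\quad
  & y \mid \mathbf{x} \sim \mathrm{Bernoulli}\bigl(g(\mathbf{x})\bigr),
  \qquad g(\mathbf{x}) = \frac{1}{1 + e^{-\boldsymbol{\theta}^{\top}\mathbf{x}}}.
\end{align*}
```

In the first line the randomness enters through the additive noise 𝜀; in the second it enters through the Bernoulli draw itself, so there is no separate noise term.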

If you ever consider adding exercises to your wonderful book, let me suggest one. It was highly insightful for me when I first simulated a dataset for binary logistic regression: it shows in practice where the randomness is. Here is my suggested Python code for the exercise:

import numpy as np
from sklearn.linear_model import LogisticRegression

def logistic(z):
    return 1 / (1 + np.exp(-z))

# Set random seed.
np.random.seed(0)
# True theta coefficients.
theta = np.array([4, -2])
# Number of training data points.
n = 100000
# Number of features.
p = len(theta)
# Generate feature values from U[0,1].
X = np.random.rand(n, p)
# Calculate logits.
z = X @ theta.reshape(-1, 1)
# Calculate probabilities.
prob = logistic(z)
# Generate labels by sampling from Bernoulli(prob)
y = np.random.binomial(1, prob.flatten())
# Train a Logistic regression model.
# Note: penalty=None needs scikit-learn >= 1.2; older versions used penalty="none".
clf = LogisticRegression(fit_intercept=False, penalty=None).fit(X, y)
# Check the coefficients - should be close to the true values.
print(f"Learnt theta: {np.round(clf.coef_, 2)} (true theta was {theta})")

# Out: Learnt theta: [[ 4.01 -2.  ]] (true theta was [ 4 -2])
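
A possible follow-up to the exercise, as a minimal sketch: fix a single input x and draw its label many times. The probability g(x) stays the same, while the individual labels vary from draw to draw, which is exactly the randomness discussed above. The input `x_fixed` and the number of draws `n_draws` are hypothetical values chosen for illustration, and I use NumPy's `default_rng` here instead of the global seed.

```python
import numpy as np

def logistic(z):
    return 1 / (1 + np.exp(-z))

# Seeded generator so the experiment is reproducible.
rng = np.random.default_rng(0)
# Same true coefficients as in the exercise above.
theta = np.array([4, -2])
# A single, fixed input (hypothetical value chosen for illustration).
x_fixed = np.array([0.5, 0.5])
# Class-1 probability for this input: logistic(4*0.5 - 2*0.5) = logistic(1.0), about 0.731.
p_fixed = logistic(x_fixed @ theta)
# Draw the label for the SAME input many times (hypothetical number of draws).
n_draws = 100_000
labels = rng.binomial(1, p_fixed, size=n_draws)
# The labels differ between draws, but their average approaches g(x).
freq = labels.mean()
print(f"g(x) = {p_fixed:.3f}, fraction of ones in {n_draws} draws = {freq:.3f}")
print(f"First ten labels for the same x: {labels[:10]}")
```

Both classes appear among the draws even though x never changes, so no classifier can predict these labels perfectly; the best it can do is recover g(x).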
