telmo-correa / all-of-statistics

Self-study on Larry Wasserman's "All of Statistics"

Mistake in risk estimation #36

Open SergeyPetrakov opened 1 year ago

SergeyPetrakov commented 1 year ago

Hello! Thank you for the great work! There seems to be a mistake in the function j_hat_kde in Chapter 21 - Nonparametric Curve Estimation. As mentioned in the book and in your text, K^(2) is the N(0, 2) density, so the function should look like the version below. The docstring also seems to contain a mistake: it says that h is the bandwidth for the dataset rescaled to [0, 1], but I did not find any evidence for such a transformation in the book, and the code of this function does not perform it, so h is simply the bandwidth. Could you please double-check this?

import numpy as np
from itertools import product
from scipy.stats import norm

def j_hat_kde(X, h):
    r"""
    Calculate the approximate estimated KDE risk J_hat for a N(0, 1) Gaussian kernel

      \hat{J}(h) = \frac{1}{hn^2}\sum_{i, j} K^* \left( \frac{X_i - X_j}{h} \right) + \frac{2}{nh} K(0)

    where:
      n is the dataset size
      h is the bandwidth
      K^* is K^{(2)}(x) - 2 K(x), and K^{(2)} is the convolved kernel, K^{(2)}(z) = \int K(z - y) K(y) dy
      K is the original kernel

    X is expected to be a pandas Series (it is indexed with .iloc).
    """
    n = len(X)
    # Scaled differences (X_i - X_j) / h over all ordered pairs (i, j)
    Kstar_args = np.array([X.iloc[i] - X.iloc[j] for i, j in product(range(n), range(n))]) / h
    # K^{(2)} is the N(0, 2) density (variance 2); scipy's `scale` is a standard deviation, hence sqrt(2)
    sum_value = np.sum(norm.pdf(Kstar_args, loc=0, scale=np.sqrt(2)) - 2 * norm.pdf(Kstar_args, loc=0, scale=1))
    return sum_value / (h * n * n) + 2 * norm.pdf(0) / (n * h)
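
One detail worth checking when coding this up: scipy's norm.pdf takes scale as a standard deviation, so the N(0, 2) density from the book (variance 2) corresponds to scale=np.sqrt(2). Below is a minimal sketch that checks this by numerically convolving the N(0, 1) kernel with itself, together with a vectorized variant of the risk estimate. The names convolved_gaussian_kernel and j_hat_kde_vectorized, the integration grid, and the test points are my own assumptions for illustration; they are not from the book or the repository.

```python
import numpy as np
from scipy.stats import norm


def convolved_gaussian_kernel(z):
    # Riemann-sum approximation of K^(2)(z) = integral of K(z - y) K(y) dy
    # for the N(0, 1) kernel K. The grid [-10, 10] is an assumed choice that
    # comfortably covers the mass of both Gaussian factors.
    y = np.linspace(-10.0, 10.0, 20001)
    dy = y[1] - y[0]
    return np.sum(norm.pdf(z - y) * norm.pdf(y)) * dy


def j_hat_kde_vectorized(X, h):
    # Hypothetical vectorized variant of j_hat_kde: same formula, but the
    # pairwise differences are built with NumPy broadcasting.
    x = np.asarray(X, dtype=float)
    n = len(x)
    u = (x[:, None] - x[None, :]) / h  # (X_i - X_j) / h for all pairs (i, j)
    # K^* = K^(2) - 2K, with K^(2) the N(0, 2) density (standard deviation sqrt(2))
    k_star = norm.pdf(u, scale=np.sqrt(2)) - 2 * norm.pdf(u)
    return k_star.sum() / (h * n**2) + 2 * norm.pdf(0) / (n * h)


# Compare the numerical convolution with the two candidate scale values.
z_values = np.array([0.0, 0.5, 1.0, 2.0])
numeric = np.array([convolved_gaussian_kernel(z) for z in z_values])
variance_2 = norm.pdf(z_values, scale=np.sqrt(2))  # variance 2
std_2 = norm.pdf(z_values, scale=2)                # standard deviation 2

print("matches scale=sqrt(2):", np.allclose(numeric, variance_2))
print("matches scale=2:      ", np.allclose(numeric, std_2))
```

The vectorized variant accepts any array-like input (list, NumPy array, or pandas Series) and avoids the Python-level loop over all n^2 pairs, at the cost of holding an n-by-n array in memory.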