statwangz / MATH-4432-Statistical-Machine-Learning

Tutorials for MATH 4432 Statistical Machine Learning, HKUST, Fall 2022

Can you plz explain why curse of dimensionality holds for KNN, but not for standard linear regression? #1

Closed. aorazalin closed this issue 2 years ago

aorazalin commented 2 years ago

Slide 53 of LinearModel.pdf says that as p (the number of predictors) increases, linear regression's test MSE increases, which makes sense (more parameters to estimate, hence more variance). But I don't get the following:

  1. Why does KNN's MSE increase much more than linear regression's? I presume this means the MSE of KNN grows exponentially as p increases, but why? I didn't quite understand the explanation at https://en.wikipedia.org/wiki/Curse_of_dimensionality#Nearest_neighbor_search. This page https://stats.stackexchange.com/questions/159070/curse-of-dimensionality-knn-classifier sort of explains the curse of dimensionality with an example: to find the K closest points you have to look further out, so those points end up far away from your query point. How does that make MSE ~ O(c^p)?
  2. Why is this not the case for linear models? I would appreciate a more rigorous explanation (or an intuitive one plus a link to a related paper). Thank you
statwangz commented 2 years ago

Good question! I think you can find the answer in Section 3.5 of our reference book, An Introduction to Statistical Learning. When p is large, the K observations that are nearest to a given test observation x0 may still be very far away from x0 in p-dimensional space, leading to a very poor prediction of f(x0) and hence a poor KNN fit. Intuitively, your sense of distance behaves very differently in low- and high-dimensional spaces: in a high-dimensional space, points that are "similar" can still be far away from each other.
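To make this concrete, here is a small simulation sketch (my own illustration, not from the slides or ISLR; it assumes points drawn uniformly on [0,1]^p and uses NumPy) that estimates how far the nearest neighbor of a test point typically is as p grows while n stays fixed:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000  # number of "training" points, kept fixed as the dimension grows

for p in [1, 2, 5, 10, 20, 50]:
    X = rng.uniform(size=(n, p))    # training points in [0, 1]^p
    x0 = rng.uniform(size=(1, p))   # a single test point
    dists = np.linalg.norm(X - x0, axis=1)  # Euclidean distances to x0
    print(f"p = {p:3d}, distance to nearest neighbor = {dists.min():.3f}")
```

With n fixed, the distance to the nearest neighbor keeps growing as p increases, so the K "nearest" neighbors are no longer local, and averaging their responses gives a badly biased estimate of f(x0). Equivalently, a hypercube that must capture a fixed fraction r of uniformly distributed points needs side length r^(1/p), which approaches 1 as p grows, so a "local" neighborhood ends up spanning almost the whole range of every predictor.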

Best, Zhiwei

statwangz commented 2 years ago

A more intuitive explanation: when applying KNN, every dimension, whether informative or not, plays exactly the same role in computing distances (e.g., Euclidean distance), because all dimensions receive equal weight. In linear regression, by contrast, each coefficient is estimated from the data, so its value reflects the association between that predictor and the response; uninformative predictors simply get coefficients close to zero. A rough illustration of this is sketched below.
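Here is the sketch (again my own toy example, not part of the tutorial; it assumes scikit-learn and a model with a single informative predictor plus p - 1 pure-noise dimensions), comparing the test MSE of KNN regression and ordinary least squares as noise dimensions are added:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(1)
n_train, n_test = 500, 500

def f(X):
    # Only the first predictor matters; the remaining columns are noise.
    return 2.0 * X[:, 0]

for p in [1, 2, 5, 10, 20, 50]:
    X_train = rng.uniform(-1, 1, size=(n_train, p))
    X_test = rng.uniform(-1, 1, size=(n_test, p))
    y_train = f(X_train) + rng.normal(scale=0.1, size=n_train)
    y_test = f(X_test) + rng.normal(scale=0.1, size=n_test)

    knn = KNeighborsRegressor(n_neighbors=10).fit(X_train, y_train)
    ols = LinearRegression().fit(X_train, y_train)

    mse = lambda model: np.mean((model.predict(X_test) - y_test) ** 2)
    print(f"p = {p:3d}, KNN MSE = {mse(knn):.3f}, OLS MSE = {mse(ols):.3f}")
```

Because every noise dimension enters the Euclidean distance with the same weight, the KNN error deteriorates quickly as p grows, while OLS estimates coefficients near zero for the noise predictors and its test MSE barely changes. This is the pattern discussed in Section 3.5 of ISLR.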

Best, Zhiwei

aorazalin commented 2 years ago

Thank you!