macss-modeling / General-Questions

The zero mean assumptions of PCA #24

Closed: jinfei1125 closed this issue 3 years ago

jinfei1125 commented 3 years ago

Hi, I have a question about the zero-mean assumption of PCA. It assumes that each column of the X matrix sums to zero (i.e., has mean zero), which is often not the case, as far as I can tell. For example, in the USArrests example in ISL, neither the number of police in a city nor the number of Assault crimes should have a mean of zero (they should be at least positive). But the formula for the proportion of variance explained still relies on the zero-mean assumption. Should we standardize our data to mean zero and standard deviation one before we conduct PCA? (We already need to scale each variable to have standard deviation one before we perform PCA. Should we also center the mean explicitly instead of just assuming the mean is 0?)

(Screenshot from page 382 of ISL.)

Thanks in advance!

(Sorry, I should have asked this question last week. I have been thinking about it for a while, but I still don't have the answer.)

pdwaggoner commented 3 years ago

Thanks for the question. Data should always be mean-centered before PCA. PCA has two requirements: the data are mean-centered (so that each PC passes through the origin), and all PCs are orthogonal. Because of the first requirement, most (if not all) PCA implementations/functions/software center the data by default. Indeed, this is the case with the function we used in class, prcomp(), which centers by default (see defaults here: https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/prcomp), and this is what the ISL authors use in the text too. Just in case, we also use scale() from base R, which mean-centers the data by default (see here: https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/scale). So this is why the authors say "assuming that the variables have been centered to have mean zero...": the variables must in fact be centered if we are to use PCA correctly. I hope this helps!
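
To make the role of centering in the proportion of variance explained (PVE) concrete, here is a short sketch in ISL's notation. It assumes the screenshot shows the PVE formula from that page, with $x_{ij}$ the data, $\bar{x}_j$ the mean of column $j$, $\phi_{jm}$ the loadings, and $z_{im}$ the scores. The variance of a column reduces to a plain mean of squares only when the column mean is zero:

$$
\operatorname{Var}(X_j) = \frac{1}{n}\sum_{i=1}^{n}\left(x_{ij}-\bar{x}_j\right)^2 = \frac{1}{n}\sum_{i=1}^{n}x_{ij}^2 \quad \text{when } \bar{x}_j = 0 .
$$

With centered columns, the PVE of the $m$th principal component is then

$$
\mathrm{PVE}_m = \frac{\sum_{i=1}^{n} z_{im}^2}{\sum_{j=1}^{p}\sum_{i=1}^{n} x_{ij}^2}, \qquad z_{im} = \sum_{j=1}^{p}\phi_{jm}\,x_{ij} .
$$

So the zero mean is not a property the raw data are expected to have. It is something we impose by subtracting each column's mean before applying the formula.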

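As a quick check of the defaults described above, the following is a minimal sketch using the built-in USArrests data. The object names (`X`, `pca_default`, `pca_manual`, `pca_raw`, `pve`) are illustrative and not from the class materials. One point worth noting: `prcomp()` centers by default (`center = TRUE`) but does not scale to unit standard deviation unless `scale. = TRUE` is set, whereas `scale()` both centers and scales by default.

```r
# USArrests ships with base R (datasets package).
X <- as.matrix(USArrests)

# The raw column means are clearly not zero.
colMeans(X)

# prcomp() centers internally by default; scale. = TRUE also standardizes
# each variable to standard deviation one.
pca_default <- prcomp(X, scale. = TRUE)

# Equivalent route: center and scale by hand with scale(), then tell
# prcomp() not to center or scale again.
X_std <- scale(X)  # center = TRUE and scale = TRUE are the defaults
pca_manual <- prcomp(X_std, center = FALSE, scale. = FALSE)

# Both routes give the same component standard deviations.
all.equal(pca_default$sdev, pca_manual$sdev)

# Proportion of variance explained by each component.
pve <- pca_default$sdev^2 / sum(pca_default$sdev^2)
pve

# For contrast: switching centering off on the raw data changes the result,
# because the "variance" is then measured around the origin, not the mean.
pca_raw <- prcomp(X, center = FALSE)
sum(pca_raw$sdev^2) > sum(prcomp(X)$sdev^2)
```

In practice this means there is no need to center by hand before calling `prcomp()`; doing so with `scale()` is harmless and simply makes the step explicit.
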
jinfei1125 commented 3 years ago

Got it! Thanks for the explanation!