jschulberg / Dog-Returns

A data science analysis to classify whether or not an adopted dog will be returned.

Can we use PCA on Indicator Variables? #23

Closed jschulberg closed 2 years ago

jschulberg commented 2 years ago

From StackOverflow:

While you can use PCA on binary data (e.g. one-hot encoded data), that does not mean it is a good idea, or that it will work very well.

PCA is designed for continuous variables. It operates on variance (i.e., squared deviations from the mean), and the concept of squared deviations breaks down when you have binary variables.

So yes, you can use PCA. And yes, you get an output. It is even a least-squares output: it's not as if PCA would segfault on such data. It works, but it is just much less meaningful than you'd want it to be, and arguably less meaningful than e.g. frequent pattern mining.
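To make the point concrete, here's a minimal sketch (the dog indicator matrix below is made up for illustration) showing that scikit-learn's `PCA` happily fits and transforms 0/1 data; it just optimizes squared deviations, which are a questionable notion of distance for binary features:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical binary (one-hot-style) matrix: 6 dogs x 4 indicator columns
X = np.array([
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
])

pca = PCA(n_components=2)
scores = pca.fit_transform(X)  # a least-squares projection, even on 0/1 data
print(scores.shape)            # (6, 2)
print(pca.explained_variance_ratio_)
```

The output is perfectly well-formed; the question raised above is whether distances between these projected points mean much when the inputs were indicators.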

jschulberg commented 2 years ago

From this post, it looks like it's actually a popular technique to One-Hot Encode (OHE) and then apply PCA. It's not as meaningful as doing PCA on continuous variables, but because OHE tends to substantially increase the dimensionality of our data, PCA helps pare down the number of variables we end up having.
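A quick sketch of that OHE-then-PCA pattern, using made-up categorical columns in place of the real dataset's features: three categorical columns expand to nine indicator columns under one-hot encoding, and PCA pares them back down:

```python
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical categorical features standing in for the real dataset's columns
df = pd.DataFrame({
    "breed": ["lab", "pug", "lab", "husky", "pug", "husky"],
    "sex":   ["m", "f", "f", "m", "m", "f"],
    "color": ["black", "tan", "yellow", "grey", "tan", "grey"],
})

# One-Hot Encode: 3 categorical columns become 9 indicator columns (3 + 2 + 4)
ohe = pd.get_dummies(df)
print(ohe.shape)  # (6, 9)

# PCA pares the expanded matrix back down to a handful of components
pca = PCA(n_components=3)
reduced = pca.fit_transform(ohe)
print(reduced.shape)  # (6, 3)
```

On a real dataset with high-cardinality columns the blow-up is far worse, which is exactly why the PCA step earns its keep here despite the caveats above.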

jschulberg commented 2 years ago

PCA doesn't seem to be working too well. It may be worthwhile to use the prince package to attempt Multiple Correspondence Analysis (MCA) instead, which is designed for multiple categorical features. Or we could try Multiple Factor Analysis (MFA), which works on a mix of continuous and categorical features.