tidymodels / recipes

Pipeable steps for feature engineering and data preprocessing to prepare for modeling
https://recipes.tidymodels.org

Enforce consistent PCA signs #653

Closed topepo closed 2 years ago

topepo commented 3 years ago

PCA rotations and scores are unique only up to their signs. One issue is that a minor modification of the data might flip a sign (the examples in ?prcomp actually say "signs are random"). This is sometimes hard to explain to new users.

We can enforce a consistent sign by changing the sign of the entire eigenvector based on the sign of the first value of the eigenvector. In other words,

if (eigenvector[1] < 0) {
  eigenvector <- -eigenvector
}

This is done column-wise to the rotation matrix.

This would only change the sign of the PCA scores for recipes that were prepped under this change (i.e., it is backward compatible).
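As a rough illustration of the column-wise rule described above, here is a minimal sketch in base R. The helper name `fix_sign_first()` is hypothetical and is not part of recipes; the built-in `USArrests` data set is used only as an example.

```r
# Hypothetical helper (not a recipes function): flip every column of a
# rotation matrix whose first element is negative.
fix_sign_first <- function(rotation) {
  flip <- ifelse(rotation[1, ] < 0, -1, 1)
  sweep(rotation, 2, flip, `*`)
}

pca <- prcomp(USArrests, scale. = TRUE)
rotation_fixed <- fix_sign_first(pca$rotation)

# Scores recomputed from the sign-fixed rotation; each column matches
# the original scores up to its sign.
scores_fixed <- scale(USArrests) %*% rotation_fixed
```

Flipping a column of the rotation matrix flips the matching column of scores, so only signs differ from the unmodified `prcomp()` output.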

topepo commented 3 years ago

Davis found this bit as a precedent from Matlab:

The pca function imposes a sign convention, forcing the element with the largest magnitude in each column of coefs to be positive. Changing the sign of a coefficient vector does not change its meaning.
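The convention quoted above could be sketched in base R roughly as follows. This is an illustration of the idea only, not Matlab's or recipes' actual code, and `fix_sign_max_abs()` is a hypothetical name.

```r
# Hypothetical helper: make the largest-magnitude element of each
# column of a rotation matrix positive.
fix_sign_max_abs <- function(rotation) {
  # Row index of the largest-magnitude element in each column
  idx <- apply(abs(rotation), 2, which.max)
  # Sign of that element; an exact zero is treated as positive
  s <- sign(rotation[cbind(idx, seq_len(ncol(rotation)))])
  s[s == 0] <- 1
  sweep(rotation, 2, s, `*`)
}

pca <- prcomp(USArrests, scale. = TRUE)
rotation_fixed <- fix_sign_max_abs(pca$rotation)
```

Compared with a rule based on the first element, this one keys on an element that is far from zero, so small perturbations are less likely to push it across zero.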

topepo commented 3 years ago

Here is a comparison of three strategies:

It looks like we can ensure consistent signs in the loadings but not the scores. This makes some sense since the scores are a function of all of the eigenvectors at once.

Here are the simulation results that measure the percentage of score and loading signs (out of 50 loadings) that change when the number of rows is perturbed:

[image: simulation results]

None of the strategies fully resolves the sign ambiguity of the PCA scores. I'm interested to know what @bwlewis thinks about this.

EDIT: fixed issues with computing the percent change - needed a separate data set
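For readers who want to try a perturbation check of this kind, the following is a small sketch under stated assumptions. It is not the simulation used in this thread: the data are simulated Gaussian noise, only the first-element sign rule is applied, and all helper names are hypothetical. Scores are computed on a separate data set, as the edit note suggests.

```r
set.seed(1)
p <- 5
x <- matrix(rnorm(200 * p), ncol = p)      # training data (simulated)
x_new <- matrix(rnorm(50 * p), ncol = p)   # separate data set for scoring

# Hypothetical helper: first-element sign rule, applied column-wise
fix_sign_first <- function(rotation) {
  sweep(rotation, 2, ifelse(rotation[1, ] < 0, -1, 1), `*`)
}

# Fit PCA and keep the centering vector plus the sign-fixed rotation
fit_pca <- function(dat) {
  fit <- prcomp(dat)
  list(center = fit$center, rotation = fix_sign_first(fit$rotation))
}

# Project new data using the training center and rotation
score <- function(fit, new) sweep(new, 2, fit$center) %*% fit$rotation

fit_full <- fit_pca(x)
fit_pert <- fit_pca(x[-sample(nrow(x), 10), ])  # drop 10 random rows

# Percentage of loadings, and of held-out scores, whose sign differs
pct_loadings <- 100 * mean(sign(fit_full$rotation) != sign(fit_pert$rotation))
pct_scores <- 100 * mean(sign(score(fit_full, x_new)) != sign(score(fit_pert, x_new)))
```

Repeating this over many perturbations and sign rules would give percentages comparable in spirit to those summarized above.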

alexpghayes commented 3 years ago

Sorry to repeat some of what is in Slack, but:

These issues very much do come up in practice. To enforce identification when it is possible, I would use something like a Procrustes or varimax rotation. If the primary issue is teaching, I would create a new step, step_pca_identified(), that does PCA followed by a choice of Procrustes or varimax, defaulting to Procrustes. This would be a nice thing to have in any case.
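To make the Procrustes idea concrete, here is a minimal sketch of an orthogonal Procrustes alignment in base R. It assumes a set of reference loadings is available to align against; `procrustes_align()` is a hypothetical helper, and step_pca_identified() does not exist in recipes.

```r
# Orthogonal Procrustes: find the orthogonal matrix Q that minimises
# ||target - loadings %*% Q||_F, then return the aligned loadings.
procrustes_align <- function(loadings, target) {
  s <- svd(crossprod(loadings, target))
  q <- s$u %*% t(s$v)
  loadings %*% q
}

set.seed(1)
# Reference loadings from the full data (first two components)
ref <- prcomp(USArrests, scale. = TRUE)$rotation[, 1:2]

# Loadings refit on a perturbed data set (five random rows dropped)
keep <- -sample(nrow(USArrests), 5)
pert <- prcomp(USArrests[keep, ], scale. = TRUE)$rotation[, 1:2]

aligned <- procrustes_align(pert, ref)
```

Because the alignment absorbs any orthogonal transformation, flipping the sign of a column of `pert` beforehand leaves `aligned` unchanged.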

Alternatively you might just want to go all in on a step for the traditional factor analytic rotations, and then alias one of them to step_pca_identified().
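As one example of a traditional factor analytic rotation, base R ships `stats::varimax()`. The sketch below applies it to the first two PCA loadings; the choice of two components and of `normalize = FALSE` are assumptions made for illustration.

```r
pca <- prcomp(USArrests, scale. = TRUE)
k <- 2
loadings_k <- pca$rotation[, seq_len(k)]

# varimax() returns the rotated loadings and the rotation matrix used
vm <- varimax(loadings_k, normalize = FALSE)
rotated_loadings <- unclass(vm$loadings)

# Applying the same orthogonal rotation to the scores keeps them
# consistent with the rotated loadings
rotated_scores <- pca$x[, seq_len(k)] %*% vm$rotmat
```

With `normalize = FALSE`, the rotated loadings are simply `loadings_k %*% vm$rotmat`, which makes the relationship to the rotated scores easy to verify.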

It can be nice to force principal components to have positive skew (we do this in vsp) to make them look positive, but this does not resolve identification issues for distributions symmetric around zero.
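A positive-skew convention might look roughly like the sketch below. It is not the vsp implementation; `skew()` and `fix_sign_skew()` are hypothetical helpers written for illustration.

```r
# Sample skewness: third central moment over the 1.5 power of the second
skew <- function(v) mean((v - mean(v))^3) / mean((v - mean(v))^2)^1.5

# Flip each score column, and the matching loading column, so that the
# scores have non-negative skewness.
fix_sign_skew <- function(pca) {
  s <- sign(apply(pca$x, 2, skew))
  s[s == 0] <- 1
  pca$x <- sweep(pca$x, 2, s, `*`)
  pca$rotation <- sweep(pca$rotation, 2, s, `*`)
  pca
}

pca_fixed <- fix_sign_skew(prcomp(USArrests, scale. = TRUE))
```

When a component's scores are close to symmetric around zero, the estimated skewness is near zero and its sign is unstable, which is the limitation noted above.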
