Open stevenlis opened 4 years ago
I never seriously used FA, so I'm not sure what the workflow is
Here are two notebooks from the time of the merge when I tried out different functionality. I'm not sure what the status is
https://gist.github.com/josef-pkt/1da753c160a054c9f342e735ee9be05e https://gist.github.com/josef-pkt/5e4f652c732978ce0e18f85164a29a14
github doesn't show cell background colors here's the nbviewer version for the styling notebook https://nbviewer.jupyter.org/gist/josef-pkt/5e4f652c732978ce0e18f85164a29a14
@josef-pkt Thanks for the notebook.
I just tried with the examples included in the notebook, and I was not able to repro the example in this file: https://www.stata.com/manuals13/mvfactor.pdf
use http://www.stata-press.com/data/r13/bg2, clear
factor bg2cost1-bg2cost6, ml
rotate
(Physician-cost data)
(obs=568)
number of factors adjusted to 3
Iteration 0: log likelihood = -28.605724
Iteration 1: log likelihood = -.66919454
Iteration 2: log likelihood = -.00634396
Iteration 3: log likelihood = -6.465e-06
Iteration 4: log likelihood = -3.342e-12
Factor analysis/correlation Number of obs = 568
Method: maximum likelihood Retained factors = 3
Schwarz's BIC = 95.1318
Log likelihood = -3.34e-12 (Akaike's) AIC = 30
--------------------------------------------------------------------------
Factor | Eigenvalue Difference Proportion Cumulative
-------------+------------------------------------------------------------
Factor1 | 1.07659 0.31904 0.5229 0.5229
Factor2 | 0.75755 0.53268 0.3679 0.8908
Factor3 | 0.22487 . 0.1092 1.0000
--------------------------------------------------------------------------
LR test: independent vs. saturated: chi2(15) = 269.07 Prob>chi2 = 0.0000
(the model with 3 factors is saturated)
Factor loadings (pattern matrix) and unique variances
-----------------------------------------------------------
Variable | Factor1 Factor2 Factor3 | Uniqueness
-------------+------------------------------+--------------
bg2cost1 | 0.2315 0.3840 -0.1662 | 0.7713
bg2cost2 | -0.4172 0.3811 -0.2541 | 0.6161
bg2cost3 | -0.4431 0.4662 0.1354 | 0.5680
bg2cost4 | -0.3133 0.2014 0.2578 | 0.7948
bg2cost5 | 0.5564 0.3600 0.1749 | 0.5302
bg2cost6 | 0.4948 0.2780 -0.1315 | 0.6606
-----------------------------------------------------------
Factor analysis/correlation Number of obs = 568
Method: maximum likelihood Retained factors = 3
Rotation: orthogonal varimax (Kaiser off) Number of params = 15
Schwarz's BIC = 95.1318
Log likelihood = -3.34e-12 (Akaike's) AIC = 30
--------------------------------------------------------------------------
Factor | Variance Difference Proportion Cumulative
-------------+------------------------------------------------------------
Factor1 | 0.91912 0.05115 0.4464 0.4464
Factor2 | 0.86797 0.59607 0.4216 0.8679
Factor3 | 0.27190 . 0.1321 1.0000
--------------------------------------------------------------------------
LR test: independent vs. saturated: chi2(15) = 269.07 Prob>chi2 = 0.0000
(the model with 3 factors is saturated)
Rotated factor loadings (pattern matrix) and unique variances
-----------------------------------------------------------
Variable | Factor1 Factor2 Factor3 | Uniqueness
-------------+------------------------------+--------------
bg2cost1 | 0.3902 0.0995 0.2579 | 0.7713
bg2cost2 | -0.1251 0.4910 0.3567 | 0.6161
bg2cost3 | -0.0324 0.6565 0.0050 | 0.5680
bg2cost4 | -0.0772 0.4053 -0.1869 | 0.7948
bg2cost5 | 0.6784 -0.0497 -0.0845 | 0.5302
bg2cost6 | 0.5329 -0.1389 0.1900 | 0.6606
-----------------------------------------------------------
Factor rotation matrix
-----------------------------------------
| Factor1 Factor2 Factor3
-------------+---------------------------
Factor1 | 0.7707 -0.6368 -0.0230
Factor2 | 0.6183 0.7386 0.2687
Factor3 | 0.1541 0.2213 -0.9630
-----------------------------------------
from statsmodels.datasets import webuse
from statsmodels.multivariate.factor import Factor
df = webuse(data='bg2')
endog = [col for col in df.columns if col.startswith('bg2')]
model = Factor(endog=df[endogs], n_factor=3, method='ml')
results = model.fit()
results.rotate('varimax')
print(results.summary())
actor analysis results
======================================================
------------------------------------------------------
Communality
------------------------------------------------------
bg2cost1 bg2cost2 bg2cost3 bg2cost4 bg2cost5 bg2cost6
------------------------------------------------------
0.2287 0.3839 0.4320 0.2052 0.4698 0.3394
------------------------------------------------------
------------------------------------------------------
Pre-rotated loadings
---------------------------------------------------------------
factor 0 factor 1 factor 2
---------------------------------------------------------------
bg2cost1 0.2315 -0.1662 0.3840
bg2cost2 -0.4172 -0.2541 0.3811
bg2cost3 -0.4431 0.1354 0.4662
bg2cost4 -0.3133 0.2578 0.2014
bg2cost5 0.5564 0.1749 0.3600
bg2cost6 0.4948 -0.1315 0.2780
------------------------------------------------------
------------------------------------------------------
varimax rotated loadings
---------------------------------------------------------------
factor 0 factor 1 factor 2
---------------------------------------------------------------
bg2cost1 0.3902 -0.2579 0.0995
bg2cost2 -0.1251 -0.3565 0.4911
bg2cost3 -0.0324 -0.0048 0.6565
bg2cost4 -0.0772 0.1870 0.4053
bg2cost5 0.6784 0.0844 -0.0497
bg2cost6 0.5329 -0.1901 -0.1389
======================================================
The interesting thing is the second and third factors are switched (some loadings have different signs) in the results.
I'm too far away from this right now to figure out the details.
two things, AFAIR statsmodels uses a different algorithm and not maximum likelihood. I do't know how much effect this has
some loadings have different signs
Some of the linear algebra doesn't define a unique representation, and depend on the underlying linalg implementation. For example, eigenvectors can have arbitrary signs (eigenvector multiplied by -1)
In some cases, I had to standardize the signs in order to get reproducible signs across packages. But I never checked this for FA, and don't know if there is a place where we should fix the sign.
unit test has a comment: "# Note: Stata has second factor with flipped sign compared to statsmodels" but I think I just assumed that is arbitrary and didn't investigate. unit tests adjust the sign to match stata
I don't remember details but I compared some parts with Stata https://github.com/statsmodels/statsmodels/pull/4167#issuecomment-353184422
We do have ML, it is based on the Bai and Li paper which may use slightly different normalizations. There are tests that pass against R factanal.
(I thought we were on the mailing list)
my mistake of not keeping up and reading carefully enough, method='ml'
method pa sorts loadings by size of eigenvalues.
Based on a quick look method='ml'
might have in some cases random ordering of loading columns, IIUC
The starting values are created from a random initial loading matrix _fit_ml_em
AFAICS, there is no sorting to get a deterministic sequence of loadings.
or do I miss something?
The doc:
The method to extract factors, currently must be either ‘pa’ for principal axis factor analysis or ‘ml’ for maximum likelihood estimation.
Is this a typo or maybe I missed something:
Btw, the Factor
should list under multivariate right @bashtage
The method to extract factors, currently must be either ‘pa’ for principal axis factor analysis or ‘ml’ for maximum likelihood estimation
The doc is correct. method='ml' just slipped my mind because I never played with it.
Yes, should probably be moved.
On Fri, Feb 28, 2020 at 7:35 PM StevenLi-DS notifications@github.com wrote:
The method to extract factors, currently must be either ‘pa’ for principal axis factor analysis or ‘ml’ for maximum likelihood estimation.
Is this a typo or maybe I missed something:
Btw, the Factor should list under multivariate right @bashtage https://github.com/bashtage
[image: Snipaste_634] https://user-images.githubusercontent.com/33796896/75581389-5d1b6900-5a37-11ea-9c9e-6d7f4a2ede7f.png
[image: Snipaste_361] https://user-images.githubusercontent.com/33796896/75581390-5d1b6900-5a37-11ea-834c-6aaa7735bd05.png
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/statsmodels/statsmodels/issues/6559?email_source=notifications&email_token=ABKTSROOBBRNI5ZRKQ345H3RFFROZA5CNFSM4K5I6QQ2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOENJ4AGI#issuecomment-592691225, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABKTSRNSYT6OOC4HXVTNP5LRFFROZANCNFSM4K5I6QQQ .
Please consider adding an example for factor analysis along with PCA in the doc.
https://www.princeton.edu/~otorres/Factor.pdf A general workflow in Stata would be:
I tried with
statsmodels.multivariate.factor.Factor
but I was not able to repro my workflow and the.loadings()
didn't even return anything.Am I missing something?