statsmodels / statsmodels

Statsmodels: statistical modeling and econometrics in Python
http://www.statsmodels.org/devel/
BSD 3-Clause "New" or "Revised" License
10.1k stars 2.88k forks source link

[DOC] add an example for factor analysis #6559

Open stevenlis opened 4 years ago

stevenlis commented 4 years ago

Please consider adding an example for factor analysis along with PCA in the doc.

https://www.princeton.edu/~otorres/Factor.pdf A general workflow in Stata would be:

I tried with statsmodels.multivariate.factor.Factor but I was not able to repro my workflow and the .loadings() didn't even return anything.

Am I missing something?

josef-pkt commented 4 years ago

I never seriously used FA, so I'm not sure what the workflow is

Here are two notebooks from the time of the merge when I tried out different functionality. I'm not sure what the status is

https://gist.github.com/josef-pkt/1da753c160a054c9f342e735ee9be05e https://gist.github.com/josef-pkt/5e4f652c732978ce0e18f85164a29a14

github doesn't show cell background colors here's the nbviewer version for the styling notebook https://nbviewer.jupyter.org/gist/josef-pkt/5e4f652c732978ce0e18f85164a29a14

stevenlis commented 4 years ago

@josef-pkt Thanks for the notebook.

I just tried with the examples included in the notebook, and I was not able to repro the example in this file: https://www.stata.com/manuals13/mvfactor.pdf

use http://www.stata-press.com/data/r13/bg2, clear

factor bg2cost1-bg2cost6, ml
rotate
(Physician-cost data)

(obs=568)
number of factors adjusted to 3
Iteration 0:   log likelihood = -28.605724
Iteration 1:   log likelihood = -.66919454
Iteration 2:   log likelihood = -.00634396
Iteration 3:   log likelihood = -6.465e-06
Iteration 4:   log likelihood = -3.342e-12

Factor analysis/correlation                      Number of obs    =        568
    Method: maximum likelihood                   Retained factors =          3

                                                 Schwarz's BIC    =    95.1318
    Log likelihood = -3.34e-12                   (Akaike's) AIC   =         30

    --------------------------------------------------------------------------
         Factor  |   Eigenvalue   Difference        Proportion   Cumulative
    -------------+------------------------------------------------------------
        Factor1  |      1.07659      0.31904            0.5229       0.5229
        Factor2  |      0.75755      0.53268            0.3679       0.8908
        Factor3  |      0.22487            .            0.1092       1.0000
    --------------------------------------------------------------------------
    LR test: independent vs. saturated:  chi2(15) =  269.07 Prob>chi2 = 0.0000
    (the model with 3 factors is saturated)

Factor loadings (pattern matrix) and unique variances

    -----------------------------------------------------------
        Variable |  Factor1   Factor2   Factor3 |   Uniqueness 
    -------------+------------------------------+--------------
        bg2cost1 |   0.2315    0.3840   -0.1662 |      0.7713  
        bg2cost2 |  -0.4172    0.3811   -0.2541 |      0.6161  
        bg2cost3 |  -0.4431    0.4662    0.1354 |      0.5680  
        bg2cost4 |  -0.3133    0.2014    0.2578 |      0.7948  
        bg2cost5 |   0.5564    0.3600    0.1749 |      0.5302  
        bg2cost6 |   0.4948    0.2780   -0.1315 |      0.6606  
    -----------------------------------------------------------

Factor analysis/correlation                      Number of obs    =        568
    Method: maximum likelihood                   Retained factors =          3
    Rotation: orthogonal varimax (Kaiser off)    Number of params =         15
                                                 Schwarz's BIC    =    95.1318
    Log likelihood = -3.34e-12                   (Akaike's) AIC   =         30

    --------------------------------------------------------------------------
         Factor  |     Variance   Difference        Proportion   Cumulative
    -------------+------------------------------------------------------------
        Factor1  |      0.91912      0.05115            0.4464       0.4464
        Factor2  |      0.86797      0.59607            0.4216       0.8679
        Factor3  |      0.27190            .            0.1321       1.0000
    --------------------------------------------------------------------------
    LR test: independent vs. saturated:  chi2(15) =  269.07 Prob>chi2 = 0.0000
    (the model with 3 factors is saturated)

Rotated factor loadings (pattern matrix) and unique variances

    -----------------------------------------------------------
        Variable |  Factor1   Factor2   Factor3 |   Uniqueness 
    -------------+------------------------------+--------------
        bg2cost1 |   0.3902    0.0995    0.2579 |      0.7713  
        bg2cost2 |  -0.1251    0.4910    0.3567 |      0.6161  
        bg2cost3 |  -0.0324    0.6565    0.0050 |      0.5680  
        bg2cost4 |  -0.0772    0.4053   -0.1869 |      0.7948  
        bg2cost5 |   0.6784   -0.0497   -0.0845 |      0.5302  
        bg2cost6 |   0.5329   -0.1389    0.1900 |      0.6606  
    -----------------------------------------------------------

Factor rotation matrix

    -----------------------------------------
                 | Factor1  Factor2  Factor3 
    -------------+---------------------------
         Factor1 |  0.7707  -0.6368  -0.0230 
         Factor2 |  0.6183   0.7386   0.2687 
         Factor3 |  0.1541   0.2213  -0.9630 
    -----------------------------------------
from statsmodels.datasets import webuse
from statsmodels.multivariate.factor import Factor

df = webuse(data='bg2')
endog = [col for col in df.columns if col.startswith('bg2')]

model = Factor(endog=df[endogs], n_factor=3, method='ml')
results = model.fit()
results.rotate('varimax')

print(results.summary())
actor analysis results
======================================================

------------------------------------------------------
               Communality                            
------------------------------------------------------
 bg2cost1 bg2cost2 bg2cost3 bg2cost4 bg2cost5 bg2cost6
------------------------------------------------------
   0.2287   0.3839   0.4320   0.2052   0.4698   0.3394
------------------------------------------------------

------------------------------------------------------
            Pre-rotated loadings                      
---------------------------------------------------------------
                   factor 0          factor 1          factor 2
---------------------------------------------------------------
bg2cost1             0.2315           -0.1662            0.3840
bg2cost2            -0.4172           -0.2541            0.3811
bg2cost3            -0.4431            0.1354            0.4662
bg2cost4            -0.3133            0.2578            0.2014
bg2cost5             0.5564            0.1749            0.3600
bg2cost6             0.4948           -0.1315            0.2780
------------------------------------------------------

------------------------------------------------------
          varimax rotated loadings                    
---------------------------------------------------------------
                   factor 0          factor 1          factor 2
---------------------------------------------------------------
bg2cost1             0.3902           -0.2579            0.0995
bg2cost2            -0.1251           -0.3565            0.4911
bg2cost3            -0.0324           -0.0048            0.6565
bg2cost4            -0.0772            0.1870            0.4053
bg2cost5             0.6784            0.0844           -0.0497
bg2cost6             0.5329           -0.1901           -0.1389
======================================================

The interesting thing is the second and third factors are switched (some loadings have different signs) in the results.

josef-pkt commented 4 years ago

I'm too far away from this right now to figure out the details.

two things, AFAIR statsmodels uses a different algorithm and not maximum likelihood. I do't know how much effect this has

some loadings have different signs

Some of the linear algebra doesn't define a unique representation, and depend on the underlying linalg implementation. For example, eigenvectors can have arbitrary signs (eigenvector multiplied by -1)

In some cases, I had to standardize the signs in order to get reproducible signs across packages. But I never checked this for FA, and don't know if there is a place where we should fix the sign.

unit test has a comment: "# Note: Stata has second factor with flipped sign compared to statsmodels" but I think I just assumed that is arbitrary and didn't investigate. unit tests adjust the sign to match stata

I don't remember details but I compared some parts with Stata https://github.com/statsmodels/statsmodels/pull/4167#issuecomment-353184422

kshedden commented 4 years ago

We do have ML, it is based on the Bai and Li paper which may use slightly different normalizations. There are tests that pass against R factanal.

josef-pkt commented 4 years ago

(I thought we were on the mailing list)

my mistake of not keeping up and reading carefully enough, method='ml'

method pa sorts loadings by size of eigenvalues.

Based on a quick look method='ml' might have in some cases random ordering of loading columns, IIUC

The starting values are created from a random initial loading matrix _fit_ml_em AFAICS, there is no sorting to get a deterministic sequence of loadings. or do I miss something?

stevenlis commented 4 years ago

The doc:

The method to extract factors, currently must be either ‘pa’ for principal axis factor analysis or ‘ml’ for maximum likelihood estimation.

Is this a typo or maybe I missed something:

Btw, the Factor should list under multivariate right @bashtage

Snipaste_634 Snipaste_361
josef-pkt commented 4 years ago

The method to extract factors, currently must be either ‘pa’ for principal axis factor analysis or ‘ml’ for maximum likelihood estimation

The doc is correct. method='ml' just slipped my mind because I never played with it.

bashtage commented 4 years ago

Yes, should probably be moved.

On Fri, Feb 28, 2020 at 7:35 PM StevenLi-DS notifications@github.com wrote:

The doc https://www.statsmodels.org/stable/generated/statsmodels.multivariate.factor.Factor.html#statsmodels.multivariate.factor.Factor :

The method to extract factors, currently must be either ‘pa’ for principal axis factor analysis or ‘ml’ for maximum likelihood estimation.

Is this a typo or maybe I missed something:

Btw, the Factor should list under multivariate right @bashtage https://github.com/bashtage

[image: Snipaste_634] https://user-images.githubusercontent.com/33796896/75581389-5d1b6900-5a37-11ea-9c9e-6d7f4a2ede7f.png

[image: Snipaste_361] https://user-images.githubusercontent.com/33796896/75581390-5d1b6900-5a37-11ea-834c-6aaa7735bd05.png

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/statsmodels/statsmodels/issues/6559?email_source=notifications&email_token=ABKTSROOBBRNI5ZRKQ345H3RFFROZA5CNFSM4K5I6QQ2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOENJ4AGI#issuecomment-592691225, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABKTSRNSYT6OOC4HXVTNP5LRFFROZANCNFSM4K5I6QQQ .