pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

MIGRATE: move stats code to statsmodels / deprecate in pandas #6077

Closed: jreback closed this issue 8 years ago

jreback commented 10 years ago

related #5884

jreback commented 10 years ago

cc @jseabold cc @josef-pkt

josef-pkt commented 10 years ago

Any thoughts on requirements/wishes with respect to backwards compatibility or easy transition to a new interface from the perspective of pandas users?

From what I remember from when I looked at this years ago (I don't know how much has changed since):

OLS is just a moving/expanding wrapper, where we would only need the slicing code in addition to what statsmodels already has.
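
As a rough illustration of that slicing code: a minimal sketch that loops trailing windows over statsmodels' OLS (df, window, and the column names here are assumptions, not the pandas implementation):

import numpy as np
import pandas as pd
import statsmodels.api as sm

def rolling_ols(df, ycol, xcols, window):
    # Fit OLS on each trailing window; collect coefficients by window end label.
    rows = {}
    for end in range(window, len(df) + 1):
        chunk = df.iloc[end - window:end]
        X = sm.add_constant(chunk[xcols])
        rows[df.index[end - 1]] = sm.OLS(chunk[ycol], X).fit().params
    return pd.DataFrame(rows).T

An expanding version would just fix the window start at the first observation.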

PLM is fixed-effects panel data. Adding the fixed-effect dummies is quite easy, but we have the within estimator in the panel PR, which demeans and avoids the explicit fixed-effect dummies. Panel HAC robust standard errors are available, but they are not the default and most likely have a different degrees-of-freedom correction (I never compared). And the interface to robust standard errors in statsmodels is still not convenient enough.
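
For illustration, the within estimator amounts to group-wise demeaning; a sketch, assuming a DataFrame indexed by an 'entity' level and ignoring degrees-of-freedom corrections:

import statsmodels.api as sm

def within_ols(df, ycol, xcols, entity_level='entity'):
    # Demean y and x by entity so explicit fixed-effect dummies are unnecessary.
    cols = [ycol] + list(xcols)
    demeaned = df[cols] - df[cols].groupby(level=entity_level).transform('mean')
    # No constant: the demeaning removes entity (and overall) means.
    return sm.OLS(demeaned[ycol], demeaned[xcols]).fit()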

VAR in pandas: I don't know anything about that. From a quick look, it needs a check to see whether there are any differences from the statsmodels version.

I never managed to figure out fama_macbeth in pandas (and I don't really know what it's supposed to do). Test coverage, last time I looked, might have been non-existent; from a quick look at the source, there are still only a few smoke tests.

So it looks to me like fama_macbeth needs verifying unit tests before we could merge it into statsmodels. The rest is mainly wrapping around existing, verified statsmodels code.

jreback commented 10 years ago

@josef-pkt Not really sure what, if anything, is actually used anymore.

I personally used to use some of the OLS stuff but switched to statsmodels a while back.

I think it's probably easiest/best for you guys to take what you want / think should be in statsmodels, if you don't have it and it makes sense.

I think pandas can easily put up a deprecation warning, probably in 0.14, but not actually remove anything until a while after.

I don't think it makes sense to support any of the listed items going forward in pandas. One possibility is to provide wrappers back to statsmodels, but since your API is slightly different, I'm not convinced we should do that.

If all of this code is deprecated and removed, pandas could actually drop statsmodels as a dep entirely; pandas obviously would remain a dep of statsmodels. That would remove the circular dep in any event.
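
The mechanics of such a warning are simple; a hypothetical sketch, not the actual pandas.stats code:

import warnings

def fama_macbeth(*args, **kwargs):
    # Hypothetical deprecation shim; the name and message are illustrative only.
    warnings.warn("fama_macbeth is deprecated and will be removed in a "
                  "future version; see statsmodels for replacements.",
                  FutureWarning, stacklevel=2)
    # ... delegate to the existing implementation here ...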

josef-pkt commented 10 years ago

With respect to the deprecation comments in #5884 and here:

statsmodels doesn't support any of the moving/expanding OLS parts out of the box, although it's just looping anyway. I don't think it would be a lot of work to transfer this, but it depends on whether there is really enough demand to make it worth the effort.

Except for the untested fama_macbeth, it's mainly a maintenance issue, since the pandas integration parts have so far been almost exclusively maintained by Skipper. I think for all the statistics/econometrics there should already be, or soon will be, equivalent statsmodels functionality, except for maybe a few details.

ghost commented 10 years ago

statsmodels is emphatically not a dep of pandas. It never has been, and there has never been a circular dependency.

We do allow ourselves to use statsmodels as a dependency for testing and cross-validation only; that's all, and intentionally so. This has been a long-standing agreement between the two projects.

AFAIK, there is no user-facing code in pandas that relies on statsmodels. There shouldn't be.

jseabold commented 10 years ago

FWIW, from people I've talked to and comments I've seen, I do think some of this code is used in production environments, so we definitely need a deprecation cycle. I'm not too concerned about API compatibility provided the deprecation cycle is long enough (time-based, perhaps, rather than release-based, since pandas releases much faster than statsmodels). My thinking was always to keep a similar, separate rolling OLS function and to fold the panel stuff into the to-be-merged https://github.com/statsmodels/statsmodels/issues/1133 with the new API there.

jreback commented 10 years ago

Misspoke about the dep; it's purely optional in pandas, of course.

Probably makes sense to put up a deprecation warning in 0.14; that release is looking at about 3 months out.

@jseabold if that looks too aggressive, we can defer the deprecation to 0.15.

Removal can be 1-2 releases after.

kdiether commented 10 years ago

I hope you guys don't mind this interjection. I noticed the following:

I never managed to figure out fama_macbeth in pandas (and I don't really know what it's supposed to do).

I use pandas and I use Fama-MacBeth, so I thought I could shed some light. Fama-MacBeth is still a widely used econometric method in academic finance (less so than it was 10 years ago, but still pretty common), most of all on the asset pricing research side. It's a cross-sectional regression method meant to handle specifications where there is cross-correlation among the entities (usually stocks). The closest analog to Fama-MacBeth is a pooled regression with time fixed effects and standard errors adjusted for clustering on time as well; they are not the same econometrically, but they tend to give similar results and are meant to overcome similar problems with the data.

Here is a typical case. A panel with all US stocks over time (say monthly 1963:07 - 2013:03):

                      ret     logme     logbm
caldt      permno
1987-07-01 10001   2.1277  1.761665  0.014313
1987-08-01 10001   8.3333  1.782719  0.014313
1987-09-01 10001  -2.2308  1.862761  0.014313
1987-10-01 10001   2.0000  1.824549  0.014313
1987-11-01 10001  -2.9412  1.844352  0.014313
...
                       ret     logme     logbm
caldt      permno
2012-11-01 93436   20.2215  8.071144 -2.590342
2012-12-01 93436    0.1478  8.255310 -2.590342
2013-01-01 93436   10.7470  8.260604 -2.590342
2013-02-01 93436   -7.1448  8.362681 -2.590342
2013-03-01 93436    8.7855  8.288553 -2.590342

I want to look at the cross-sectional relation between future returns and market cap and book-to-market (pretty close to what Fama and French did in 1992):

r_{i,t+1} = a + b1*log(mcap_{i,t}) + b2*log(BM_{i,t}) + e_{i,t}

If I estimate with Fama-MacBeth, there are two stages:

  1. I estimate a cross-sectional regression (of the above) every month (for every t). So there are a, b1, and b2 estimates for each month (597 months in this case), giving a time series of estimates.
  2. I then take the average of the time series of estimated coefficients from stage 1. The standard errors and t-stats are computed from that time series of estimated coefficients. This is what the Fama-MacBeth output produced by pandas computes; in this example it is the following:
 print pd.fama_macbeth(y=stk.ret,x=stk.ix[:,['logme','logbm']])

----------------------Summary of Fama-MacBeth Analysis-------------------------

Formula: Y ~ logme + logbm + intercept
# betas : 597

----------------------Summary of Estimated Coefficients------------------------
     Variable          Beta       Std Err        t-stat       CI 2.5%      CI 97.5%
      (logme)       -0.1652        0.0449         -3.68       -0.2533       -0.0772
      (logbm)        0.2927        0.0674          4.34        0.1606        0.4248
  (intercept)        2.1204        0.3603          5.89        1.4143        2.8265

--------------------------------End of Summary---------------------------------
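
A minimal sketch of those two stages, assuming a DataFrame like stk above, indexed by (caldt, permno); this is just the idea, not the pandas implementation:

import numpy as np
import pandas as pd

def fm_sketch(df, ycol, xcols):
    # Stage 1: one cross-sectional OLS per date -> a time series of coefficients.
    def cross_section(month):
        X = np.column_stack([month[xcols].values, np.ones(len(month))])
        beta, _, _, _ = np.linalg.lstsq(X, month[ycol].values, rcond=None)
        return pd.Series(beta, index=list(xcols) + ['intercept'])

    betas = df.groupby(level='caldt').apply(cross_section)
    # Stage 2: average the coefficients; SEs come from their time-series variation.
    mean = betas.mean()
    se = betas.std(ddof=1) / np.sqrt(len(betas))
    return pd.DataFrame({'Beta': mean, 'Std Err': se, 't-stat': mean / se})

Called as fm_sketch(stk, 'ret', ['logme', 'logbm']), this should roughly reproduce the Beta, Std Err, and t-stat columns above.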

Here is a comparison with a pooled monthly regression with month fixed effects and standard errors clustered by month:

. areg ret logme logbm, absorb(caldt) cluster(caldt)

Linear regression, absorbing indicators           Number of obs   =    2214354
                                                  F(   2,    596) =      27.96
                                                  Prob > F        =     0.0000
                                                  R-squared       =     0.1098
                                                  Adj R-squared   =     0.1096
                                                  Root MSE        =    16.8355

                                (Std. Err. adjusted for 597 clusters in caldt)
------------------------------------------------------------------------------
             |               Robust
         ret |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       logme |  -.1592036   .0532617    -2.99   0.003    -.2638071      -.0546
       logbm |   .4048446   .0952844     4.25   0.000     .2177106    .5919785
       _cons |   2.218065   .2280123     9.73   0.000      1.77026     2.66587
-------------+----------------------------------------------------------------
       caldt |   absorbed                                     (597 categories)
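
For reference, roughly the same pooled regression in statsmodels; a sketch (cov_type='cluster' is a real statsmodels option, but the data set and column names are assumed, and building 597 month dummies via the formula is slow):

import pandas as pd
import statsmodels.formula.api as smf

d = stk.reset_index()  # `stk` assumed: columns caldt, permno, ret, logme, logbm
codes = pd.factorize(d['caldt'])[0]  # integer month labels for clustering
res = smf.ols('ret ~ logme + logbm + C(caldt)', data=d).fit(
    cov_type='cluster', cov_kwds={'groups': codes})
print(res.params[['logme', 'logbm']])  # the month dummies absorb the rest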

josef-pkt commented 10 years ago

@kdiether Thanks, that's useful. Given that you mention cluster-robust standard errors: I remember that Mitchell A. Petersen's "Estimating Standard Errors in Finance Panel Data Sets: Comparing Approaches" (http://rfs.oxfordjournals.org/content/22/1/435.short) describes it, but I mostly skipped Fama-MacBeth when I wrote the cluster-robust and HAC covariance code.

Do you have an example script for your use of fama_macbeth? If we want to include it in statsmodels we need to see how it works and be able to get unit tests for it.

If having something close to Fama-MacBeth is enough, then we could just rely on standard panel methods, with fixed effects and similar automatic support for pandas; we already have cluster-robust, two-way cluster-robust, panel HAC, and Driscoll-Kraay standard errors, mostly consistent with Stata. (Defaults for small-sample corrections and degrees of freedom are still a bit of an open question; they are similar to, but not all the same as, Stata's.)
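
Those covariance options are exposed through fit(); a sketch with placeholder data (y, X, entity_ids, time_ids, and maxlags=4 are assumptions for illustration):

import numpy as np
import statsmodels.api as sm

def robust_fits(y, X, entity_ids, time_ids):
    # y, X: a stacked panel; entity_ids/time_ids label the two panel dimensions.
    model = sm.OLS(y, X)
    cluster = model.fit(cov_type='cluster', cov_kwds={'groups': entity_ids})
    twoway = model.fit(cov_type='cluster',
                       cov_kwds={'groups': np.column_stack([entity_ids, time_ids])})
    dk = model.fit(cov_type='hac-groupsum',  # Driscoll-Kraay
                   cov_kwds={'time': time_ids, 'maxlags': 4})
    hac = model.fit(cov_type='hac-panel',  # panel HAC
                    cov_kwds={'groups': entity_ids, 'maxlags': 4})
    return cluster, twoway, dk, hac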

(edited, because I hit enter by accident)

kdiether commented 10 years ago

@josef-pkt Yep, the Petersen paper discusses how Fama-MacBeth differs from other panel data methods. Also, Cochrane's Asset Pricing book has a discussion of the econometrics of Fama-MacBeth, and Fama's (1976) Foundations of Finance discusses it too (the discussion of how the coefficients from Fama-MacBeth are equivalent to zero-cost portfolios is useful). Mitch's paper is probably the most useful reference in this context.

josef-pkt commented 10 years ago

(I edited my previous comment.) Stata has a user-written module, "XTFMB: Stata module to execute Fama-MacBeth two-step panel regression" (http://ideas.repec.org/c/boc/bocode/s456786.html), that could be used to verify the results.

kdiether commented 10 years ago

Do you have an example script for your use of fama_macbeth?

Sure, no problem. I just need to switch what I did above from CRSP/Compustat data to something using publicly available data.

I get the same estimates using pandas' Fama-MacBeth and Mitch Petersen's Fama-MacBeth Stata ado module:

Pandas

 print pd.fama_macbeth(y=stk.ret,x=stk.ix[:,['logme','logbm']])

----------------------Summary of Fama-MacBeth Analysis-------------------------

Formula: Y ~ logme + logbm + intercept
# betas : 597

----------------------Summary of Estimated Coefficients------------------------
     Variable          Beta       Std Err        t-stat       CI 2.5%      CI 97.5%
      (logme)       -0.1652        0.0449         -3.68       -0.2533       -0.0772
      (logbm)        0.2927        0.0674          4.34        0.1606        0.4248
  (intercept)        2.1204        0.3603          5.89        1.4143        2.8265

Stata

. fm ret logme logbm

Fama-MacBeth Estimation

panel variable:  permno                        Number of obs         = 2214354
 time variable:  caldt                         Number of caldt(s)    =     597
                                               R-squared             =  0.0015
------------------------------------------------------------------------------
         ret |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       logme |   -.165241   .0449536    -3.68   0.000    -.2535273   -.0769546
       logbm |   .2926917     .06747     4.34   0.000     .1601844    .4251991
       _cons |   2.120404   .3605668     5.88   0.000     1.412271    2.828538
------------------------------------------------------------------------------
             597 caldt regressions

Fama-MacBeth would have been used a lot by some of the guys at AQR, so my guess is the pandas implementation has very good numerical accuracy.
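
Given that agreement, verifying unit tests could simply hard-code the Stata numbers above as reference values; a sketch (fama_macbeth_params and stk stand in for whichever implementation and data set end up under test):

import numpy.testing as npt

def test_fama_macbeth_against_stata_fm():
    # Reference values taken from the `fm` output quoted above.
    params = fama_macbeth_params(y=stk['ret'], x=stk[['logme', 'logbm']])
    npt.assert_allclose(params['logme'], -0.165241, rtol=1e-4)
    npt.assert_allclose(params['logbm'], 0.2926917, rtol=1e-4)
    npt.assert_allclose(params['intercept'], 2.120404, rtol=1e-4)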

kdiether commented 10 years ago

@josef-pkt In terms of whether you want it in statsmodels, it really depends on how many academic finance types (or quant types with academic backgrounds) you have in the user base. Even my recent papers still use Fama-MacBeth. Personally, I would really like to see it remain available.

kdiether commented 10 years ago

Here is an example. It is an industry momentum F-M regression using Ken French's 17-industry monthly portfolio returns (http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html) from 2010-2013 (if you use a longer sample this gives a nicely significant relation). The specification is the following:

r_{i,t+1} = a + b1*r_{i,t-12:t-2} + e_{i,t}

Data:

caldt,Food,Mines,Oil,Clths,Durbl,Chems,Cnsum,Cnstr,Steel,FabPr,Machn,Cars,Trans,Utils,Rtail,Finan,Other
201001,-2.15,-12.82,-4.63,-3.60,-3.50,-4.33,-0.38,-2.77,-12.46,-2.91,-7.31,0.36,-2.13,-4.45,-1.30,-0.76,-3.88
201002,1.75,12.45,2.17,7.29,9.71,5.78,1.37,5.83,4.49,7.35,7.20,6.60,6.17,-0.44,3.69,2.92,2.57
201003,4.97,7.23,3.12,10.72,5.53,7.45,3.23,5.37,11.96,5.33,7.76,8.83,8.99,3.09,6.57,8.28,6.23
201004,-1.35,-0.77,3.83,5.21,6.85,2.56,-2.03,10.86,-2.75,3.43,4.78,6.56,2.93,2.86,1.31,1.15,1.42
201005,-4.85,-8.17,-10.65,-6.54,-5.41,-10.62,-6.25,-8.03,-7.52,-8.96,-8.02,-7.19,-7.55,-6.31,-5.69,-9.00,-8.01
201006,-1.96,-7.03,-6.36,-9.77,-8.80,-8.07,-0.03,-13.94,-13.25,-6.34,-5.94,-8.09,-7.22,-0.73,-9.19,-7.17,-4.81
201007,6.69,9.59,7.39,8.09,4.03,14.85,3.45,4.84,8.27,10.40,7.62,16.40,9.68,6.79,4.95,6.53,7.19
201008,-0.63,-0.41,-3.26,-7.06,-3.80,0.02,-0.40,-4.59,-8.74,-3.21,-9.11,-8.68,-6.66,0.40,-3.39,-8.43,-4.27
201009,3.20,11.88,9.38,16.99,11.62,9.94,7.07,10.62,12.29,10.80,14.00,12.59,10.12,3.64,12.45,7.23,10.44
201010,4.01,8.08,4.42,3.50,1.65,8.97,2.91,-1.06,4.72,7.59,5.02,9.46,5.56,1.85,2.42,1.69,4.48
201011,-0.24,5.94,5.30,8.17,5.28,0.65,-4.04,2.30,2.95,3.37,2.17,7.58,0.72,-1.31,5.08,-0.27,-1.10
201012,4.55,14.12,9.12,0.51,6.72,7.30,4.48,13.47,13.38,6.06,6.13,7.39,3.92,3.23,1.70,10.46,6.86
201101,-1.95,-4.91,7.27,-3.66,-1.96,1.72,-1.76,3.80,3.96,2.45,5.07,-0.31,2.60,2.21,-1.81,2.53,2.27
201102,2.89,0.93,8.01,8.24,3.94,5.02,3.77,2.29,4.13,2.67,2.62,-2.49,1.87,2.42,1.68,2.78,4.17
201103,2.22,2.65,1.44,-4.05,5.27,1.65,1.86,1.77,0.86,3.02,-1.02,2.12,2.93,0.92,0.98,-1.54,0.46
201104,4.11,-1.21,1.50,7.68,4.99,5.14,6.44,-0.33,0.69,2.08,2.66,1.74,2.98,4.01,5.17,0.67,3.11
201105,1.85,-5.82,-4.52,1.20,0.63,-3.61,3.03,-3.08,-5.38,-0.22,-2.89,-2.70,-0.81,1.31,1.42,-2.63,-1.01
201106,-0.01,-3.24,-2.36,2.35,0.80,-0.36,-2.68,-2.54,-2.07,-0.89,-2.90,-0.89,0.08,-0.34,-0.85,-2.29,-1.45
201107,-1.64,0.31,0.58,1.47,-6.18,-1.65,-1.97,-5.69,-6.10,-5.85,-2.34,-8.66,-5.63,-0.33,0.70,-3.91,-2.84
201108,-0.28,-6.72,-9.89,-5.69,-7.15,-8.58,0.05,-7.14,-13.23,-6.27,-7.85,-9.34,-6.88,0.14,-2.60,-9.31,-6.05
201109,-4.44,-24.47,-12.07,-5.53,-13.22,-16.77,-3.35,-9.85,-20.70,-12.18,-7.77,-11.77,-7.31,-1.62,-3.24,-11.18,-6.14
201110,3.96,21.99,16.32,16.39,16.55,18.36,4.84,15.63,18.52,17.87,14.69,21.04,13.03,6.66,8.81,13.53,10.16
201111,0.56,-0.09,1.93,-3.16,-3.35,-1.19,2.74,6.05,-1.49,3.82,-1.16,-7.13,1.68,0.98,0.15,-3.14,-0.62
201112,2.08,-8.36,-0.23,-1.94,-2.61,0.26,3.85,3.06,-5.31,-4.01,-1.48,-0.91,0.88,2.93,0.40,1.71,0.57
201201,0.67,12.00,1.48,8.44,10.26,11.48,-1.00,7.62,12.50,6.28,11.53,12.63,5.73,-2.71,3.40,7.31,5.49
201202,0.99,-4.35,6.75,6.34,10.27,1.89,3.38,5.68,-2.62,5.50,6.80,5.11,1.70,1.30,2.88,6.50,4.10
201203,3.70,-7.11,-3.13,1.27,4.49,2.47,3.89,5.50,0.58,1.98,3.58,0.77,0.44,1.45,4.76,6.95,3.31
201204,0.84,-0.73,-1.67,-0.04,-4.11,0.34,0.25,2.12,-3.86,-1.11,-2.69,-3.85,0.63,1.73,1.11,-2.32,-0.39
201205,-1.50,-14.58,-10.56,-5.67,-11.17,-9.53,-2.82,-6.67,-14.71,-5.49,-8.46,-7.78,-4.79,-0.91,-1.57,-9.17,-5.36
201206,3.85,6.35,5.84,-9.12,-0.72,3.26,4.92,6.12,4.85,0.00,1.38,-6.03,3.50,3.40,1.99,4.61,5.17
201207,0.39,-3.95,3.10,2.22,1.60,-1.28,3.20,-2.58,-1.90,2.87,1.18,0.38,-0.51,3.63,1.60,-1.06,0.23
201208,-0.14,4.99,2.40,7.76,6.26,2.79,-0.37,8.59,-1.20,0.77,5.53,3.00,0.54,-3.41,2.59,4.20,2.53
201209,1.16,8.37,3.59,-0.69,3.06,1.06,3.01,6.19,3.49,3.93,-0.44,2.75,-0.84,2.12,1.78,3.58,4.26
201210,-1.81,3.24,-2.17,-0.29,2.05,-2.66,-0.56,2.59,-0.52,-0.73,-5.99,6.06,2.18,1.08,-2.44,1.91,-3.47
201211,3.35,-3.61,-1.32,3.91,2.14,2.53,0.87,5.04,2.51,2.78,1.67,3.81,1.39,-3.83,0.64,-0.88,1.44
201212,-1.25,0.56,1.20,-0.23,0.74,4.41,-1.67,0.81,6.78,1.66,0.14,5.75,2.78,0.10,-1.40,4.31,1.58
201301,5.34,0.28,8.06,3.62,8.37,4.97,8.10,8.08,4.31,5.28,0.76,3.22,5.43,5.05,4.67,6.24,6.32
201302,3.96,-7.14,0.48,2.10,2.35,-0.73,2.04,-0.14,-2.93,2.95,-0.12,0.51,3.43,2.09,0.43,1.30,1.82
201303,6.27,1.18,2.18,3.88,5.86,1.09,4.69,3.09,2.10,3.87,2.10,3.64,5.45,5.55,5.35,4.50,4.13
201304,2.87,-9.65,-1.31,7.59,0.28,3.28,3.22,1.38,-4.18,-1.85,-0.76,4.48,0.13,5.12,3.20,1.61,1.81
201305,-2.94,0.45,3.04,1.01,2.99,3.29,-0.07,6.14,3.89,6.10,5.43,8.90,4.60,-7.09,1.16,7.05,2.60
201306,0.79,-11.01,-2.12,1.77,-2.33,-4.77,-0.96,-4.21,-5.01,-2.43,-3.81,0.21,-0.01,0.78,0.32,-0.45,-1.18
201307,3.14,1.93,4.95,2.50,4.05,6.73,6.53,3.49,6.90,8.17,6.76,8.53,5.61,4.60,5.85,6.43,5.29
201308,-3.83,3.56,-1.88,-2.26,-2.48,-0.40,-3.97,-3.78,-2.91,-1.42,-0.45,-1.02,-2.61,-4.43,-4.62,-4.21,-1.97
201309,0.70,1.95,2.02,7.60,6.37,5.02,2.97,5.54,7.92,5.64,3.31,5.96,6.67,1.05,4.40,3.05,4.86
201310,4.93,5.64,4.80,3.05,5.78,3.30,4.90,3.10,7.94,4.83,4.25,0.53,5.12,3.35,5.54,3.30,3.96
201311,1.49,-4.32,0.89,6.82,3.92,1.44,3.78,1.64,1.50,2.85,3.27,1.62,4.94,-1.81,4.07,6.07,2.61
201312,2.48,6.62,2.92,1.99,2.91,5.53,0.47,3.61,6.73,3.97,3.99,1.43,2.92,1.79,-0.78,2.61,3.56

Code for the example:

#!/usr/bin/env python
from datetime import datetime
import pandas as pd
import numpy as np

def cumret(ret,beg,end):
  # Cumulative log return from month t-beg through t-end (here t-12 to t-2).
  return pd.rolling_sum(np.log(1+ret),window=beg-end+1).shift(end)

if __name__ == '__main__':
  parse = lambda x: datetime(int(x[0:4]),int(x[4:6]),1)
  ind   = pd.read_csv('data/test.csv',index_col='caldt',
                      parse_dates=['caldt'],date_parser=parse)

  # Reshape from wide (one column per industry) to a long panel.
  ind = ind.stack().reset_index()
  ind.columns = ['caldt','industry','ret']
  ind['ret'] = ind['ret']/100

  ind.set_index(['caldt','industry'],inplace=True)
  ind['rmom'] = ind['ret'].groupby(level='industry').apply(cumret,12,2)
  ind = ind[ind.rmom.notnull()]

  print ind.head(20)

  print pd.fama_macbeth(y=ind['ret'],x=ind.ix[:,['rmom']])

The output

                        ret      rmom
caldt      industry
2011-01-01 Food     -0.0195  0.087866
           Mines    -0.0491  0.219223
           Oil       0.0727  0.086754
           Clths    -0.0366  0.290310
           Durbl    -0.0196  0.207907
           Chems     0.0172  0.238446
           Cnsum    -0.0176  0.041910
           Cnstr     0.0380  0.063438
           Steel     0.0396 -0.045276
           FabPr     0.0245  0.242847
           Machn     0.0507  0.149602
           Cars     -0.0031  0.399784
           Trans     0.0260  0.181024
           Utils     0.0221  0.046779
           Rtail    -0.0181  0.149441
           Finan     0.0253  0.002477
           Other     0.0227  0.085414
2011-02-01 Food      0.0289  0.154096
           Mines     0.0093  0.488499
           Oil       0.0801  0.221438

[20 rows x 2 columns]

----------------------Summary of Fama-MacBeth Analysis-------------------------

Formula: Y ~ rmom + intercept
# betas :  36

----------------------Summary of Estimated Coefficients------------------------
     Variable          Beta       Std Err        t-stat       CI 2.5%      CI 97.5%
       (rmom)        0.0051        0.0208          0.24       -0.0356        0.0458
  (intercept)        0.0148        0.0065          2.27        0.0020        0.0276

--------------------------------End of Summary---------------------------------

Note, the reported number of estimated time-series coefficients (# betas) is 36 (in the first stage, 36 cross-sectional regressions are estimated). There are 48 months, but I lose 12 months for each industry computing the return from t-12 to t-2.

For comparison here are the F-M results if I estimate the same thing using data from 1926-2013:

----------------------Summary of Fama-MacBeth Analysis-------------------------

Formula: Y ~ rmom + intercept
# betas : 1038

----------------------Summary of Estimated Coefficients------------------------
     Variable          Beta       Std Err        t-stat       CI 2.5%      CI 97.5%
       (rmom)        0.0206        0.0039          5.24        0.0129        0.0284
  (intercept)        0.0061        0.0015          4.07        0.0032        0.0090

--------------------------------End of Summary---------------------------------
josef-pkt commented 10 years ago

@kdiether Sorry, I forgot to reply here. Thanks for the example. Now that we know fama_macbeth works and where to find a reference implementation for the unit tests, I don't see any problems including it in statsmodels.

Just a question of manpower and who is doing the work.

jreback commented 9 years ago

let's deprecate some of these in 0.18.0