Closed jreback closed 8 years ago
cc @jseabold cc @josef-pkt
Any thoughts on requirements/wishes with respect to backwards compatibility or easy transition to a new interface from the perspective of pandas users?
From what I remember when I looked at this years ago, I don't know how much has changed:
OLS is just a moving/expanding wrapper; on top of what statsmodels already has, we would only need the slicing code.
PLM is fixed-effects panel data. Adding the fixed-effect dummies is quite easy, but we have the within estimator in the panel PR that demeans and avoids the explicit fixed-effect dummies. Panel HAC robust standard errors are available but are not the default and most likely use a different degrees-of-freedom correction (I never compared). And the interface to robust standard errors in statsmodels is still not convenient enough.
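To illustrate the within estimator mentioned above: demeaning by entity recovers the same slope as the dummy-variable fixed-effects regression. This is a minimal sketch on made-up data, not the panel PR's implementation:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "entity": np.repeat(np.arange(5), 20),
    "x": rng.normal(size=100),
})
# entity-specific intercepts (the fixed effects) plus a common slope of 2.0
df["y"] = 2.0 * df["x"] + 3.0 * df["entity"] + rng.normal(scale=0.1, size=100)

# within transformation: subtract entity means from both y and x
demeaned = df[["x", "y"]] - df.groupby("entity")[["x", "y"]].transform("mean")
slope = np.linalg.lstsq(demeaned[["x"]].values, demeaned["y"].values,
                        rcond=None)[0][0]
print(round(slope, 2))  # recovers the common slope without explicit dummies
```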
VAR in pandas: I don't know anything about that. From a quick look, it needs a check to see whether there are any differences from the statsmodels version.
I never managed to figure out fama_macbeth in pandas (and I don't really know what it's supposed to do). Test coverage, last time I looked, might have been non-existent. From a quick look at the source, there are still only a few smoke tests.
So it looks to me that fama_macbeth needs verifying unit tests before we would merge it into statsmodels. The rest is mainly the wrapping around existing verified statsmodels code.
@josef-pkt not really sure what, if anything, is actually used anymore.
I personally used to use some of the OLS stuff but switched to statsmodels a while back.
I think that it's prob easiest / best for you guys to take what you want / think should be in statsmodels if you don't have it and it makes sense.
I think pandas can easily put up a deprecation warning prob in 0.14, but not actually remove for a bit after.
I don't think it makes sense to support any of the listed items going forward in pandas; a possibility is to provide wrappers back to statsmodels (though your API is slightly different), so I'm not convinced we should do this.
If all of this code is deprecated and removed, I think pandas could actually remove statsmodels as a dep entirely; pandas obviously would remain a dep of statsmodels. Would remove the circular dep in any event.
with respect to the deprecation comments in #5884 and here
statsmodels doesn't support any of the moving/expanding ols parts out of the box. Although it's just looping anyway. I don't think it would be a lot of work to transfer this, but it depends on whether there is really a demand for it to be worth the effort.
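For reference, the looping that a moving-window OLS amounts to can be sketched like this (a toy illustration, not the pandas or statsmodels API):

```python
import numpy as np

def rolling_ols(y, x, window):
    """Slope and intercept of y ~ x over each trailing window."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    out = np.full((len(y), 2), np.nan)  # columns: slope, intercept
    for end in range(window, len(y) + 1):
        xs, ys = x[end - window:end], y[end - window:end]
        X = np.column_stack([xs, np.ones(window)])  # add constant
        out[end - 1] = np.linalg.lstsq(X, ys, rcond=None)[0]
    return out

x = np.arange(20.0)
y = 3.0 * x + 1.0  # exact linear relation
params = rolling_ols(y, x, window=5)
print(params[-1])  # recovers slope 3.0 and intercept 1.0
```

An expanding version is the same loop with the window start pinned at 0.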
Except for the untested fama_macbeth, it's mainly a maintenance issue, since all the pandas integration parts are so far almost exclusively maintained by Skipper. I think for all the statistics/econometrics there should already be, or soon will be, an equivalent statsmodels functionality, except for maybe a few details.
statsmodels is emphatically not a dep of pandas. it never has been and there has never been a circular dependency.
We do allow ourselves to use statsmodels as a dependency for testing and cross-validation only, that's all and intentionally so. This has been a long standing agreement between the two projects.
AFAIK, there is no user-facing code in pandas that relies on statsmodels. There shouldn't be.
FWIW, from people I've talked to and comments I've seen, I do think some of this code is used in production environments, so we definitely need a deprecation cycle. I'm not too concerned about API compatibility provided the deprecation cycle is long enough (time-based maybe, not release-based, since pandas releases much faster than statsmodels). My thinking was always to keep a similar, separate rolling OLS function and to fold the panel stuff into the to-be-merged https://github.com/statsmodels/statsmodels/issues/1133 with the new API there.
misspoke about the dep - it's purely optional in pandas of course
prob makes sense to put up a depr warning in 0.14 - release looking at 3 months out
@jseabold if that looks too aggressive then can defer depr to 0.15
removal can be 1-2 releases after
I hope you guys don't mind this note of interjection. I noticed the following:
I never managed to figure out fama_macbeth in pandas (and I don't really know what it's supposed to do).
I use pandas and I use Fama-MacBeth so I thought I could shed some light. Fama-MacBeth is still a well used econometric method in academic finance (less so than it was 10 years ago but still pretty common). It's most common on the asset pricing research side of things. It's a cross-sectional regression method that is meant to handle specifications where there is cross-correlation among the entities (usually stocks). The closest analog to Fama-MacBeth is a pooled regression with time fixed effects and standard errors adjusted for clustering on time as well (they are not the same econometrically but they tend to give similar results and are meant to overcome similar problems with the data).
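The two stages can be sketched in a few lines; this is a minimal illustration on synthetic data, with hypothetical column names, not the pandas implementation:

```python
import numpy as np
import pandas as pd

def fama_macbeth(df, ycol, xcols):
    # Stage 1: one cross-sectional OLS per period (index level "caldt")
    def xsec(g):
        X = np.column_stack([g[xcols].values, np.ones(len(g))])
        coef, *_ = np.linalg.lstsq(X, g[ycol].values, rcond=None)
        return pd.Series(coef, index=xcols + ["intercept"])
    betas = df.groupby(level="caldt").apply(xsec)
    # Stage 2: treat the per-period coefficients as a time series;
    # report their mean and the t-stat of that mean
    mean = betas.mean()
    tstat = mean / (betas.std(ddof=1) / np.sqrt(len(betas)))
    return mean, tstat

# tiny synthetic panel: 24 months x 50 stocks, true slope 0.5
rng = np.random.default_rng(1)
dates = pd.date_range("2020-01-01", periods=24, freq="MS")
idx = pd.MultiIndex.from_product([dates, range(50)],
                                 names=["caldt", "permno"])
panel = pd.DataFrame({"x": rng.normal(size=len(idx))}, index=idx)
panel["ret"] = 0.5 * panel["x"] + rng.normal(scale=0.1, size=len(idx))
mean, tstat = fama_macbeth(panel, "ret", ["x"])
print(mean["x"])  # near the true slope of 0.5
```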
Here is a typical case. A panel with all US stocks over time (say monthly 1963:07 - 2013:03):
                       ret     logme     logbm
caldt      permno
1987-07-01 10001    2.1277  1.761665  0.014313
1987-08-01 10001    8.3333  1.782719  0.014313
1987-09-01 10001   -2.2308  1.862761  0.014313
1987-10-01 10001    2.0000  1.824549  0.014313
1987-11-01 10001   -2.9412  1.844352  0.014313
...
2012-11-01 93436   20.2215  8.071144 -2.590342
2012-12-01 93436    0.1478  8.255310 -2.590342
2013-01-01 93436   10.7470  8.260604 -2.590342
2013-02-01 93436   -7.1448  8.362681 -2.590342
2013-03-01 93436    8.7855  8.288553 -2.590342
I want to look at the cross-sectional relation between future returns and market cap and book-to-market (pretty close to what Fama and French did in 1992):
r_{i,t+1} = a + b1*log(mcap_{i,t}) + b2*log(BM_{i,t}) + e_{i,t}
If I estimate with Fama-MacBeth, there are two stages:
print pd.fama_macbeth(y=stk.ret,x=stk.ix[:,['logme','logbm']])
----------------------Summary of Fama-MacBeth Analysis-------------------------
Formula: Y ~ logme + logbm + intercept
# betas : 597
----------------------Summary of Estimated Coefficients------------------------
   Variable       Beta    Std Err     t-stat    CI 2.5%   CI 97.5%
    (logme)    -0.1652     0.0449      -3.68    -0.2533    -0.0772
    (logbm)     0.2927     0.0674       4.34     0.1606     0.4248
(intercept)     2.1204     0.3603       5.89     1.4143     2.8265
--------------------------------End of Summary---------------------------------
Here is a comparison with a monthly pooled regression with month fixed effects and standard errors clustered by month:
. areg ret logme logbm, absorb(caldt) cluster(caldt)

Linear regression, absorbing indicators           Number of obs =  2214354
                                                  F(  2,   596) =    27.96
                                                  Prob > F      =   0.0000
                                                  R-squared     =   0.1098
                                                  Adj R-squared =   0.1096
                                                  Root MSE      =  16.8355

                               (Std. Err. adjusted for 597 clusters in caldt)
------------------------------------------------------------------------------
             |               Robust
         ret |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       logme |  -.1592036   .0532617    -2.99   0.003    -.2638071      -.0546
       logbm |   .4048446   .0952844     4.25   0.000     .2177106    .5919785
       _cons |   2.218065   .2280123     9.73   0.000      1.77026     2.66587
-------------+----------------------------------------------------------------
       caldt |   absorbed                                   (597 categories)
@kdiether Thanks, that's useful. Given that you mention cluster-robust standard errors, I remember that Mitchell A. Petersen's "Estimating Standard Errors in Finance Panel Data Sets: Comparing Approaches" (http://rfs.oxfordjournals.org/content/22/1/435.short) describes it. But I mostly skipped Fama-MacBeth when I wrote the cluster-robust and HAC covariance code.
Do you have an example script for your use of fama_macbeth? If we want to include it in statsmodels we need to see how it works and be able to get unit tests for it.
If having something close to Fama-MacBeth is enough, then we could just rely on standard panel methods, with fixed effects and similar automatic support for pandas; we already have cluster robust, two-way cluster robust, panel HAC and Driscoll–Kraay covariances available, mostly consistent with Stata. (Defaults for small-sample corrections and degrees of freedom are still a bit of an open question. They are similar but not all the same as Stata's.)
(edited, because I hit enter by accident)
@josef-pkt Yep, the Petersen paper discusses how Fama-MacBeth differs from other panel data methods. Also, Cochrane's Asset Pricing book has a discussion of the econometrics of Fama-MacBeth. Fama's (1976) Foundations of Finance has a discussion of it too (the discussion of how the coefficients from Fama-MacBeth are equivalent to zero-cost portfolios is useful). Mitch's paper is probably the most useful reference in this context.
(I edited my previous comment) Stata has a user written "XTFMB: Stata module to execute Fama-MacBeth two-step panel regression" http://ideas.repec.org/c/boc/bocode/s456786.html that could be used to verify the results.
Do you have an example script for your use of fama_macbeth?
Sure, no problem. I just need to switch what I did above from CRSP/Compustat data to something using publicly available data.
I get the same estimates using pandas' Fama-MacBeth and Mitch Petersen's Fama-MacBeth ado Stata module:
Pandas
print pd.fama_macbeth(y=stk.ret,x=stk.ix[:,['logme','logbm']])
----------------------Summary of Fama-MacBeth Analysis-------------------------
Formula: Y ~ logme + logbm + intercept
# betas : 597
----------------------Summary of Estimated Coefficients------------------------
   Variable       Beta    Std Err     t-stat    CI 2.5%   CI 97.5%
    (logme)    -0.1652     0.0449      -3.68    -0.2533    -0.0772
    (logbm)     0.2927     0.0674       4.34     0.1606     0.4248
(intercept)     2.1204     0.3603       5.89     1.4143     2.8265
Stata
. fm ret logme logbm

Fama-MacBeth Estimation

panel variable: permno           Number of obs      =  2214354
 time variable: caldt            Number of caldt(s) =      597
                                 R-squared          =   0.0015
------------------------------------------------------------------------------
         ret |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       logme |   -.165241   .0449536    -3.68   0.000    -.2535273   -.0769546
       logbm |   .2926917     .06747     4.34   0.000     .1601844    .4251991
       _cons |   2.120404   .3605668     5.88   0.000     1.412271    2.828538
------------------------------------------------------------------------------
597 caldt regressions
Fama-MacBeth would have been used a lot by some of the guys at AQR so my guess is the pandas implementation of it has very good numerical accuracy.
@josef-pkt In terms of whether you want it in statsmodels, it really depends on how many academic finance types (or quant types with academic backgrounds) you have in the userbase. Even my recent papers still use Fama-MacBeth. For me, personally, I would really like to see it still available.
Here is an example. It is an industry momentum F-M regression using Ken French's (http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html) 17 industry monthly portfolio returns from 2010-2013 (if you use a longer sample this will give a nice significant relation). The specification is the following:
r_{i,t+1} = a + b1*r_{i,t-12:t-2} + e_{i,t}
Data:
caldt,Food,Mines,Oil,Clths,Durbl,Chems,Cnsum,Cnstr,Steel,FabPr,Machn,Cars,Trans,Utils,Rtail,Finan,Other
201001,-2.15,-12.82,-4.63,-3.60,-3.50,-4.33,-0.38,-2.77,-12.46,-2.91,-7.31,0.36,-2.13,-4.45,-1.30,-0.76,-3.88
201002,1.75,12.45,2.17,7.29,9.71,5.78,1.37,5.83,4.49,7.35,7.20,6.60,6.17,-0.44,3.69,2.92,2.57
201003,4.97,7.23,3.12,10.72,5.53,7.45,3.23,5.37,11.96,5.33,7.76,8.83,8.99,3.09,6.57,8.28,6.23
201004,-1.35,-0.77,3.83,5.21,6.85,2.56,-2.03,10.86,-2.75,3.43,4.78,6.56,2.93,2.86,1.31,1.15,1.42
201005,-4.85,-8.17,-10.65,-6.54,-5.41,-10.62,-6.25,-8.03,-7.52,-8.96,-8.02,-7.19,-7.55,-6.31,-5.69,-9.00,-8.01
201006,-1.96,-7.03,-6.36,-9.77,-8.80,-8.07,-0.03,-13.94,-13.25,-6.34,-5.94,-8.09,-7.22,-0.73,-9.19,-7.17,-4.81
201007,6.69,9.59,7.39,8.09,4.03,14.85,3.45,4.84,8.27,10.40,7.62,16.40,9.68,6.79,4.95,6.53,7.19
201008,-0.63,-0.41,-3.26,-7.06,-3.80,0.02,-0.40,-4.59,-8.74,-3.21,-9.11,-8.68,-6.66,0.40,-3.39,-8.43,-4.27
201009,3.20,11.88,9.38,16.99,11.62,9.94,7.07,10.62,12.29,10.80,14.00,12.59,10.12,3.64,12.45,7.23,10.44
201010,4.01,8.08,4.42,3.50,1.65,8.97,2.91,-1.06,4.72,7.59,5.02,9.46,5.56,1.85,2.42,1.69,4.48
201011,-0.24,5.94,5.30,8.17,5.28,0.65,-4.04,2.30,2.95,3.37,2.17,7.58,0.72,-1.31,5.08,-0.27,-1.10
201012,4.55,14.12,9.12,0.51,6.72,7.30,4.48,13.47,13.38,6.06,6.13,7.39,3.92,3.23,1.70,10.46,6.86
201101,-1.95,-4.91,7.27,-3.66,-1.96,1.72,-1.76,3.80,3.96,2.45,5.07,-0.31,2.60,2.21,-1.81,2.53,2.27
201102,2.89,0.93,8.01,8.24,3.94,5.02,3.77,2.29,4.13,2.67,2.62,-2.49,1.87,2.42,1.68,2.78,4.17
201103,2.22,2.65,1.44,-4.05,5.27,1.65,1.86,1.77,0.86,3.02,-1.02,2.12,2.93,0.92,0.98,-1.54,0.46
201104,4.11,-1.21,1.50,7.68,4.99,5.14,6.44,-0.33,0.69,2.08,2.66,1.74,2.98,4.01,5.17,0.67,3.11
201105,1.85,-5.82,-4.52,1.20,0.63,-3.61,3.03,-3.08,-5.38,-0.22,-2.89,-2.70,-0.81,1.31,1.42,-2.63,-1.01
201106,-0.01,-3.24,-2.36,2.35,0.80,-0.36,-2.68,-2.54,-2.07,-0.89,-2.90,-0.89,0.08,-0.34,-0.85,-2.29,-1.45
201107,-1.64,0.31,0.58,1.47,-6.18,-1.65,-1.97,-5.69,-6.10,-5.85,-2.34,-8.66,-5.63,-0.33,0.70,-3.91,-2.84
201108,-0.28,-6.72,-9.89,-5.69,-7.15,-8.58,0.05,-7.14,-13.23,-6.27,-7.85,-9.34,-6.88,0.14,-2.60,-9.31,-6.05
201109,-4.44,-24.47,-12.07,-5.53,-13.22,-16.77,-3.35,-9.85,-20.70,-12.18,-7.77,-11.77,-7.31,-1.62,-3.24,-11.18,-6.14
201110,3.96,21.99,16.32,16.39,16.55,18.36,4.84,15.63,18.52,17.87,14.69,21.04,13.03,6.66,8.81,13.53,10.16
201111,0.56,-0.09,1.93,-3.16,-3.35,-1.19,2.74,6.05,-1.49,3.82,-1.16,-7.13,1.68,0.98,0.15,-3.14,-0.62
201112,2.08,-8.36,-0.23,-1.94,-2.61,0.26,3.85,3.06,-5.31,-4.01,-1.48,-0.91,0.88,2.93,0.40,1.71,0.57
201201,0.67,12.00,1.48,8.44,10.26,11.48,-1.00,7.62,12.50,6.28,11.53,12.63,5.73,-2.71,3.40,7.31,5.49
201202,0.99,-4.35,6.75,6.34,10.27,1.89,3.38,5.68,-2.62,5.50,6.80,5.11,1.70,1.30,2.88,6.50,4.10
201203,3.70,-7.11,-3.13,1.27,4.49,2.47,3.89,5.50,0.58,1.98,3.58,0.77,0.44,1.45,4.76,6.95,3.31
201204,0.84,-0.73,-1.67,-0.04,-4.11,0.34,0.25,2.12,-3.86,-1.11,-2.69,-3.85,0.63,1.73,1.11,-2.32,-0.39
201205,-1.50,-14.58,-10.56,-5.67,-11.17,-9.53,-2.82,-6.67,-14.71,-5.49,-8.46,-7.78,-4.79,-0.91,-1.57,-9.17,-5.36
201206,3.85,6.35,5.84,-9.12,-0.72,3.26,4.92,6.12,4.85,0.00,1.38,-6.03,3.50,3.40,1.99,4.61,5.17
201207,0.39,-3.95,3.10,2.22,1.60,-1.28,3.20,-2.58,-1.90,2.87,1.18,0.38,-0.51,3.63,1.60,-1.06,0.23
201208,-0.14,4.99,2.40,7.76,6.26,2.79,-0.37,8.59,-1.20,0.77,5.53,3.00,0.54,-3.41,2.59,4.20,2.53
201209,1.16,8.37,3.59,-0.69,3.06,1.06,3.01,6.19,3.49,3.93,-0.44,2.75,-0.84,2.12,1.78,3.58,4.26
201210,-1.81,3.24,-2.17,-0.29,2.05,-2.66,-0.56,2.59,-0.52,-0.73,-5.99,6.06,2.18,1.08,-2.44,1.91,-3.47
201211,3.35,-3.61,-1.32,3.91,2.14,2.53,0.87,5.04,2.51,2.78,1.67,3.81,1.39,-3.83,0.64,-0.88,1.44
201212,-1.25,0.56,1.20,-0.23,0.74,4.41,-1.67,0.81,6.78,1.66,0.14,5.75,2.78,0.10,-1.40,4.31,1.58
201301,5.34,0.28,8.06,3.62,8.37,4.97,8.10,8.08,4.31,5.28,0.76,3.22,5.43,5.05,4.67,6.24,6.32
201302,3.96,-7.14,0.48,2.10,2.35,-0.73,2.04,-0.14,-2.93,2.95,-0.12,0.51,3.43,2.09,0.43,1.30,1.82
201303,6.27,1.18,2.18,3.88,5.86,1.09,4.69,3.09,2.10,3.87,2.10,3.64,5.45,5.55,5.35,4.50,4.13
201304,2.87,-9.65,-1.31,7.59,0.28,3.28,3.22,1.38,-4.18,-1.85,-0.76,4.48,0.13,5.12,3.20,1.61,1.81
201305,-2.94,0.45,3.04,1.01,2.99,3.29,-0.07,6.14,3.89,6.10,5.43,8.90,4.60,-7.09,1.16,7.05,2.60
201306,0.79,-11.01,-2.12,1.77,-2.33,-4.77,-0.96,-4.21,-5.01,-2.43,-3.81,0.21,-0.01,0.78,0.32,-0.45,-1.18
201307,3.14,1.93,4.95,2.50,4.05,6.73,6.53,3.49,6.90,8.17,6.76,8.53,5.61,4.60,5.85,6.43,5.29
201308,-3.83,3.56,-1.88,-2.26,-2.48,-0.40,-3.97,-3.78,-2.91,-1.42,-0.45,-1.02,-2.61,-4.43,-4.62,-4.21,-1.97
201309,0.70,1.95,2.02,7.60,6.37,5.02,2.97,5.54,7.92,5.64,3.31,5.96,6.67,1.05,4.40,3.05,4.86
201310,4.93,5.64,4.80,3.05,5.78,3.30,4.90,3.10,7.94,4.83,4.25,0.53,5.12,3.35,5.54,3.30,3.96
201311,1.49,-4.32,0.89,6.82,3.92,1.44,3.78,1.64,1.50,2.85,3.27,1.62,4.94,-1.81,4.07,6.07,2.61
201312,2.48,6.62,2.92,1.99,2.91,5.53,0.47,3.61,6.73,3.97,3.99,1.43,2.92,1.79,-0.78,2.61,3.56
Code for the example:
#!/usr/bin/env python
from datetime import datetime

from pandas import Series, DataFrame
import pandas as pd
import numpy as np


def cumret(ret, beg, end):
    return pd.rolling_sum(np.log(1 + ret), window=beg - end + 1).shift(end)


if __name__ == '__main__':
    parse = lambda x: datetime(int(x[0:4]), int(x[4:6]), 1)
    ind = pd.read_csv('data/test.csv', index_col='caldt',
                      parse_dates='caldt', date_parser=parse)
    ind = ind.stack().reset_index()
    ind.columns = ['caldt', 'industry', 'ret']
    ind['ret'] = ind['ret'] / 100
    ind.set_index(['caldt', 'industry'], inplace=True)
    ind['rmom'] = ind['ret'].groupby(level='industry').apply(cumret, 12, 2)
    ind = ind[ind.rmom.notnull()]
    print ind.head(20)
    print pd.fama_macbeth(y=ind['ret'], x=ind.ix[:, ['rmom']])
The output
ret rmom
caldt industry
2011-01-01 Food -0.0195 0.087866
Mines -0.0491 0.219223
Oil 0.0727 0.086754
Clths -0.0366 0.290310
Durbl -0.0196 0.207907
Chems 0.0172 0.238446
Cnsum -0.0176 0.041910
Cnstr 0.0380 0.063438
Steel 0.0396 -0.045276
FabPr 0.0245 0.242847
Machn 0.0507 0.149602
Cars -0.0031 0.399784
Trans 0.0260 0.181024
Utils 0.0221 0.046779
Rtail -0.0181 0.149441
Finan 0.0253 0.002477
Other 0.0227 0.085414
2011-02-01 Food 0.0289 0.154096
Mines 0.0093 0.488499
Oil 0.0801 0.221438
[20 rows x 2 columns]
----------------------Summary of Fama-MacBeth Analysis-------------------------
Formula: Y ~ rmom + intercept
# betas : 36
----------------------Summary of Estimated Coefficients------------------------
   Variable       Beta    Std Err     t-stat    CI 2.5%   CI 97.5%
     (rmom)     0.0051     0.0208       0.24    -0.0356     0.0458
(intercept)     0.0148     0.0065       2.27     0.0020     0.0276
--------------------------------End of Summary---------------------------------
Note, the reported number of estimated time-series coefficients (# betas) is 36 (in the first stage, 36 cross-sectional regressions are estimated). There are 48 months, but I lost 12 months for each industry computing the return from t-12 to t-2.
For comparison here are the F-M results if I estimate the same thing using data from 1926-2013:
----------------------Summary of Fama-MacBeth Analysis-------------------------
Formula: Y ~ rmom + intercept
# betas : 1038
----------------------Summary of Estimated Coefficients------------------------
   Variable       Beta    Std Err     t-stat    CI 2.5%   CI 97.5%
     (rmom)     0.0206     0.0039       5.24     0.0129     0.0284
(intercept)     0.0061     0.0015       4.07     0.0032     0.0090
--------------------------------End of Summary---------------------------------
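An aside on the cumret helper in the script above: pd.rolling_sum was deprecated and later removed from pandas, so in current versions the same cumulative log return would be written with Series.rolling. A sketch with toy returns (the data and lag choices here are illustrative):

```python
import numpy as np
import pandas as pd

def cumret(ret, beg, end):
    # cumulative log return over months t-beg .. t-end
    return np.log(1 + ret).rolling(window=beg - end + 1).sum().shift(end)

r = pd.Series([0.01, 0.02, -0.01, 0.03, 0.02, 0.01])
mom = cumret(r, 3, 1)  # 3-month return, skipping the most recent month
print(mom)
```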
@kdiether Sorry, I forgot to reply here. Thanks for the example. Now that we know fama_macbeth works and where to find a reference implementation for the unit tests, I don't see any problems including it in statsmodels.
Just a question of manpower and who is doing the work.
let's deprecate some of these in 0.18.0
related #5884