jeffgortmaker / pyblp

BLP Demand Estimation with Python
https://pyblp.readthedocs.io
MIT License

Nested Logit: two-step from PyBLP doesn't match two-step efficient GMM from Stata #121

Closed: QWangPKU closed this issue 2 years ago

QWangPKU commented 2 years ago

Hi,

First things first, I want to thank you for this awesome package!

While using it, I ran into a question when trying to use two-step GMM in PyBLP to replicate the two-step efficient GMM results from Stata in the nested logit case. I expected them to be the same: there is no ambiguity in theory, the formula is explicit, and on paper the two packages appear to be on the same page. But they yield quite different estimates.

Although the weighting matrix is defined in the same way according to the documentation (perhaps up to a scalar, which shouldn't matter), I suspect it is computed differently numerically. But I am not sure whether this is the reason.

I would highly appreciate any insight into why they differ. Thanks!

Details are as follows. (I also hard-coded the problem in Matlab; the Matlab results match those from Stata, so I omit the Matlab details here.)

  1. In PyBLP, I use the fake cereal product data from Nevo (2000), loaded via pyblp.data.NEVO_PRODUCTS_LOCATION. To construct the nest, I treat all inside goods as one nest. log_sj_s0 is my L.H.S. variable, and my X1 consists of prices and log_share_of_nest. There is no constant in my specification, and I set method='2s'. I tried two cases: SEs not clustered, and SEs clustered at the city level.
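
With all inside goods in a single nest, this setup corresponds to Berry's (1994) linear inversion of the nested logit (here with prices as the only X1 variable and no constant):

$$\ln s_{jt} - \ln s_{0t} = \alpha\, p_{jt} + \rho \ln s_{j|g,t} + \xi_{jt}$$

which is why the same model can also be estimated as a linear IV regression, as in the Stata code in part 2 below.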

1.1. SEs not clustered at any level.

import numpy as np
import pandas as pd
import pyblp

# Load Nevo's fake cereal data and put all inside goods in a single nest.
nevo_product_data = pd.read_csv(pyblp.data.NEVO_PRODUCTS_LOCATION)
df = nevo_product_data.copy()
df['nesting_ids'] = 1

# Add the number of products in each market-nest as an extra instrument.
groups = df.groupby(['market_ids', 'nesting_ids'])
df['demand_instruments20'] = groups['shares'].transform(np.size)

nl_formulation = pyblp.Formulation('0 + prices')
problem = pyblp.Problem(nl_formulation, df)
df.to_csv(root + 'nevo_product_data.csv')  # root is a predefined output path
nl_results1 = problem.solve(rho=0.7, method='2s')

The results are:

Problem Results Summary:
=========================================================================================
GMM   Objective     Projected     Reduced    Clipped  Weighting Matrix  Covariance Matrix
Step    Value     Gradient Norm   Hessian    Shares   Condition Number  Condition Number 
----  ----------  -------------  ----------  -------  ----------------  -----------------
 2    +2.033E+02   +1.620E-09    +1.074E+04     0        +2.005E+09        +3.028E+04    
=========================================================================================

Cumulative Statistics:
=================================================
Computation  Optimizer  Optimization   Objective 
   Time      Converged   Iterations   Evaluations
-----------  ---------  ------------  -----------
 00:00:03       Yes          3             8     
=================================================

Rho Estimates (Robust SEs in Parentheses):
============
 All Groups 
------------
 +9.826E-01 
(+1.358E-02)
============

Beta Estimates (Robust SEs in Parentheses):
============
   prices   
------------
 -1.173E+00 
(+3.971E-01)
============

1.2. SEs clustered at the city level.

df = nevo_product_data.copy()
df['nesting_ids'] = 1
groups = df.groupby(['market_ids', 'nesting_ids'])
df['demand_instruments20'] = groups['shares'].transform(np.size)

nl_formulation = pyblp.Formulation('0 + prices')
problem = pyblp.Problem(nl_formulation, df)
df['clustering_ids'] = df['city_ids']
nl_results1 = problem.solve(rho=0.7, W_type='clustered',
                            se_type='clustered', method='2s')
nl_results1

We fail to get an estimate for prices, with the warning "Detected that the estimated covariance matrix of aggregate GMM moments is singular with condition number +INF."

Problem Results Summary:
=========================================================================================
GMM   Objective     Projected     Reduced    Clipped  Weighting Matrix  Covariance Matrix
Step    Value     Gradient Norm   Hessian    Shares   Condition Number  Condition Number 
----  ----------  -------------  ----------  -------  ----------------  -----------------
 2    +0.000E+00   +0.000E+00    +0.000E+00     0           +INF              +INF       
=========================================================================================

Cumulative Statistics:
=================================================
Computation  Optimizer  Optimization   Objective 
   Time      Converged   Iterations   Evaluations
-----------  ---------  ------------  -----------
 00:00:02       Yes          2             6     
=================================================

Rho Estimates (Robust SEs Adjusted for 0 Clusters in Parentheses):
============
 All Groups 
------------
 +9.825E-01 
(+0.000E+00)
============

Beta Estimates (Robust SEs Adjusted for 0 Clusters in Parentheses):
============
   prices   
------------
 +0.000E+00 
(+0.000E+00)
============
  2. In Stata, I use the same data and the same specification. In particular, I specify the two-step efficient GMM estimation method.

import delimited $root/nevo_product_data.csv, varnames(1) case(preserve) colrange(1) encoding(utf-8) clear

* Construct the nested logit dependent variable and the within-nest share.
bys market_ids: egen in_share = sum(shares)
gen out_share = 1 - in_share
gen log_sj_s0 = log(shares) - log(out_share)
gen log_share_of_nest = log(shares) - log(in_share)

global ivs demand_instruments*

2.1. SEs not clustered at any level.

ivreghdfe log_sj_s0 (prices log_share_of_nest = $ivs), gmm2s noconstant

The results are:

        log_sj_s0 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
------------------+----------------------------------------------------------------
           prices |  -1.329311   .4116299    -3.23   0.001     -2.13609   -.5225308
log_share_of_nest |   .9824626   .0140987    69.68   0.000     .9548297    1.010096

2.2. SEs clustered at the city level.

ivreghdfe log_sj_s0 (prices log_share_of_nest = $ivs), gmm2s noconstant cluster(city_ids) 

The results are:

        log_sj_s0 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
------------------+----------------------------------------------------------------
           prices |  -.7571483   .2432502    -3.11   0.002     -1.23391   -.2803867
log_share_of_nest |   .9911108   .0148463    66.76   0.000     .9620125    1.020209

As you can see, the results are quite different. Yet when I compare one-step GMM in PyBLP with 2SLS in Stata, they yield the same results. I am confused about why the difference arises with two-step GMM when, in theory, the two are on the same page.

jeffgortmaker commented 2 years ago

Thanks for the in-depth explanation -- helps get me up to speed.

First, you're getting nonsensical results from pyblp when clustering because you're defining clustering IDs after passing your data to the problem. You need to do this beforehand so that the problem knows what the clusters are. This is why you're getting things like "0 Clusters" in your output. (I should probably raise an error when this happens.)

Once you fix that and also set rho_bounds=(None, None), you should get something closer to what ivreghdfe is returning. Disabling rho bounds helps here because by default rho is bounded above by 0.99 (values above this can create numerical issues in most situations), but here you actually get an estimate above this.

However, this still gives you something a bit different for the second step. It turns out this is because, by default, pyblp "centers moments," while ivreghdfe doesn't. Asymptotically this shouldn't matter, but it can change things a bit in finite samples. If you also set center_moments=False, you'll get essentially the same results; e.g., I get a rho estimate of +9.911108E-01 and a price coefficient of -7.571482E-01. I'd chalk up remaining differences to numerical stuff, e.g. because we optimize over rho, while with ivreghdfe you're using the trick of turning this into a linear IV regression. There may also be some finite-sample degrees-of-freedom corrections in ivreghdfe, e.g. to standard errors, that we don't do in pyblp. Again, this doesn't matter asymptotically, but it will have some finite-sample implications.
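
Putting the pieces together, a minimal sketch of the corrected setup (same data, column names, and options as elsewhere in this thread; the starting value rho=0.7 is from the original post):

import numpy as np
import pandas as pd
import pyblp

# Define nesting and clustering IDs BEFORE constructing the problem,
# so that pyblp knows what the clusters are.
df = pd.read_csv(pyblp.data.NEVO_PRODUCTS_LOCATION)
df['nesting_ids'] = 1
df['clustering_ids'] = df['city_ids']
groups = df.groupby(['market_ids', 'nesting_ids'])
df['demand_instruments20'] = groups['shares'].transform(np.size)

problem = pyblp.Problem(pyblp.Formulation('0 + prices'), df)
results = problem.solve(
    rho=0.7,
    rho_bounds=(None, None),  # the default upper bound of 0.99 binds here
    method='2s',
    center_moments=False,     # ivreghdfe doesn't center moments
    W_type='clustered',
    se_type='clustered',
)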

Hope this helps!

QWangPKU commented 2 years ago

Thank you so much for such detailed explanations! This makes a lot of sense.

I redid the estimation in 1.1 (without clustered SEs) and 1.2 (with clustered SEs) in PyBLP, this time specifying rho_bounds=(None, None) and center_moments=False. The results for 1.2 now match those from Stata and Matlab. For 1.1, the estimate of rho in PyBLP is essentially the same as in Stata. However, the coefficient on prices in 1.1 (beta_price_hat = -1.186) is still a bit off compared with Stata and Matlab (beta_price_hat = -1.3293), although it is slightly closer to the Stata benchmark than my earlier run (beta_price_hat = -1.173), where I didn't specify the two options. Just to confirm: is there any other way to improve the estimate of beta_price, or did I miss something?

Please find the details below.

Thank you!

# Not clustering SEs.
nevo_product_data = pd.read_csv(pyblp.data.NEVO_PRODUCTS_LOCATION)
df = nevo_product_data.copy()
df['nesting_ids'] = 1
groups = df.groupby(['market_ids', 'nesting_ids'])
df['demand_instruments20'] = groups['shares'].transform(np.size)

nl_formulation = pyblp.Formulation('0 + prices')
problem = pyblp.Problem(nl_formulation, df)
df.to_csv(root + 'nevo_product_data.csv')
nl_results1 = problem.solve(rho=0.7,
                            rho_bounds=(None, None),
                            method='2s',
                            center_moments=False)
nl_results1

and the results are:

Problem Results Summary:
======================================================================================
GMM   Objective    Gradient               Clipped  Weighting Matrix  Covariance Matrix
Step    Value        Norm      Hessian    Shares   Condition Number  Condition Number 
----  ----------  ----------  ----------  -------  ----------------  -----------------
 2    +1.865E+02  +7.294E-10  +1.074E+04     0        +1.994E+09        +3.029E+04    
======================================================================================

Cumulative Statistics:
=================================================
Computation  Optimizer  Optimization   Objective 
   Time      Converged   Iterations   Evaluations
-----------  ---------  ------------  -----------
 00:00:03       Yes          2             8     
=================================================

Rho Estimates (Robust SEs in Parentheses):
============
 All Groups 
------------
 +9.826E-01 
(+1.358E-02)
============

Beta Estimates (Robust SEs in Parentheses):
============
   prices   
------------
 -1.186E+00 
(+3.973E-01)
============

jeffgortmaker commented 2 years ago

For sure!

You're still getting slightly different answers because by default, ivreg2 (and most of Stata) does not use "robust" covariance matrices; instead, it computes covariance matrices under the assumption of homoskedasticity. If you add robust to the end of your ivreghdfe command, you should get the same answer as pyblp.

By default, pyblp doesn't make an assumption about homoskedasticity. But if you want to assume it, you can use se_type='unadjusted' and W_type='unadjusted' instead of the default 'robust'.
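
For example, a minimal sketch reusing the problem object from the snippets above:

# Impose homoskedasticity in pyblp, matching Stata's default (no `robust` option):
results_unadjusted = problem.solve(
    rho=0.7,
    rho_bounds=(None, None),
    method='2s',
    center_moments=False,
    W_type='unadjusted',
    se_type='unadjusted',
)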

QWangPKU commented 2 years ago

Thank you for the timely reply.

Yes! I got the same answers after making the two use the same homoskedasticity assumption.

This is very helpful! Really appreciate it.

jeffgortmaker commented 2 years ago

Glad to help!