Thanks for the in-depth explanation -- helps get me up to speed.
First, you're getting nonsensical results from pyblp when clustering because you're defining clustering IDs after passing your data to the problem. You need to do this beforehand so that the problem knows what the clusters are. This is why you're getting things like "0 Clusters" in your output. (I should probably raise an error when this happens.)
Once you fix that and also set rho_bounds=(None, None), you should get something closer to what ivreghdfe is returning. Disabling rho bounds helps here because by default rho is bounded above by 0.99 (values above this can create numerical issues in most situations), but here you actually get an estimate above this.
However, this still gives you something a bit different for the second step. It turns out this is because by default, pyblp "centers moments," while ivreghdfe doesn't. Asymptotically this shouldn't matter, but it can change things a bit in finite samples. If you also set center_moments=False, you'll get essentially the same results, e.g. I get a rho estimate of +9.911108E-01 and a price coefficient of -7.571482E-01. I'd chalk up remaining differences to numerical issues, e.g. because we optimize over rho, whereas with ivreghdfe you're using the trick of turning this into a linear IV regression. There may also be some finite-sample degrees-of-freedom corrections to e.g. standard errors in ivreghdfe that we don't do in pyblp. Again, this doesn't matter asymptotically, but it will have some finite-sample implications.
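Putting those pieces together, the setup I have in mind looks roughly like this (just a sketch against the Nevo example data; treating city_ids as the cluster variable is my assumption for illustration, not a prescription):

import numpy as np
import pandas as pd
import pyblp

df = pd.read_csv(pyblp.data.NEVO_PRODUCTS_LOCATION)
df['nesting_ids'] = 1                    # put all products in a single nest
df['clustering_ids'] = df['city_ids']    # define cluster IDs BEFORE building the problem
groups = df.groupby(['market_ids', 'nesting_ids'])
df['demand_instruments20'] = groups['shares'].transform(np.size)  # number of products in the nest

problem = pyblp.Problem(pyblp.Formulation('0 + prices'), df)
results = problem.solve(rho=0.7,
                        rho_bounds=(None, None),   # allow rho estimates above the default 0.99 bound
                        method='2s',
                        center_moments=False,      # uncentered weighting matrix, like ivreghdfe
                        W_type='clustered',
                        se_type='clustered')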
Hope this helps!
Thank you so much for such detailed explanations! This makes a lot of sense.
I redid the estimation in 1.1 (without clustered SEs) and 1.2 (with clustered SEs) in PyBLP, this time specifying rho_bounds=(None, None) and center_moments=False. The results for 1.2 are the same as those in Stata and Matlab. For 1.1, the rho estimate in PyBLP is essentially the same as that in Stata. However, the coefficient on prices in 1.1 (beta_price_hat = -1.186) is still a bit off compared with those from Stata and Matlab (beta_price_hat = -1.3293), although it is slightly closer to the Stata benchmark than my earlier attempt (beta_price_hat = -1.173), where I didn't specify the two options. Just to confirm, are there any other ways to improve the estimate of beta_price, or did I miss something?
Please find the details below.
Thank you!
## 1.1: Not clustering SEs
import numpy as np
import pandas as pd
import pyblp

nevo_product_data = pd.read_csv(pyblp.data.NEVO_PRODUCTS_LOCATION)
df = nevo_product_data.copy()
df['nesting_ids'] = 1
groups = df.groupby(['market_ids', 'nesting_ids'])
df['demand_instruments20'] = groups['shares'].transform(np.size)
nl_formulation = pyblp.Formulation('0 + prices')
problem = pyblp.Problem(nl_formulation, df)
df.to_csv(root + 'nevo_product_data.csv')  # root is an output directory defined elsewhere
nl_results1 = problem.solve(rho=0.7,
                            rho_bounds=(None, None),
                            method='2s',
                            center_moments=False)
nl_results1
and the results are:
Problem Results Summary:
======================================================================================
GMM Objective Gradient Clipped Weighting Matrix Covariance Matrix
Step Value Norm Hessian Shares Condition Number Condition Number
---- ---------- ---------- ---------- ------- ---------------- -----------------
2 +1.865E+02 +7.294E-10 +1.074E+04 0 +1.994E+09 +3.029E+04
======================================================================================
Cumulative Statistics:
=================================================
Computation Optimizer Optimization Objective
Time Converged Iterations Evaluations
----------- --------- ------------ -----------
00:00:03 Yes 2 8
=================================================
Rho Estimates (Robust SEs in Parentheses):
============
All Groups
------------
+9.826E-01
(+1.358E-02)
============
Beta Estimates (Robust SEs in Parentheses):
============
prices
------------
-1.186E+00
(+3.973E-01)
============
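For completeness, the clustered variant (1.2) only differs in defining cluster IDs before building the problem and requesting clustered moments and SEs. Continuing from the code above, it is set up along these lines (a sketch; I'm assuming city-level clusters via the city_ids column):

df['clustering_ids'] = df['city_ids']   # must exist before pyblp.Problem is constructed
problem_cl = pyblp.Problem(nl_formulation, df)
nl_results2 = problem_cl.solve(rho=0.7,
                               rho_bounds=(None, None),
                               method='2s',
                               center_moments=False,
                               W_type='clustered',
                               se_type='clustered')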
For sure!
You're still getting slightly different answers because by default, ivreg2 (and most of Stata) does not use "robust" covariance matrices, and instead computes covariance matrices under the assumption of homoskedasticity. If you add robust to the end of your ivreghdfe command, you should get the same answer as pyblp.
By default pyblp doesn't make an assumption about homoskedasticity. But if you want to assume this, you could use se_type='unadjusted' and W_type='unadjusted' instead of the default 'robust'.
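In pyblp terms, that would look something like the following (a sketch reusing the problem object built above):

nl_results_homoskedastic = problem.solve(rho=0.7,
                                         rho_bounds=(None, None),
                                         method='2s',
                                         center_moments=False,
                                         W_type='unadjusted',   # homoskedastic weighting matrix
                                         se_type='unadjusted')  # homoskedastic standard errors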
Thank you for the timely reply.
Yes! I got the same answers after making the two packages use the same homoskedasticity assumption.
This is very helpful! Really appreciate it.
Glad to help!
Hi,
First things first, I want to thank you for this awesome package!
While using it, I ran into a question when trying to use two-step GMM in PyBLP to replicate the two-step efficient GMM results from Stata in the nested logit case. I thought they would be the same because there is no ambiguity in theory: the formula is explicit, and on paper the two packages appear to be on the same page. However, they yield quite different estimates.
Although the weighting matrix is defined in the same way according to the documentation (possibly differing by a scalar, which shouldn't matter), I suspect it is calculated differently numerically. But I am not sure this is the reason.
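For concreteness, the two-step efficient weighting matrix I have in mind is the usual heteroskedasticity-robust one, built from the instruments z_jt and the estimated structural errors xi_jt (my own notation, not copied from either package's documentation):

$$\hat{W} = \left( \frac{1}{N} \sum_{j,t} z_{jt} z_{jt}' \, \hat{\xi}_{jt}^{\,2} \right)^{-1}$$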
Any insights on why they differ would be highly appreciated. Thanks!
Details are as follows (I also hard-coded the problem in Matlab; the Matlab results are the same as those from Stata, so I don't include the Matlab details here):
1.1. If the SEs are not clustered at any level, the results are: [results output omitted]
1.2. If the SEs are clustered at the city level, we fail to get the estimates for 'prices' due to the warning that "Detected that the estimated covariance matrix of aggregate GMM moments is singular with condition number +INF."
2.1. If the SEs are not clustered at any level, the results are: [results output omitted]
2.2. If the SEs are clustered at the city level, the results are: [results output omitted]
As you can see, the results are quite different. However, when I compare one-step GMM in PyBLP with 2SLS in Stata, they yield the same results. I am confused about why the difference arises with two-step GMM even though, in theory, the two are on the same page.