Open gcasamat opened 3 years ago
Yes, that is the expected result. Please let us know if you get something else.
Thank you. I will check carefully and tell you if I find some discrepancy.
I compared the LinearDML "strategy" described above with OLS results from StatsModels. The coefficient on my (binary) treatment variable (pro_rcs) is 4.53 with LinearDML whereas it is 0.1172 with sm.OLS
Here is the code:
est_linear = LinearDML(
    model_y = StatsModelsLinearRegression({'method' : 'qr'}),
    model_t = StatsModelsLinearRegression({'method' : 'qr'}),
    cv = 1,
    discrete_treatment = False,
    fit_cate_intercept = True,
    linear_first_stages = False,
    random_state = 123)
est_linear.fit(Y.values.ravel(), T.values.ravel(), X = None, W = data_for_reg[dum_varlist + cont_varlist + month_dum_list + ['cons']])
results = est_linear.const_marginal_effect_inference()
results.summary_frame(alpha = 0.05, value = 0, decimals = 3)
point_estimate stderr zstat pvalue ci_lower ci_upper
0 4.53 0.017 267.176 0.0 4.497 4.563
mod = sm.OLS(Y, pd.concat([T,data_for_reg[dum_varlist + cont_varlist + month_dum_list + ['cons']]], axis = 1))
res = mod.fit(method = 'qr')
print(res.summary())
OLS Regression Results
==============================================================================
Dep. Variable: log_rate R-squared: 0.603
Model: OLS Adj. R-squared: 0.603
Method: Least Squares F-statistic: nan
Date: Sun, 07 Feb 2021 Prob (F-statistic): nan
Time: 16:42:41 Log-Likelihood: -18385.
No. Observations: 38616 AIC: 3.677e+04
Df Residuals: 38615 BIC: 3.678e+04
Df Model: 0
Covariance Type: nonrobust
======================================================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------------------------------
pro_rcs 0.1172 0.006 19.414 0.000 0.105 0.129
PropertyType_Bed & Breakfast 0.2689 0.012 21.652 0.000 0.245 0.293
PropertyType_Bungalow -0.1216 0.019 -6.443 0.000 -0.159 -0.085
PropertyType_Chalet -0.1288 0.017 -7.500 0.000 -0.162 -0.095
PropertyType_Condominium -0.0011 0.012 -0.089 0.929 -0.025 0.022
PropertyType_Guesthouse 0.2735 0.019 14.629 0.000 0.237 0.310
PropertyType_House 0.0561 0.005 11.278 0.000 0.046 0.066
PropertyType_Other -0.0153 0.012 -1.326 0.185 -0.038 0.007
PropertyType_Vacation home 0.0689 0.020 3.425 0.001 0.029 0.108
PropertyType_Villa 0.2174 0.008 26.974 0.000 0.202 0.233
ListingType_Private room -0.2652 0.008 -33.330 0.000 -0.281 -0.250
ListingType_Shared room -1.0105 0.042 -23.825 0.000 -1.094 -0.927
Superhost_Yes 0.0112 0.007 1.627 0.104 -0.002 0.025
CancellationPolicy_Moderate -0.0096 0.007 -1.325 0.185 -0.024 0.005
CancellationPolicy_Strict 0.0995 0.006 18.015 0.000 0.089 0.110
CancellationPolicy_Super strict 30 0.3391 0.028 12.124 0.000 0.284 0.394
CancellationPolicy_Super strict 60 0.2566 0.052 4.952 0.000 0.155 0.358
InstantbookEnabled_Yes -0.0422 0.004 -9.585 0.000 -0.051 -0.034
listing_age 0.0010 0.000 5.800 0.000 0.001 0.001
Bedrooms 0.1387 0.003 40.296 0.000 0.132 0.145
Bathrooms 0.2177 0.004 57.264 0.000 0.210 0.225
MaxGuests 0.0241 0.002 13.538 0.000 0.021 0.028
NumberofPhotos 0.0044 0.000 22.379 0.000 0.004 0.005
host_seniority 5.476e-21 7.71e-22 7.099 0.000 3.96e-21 6.99e-21
ResponseTime 5.849e-07 1.42e-07 4.130 0.000 3.07e-07 8.62e-07
ResponseRate -0.0004 0.000 -2.452 0.014 -0.001 -8.66e-05
OverallRating 0.0942 0.005 18.196 0.000 0.084 0.104
MinimumStay 0.0005 0.000 2.041 0.041 2.14e-05 0.001
month_2 -0.0185 0.020 -0.940 0.347 -0.057 0.020
month_3 -0.0159 0.018 -0.864 0.388 -0.052 0.020
month_4 0.0655 0.015 4.277 0.000 0.036 0.096
month_5 0.1287 0.015 8.629 0.000 0.099 0.158
month_6 0.2573 0.015 17.689 0.000 0.229 0.286
month_7 0.4868 0.014 33.991 0.000 0.459 0.515
month_8 0.5283 0.014 36.958 0.000 0.500 0.556
month_9 0.2185 0.015 14.780 0.000 0.189 0.247
month_10 0.0588 0.016 3.772 0.000 0.028 0.089
month_11 -0.0015 0.018 -0.085 0.932 -0.037 0.034
month_12 -0.0092 0.018 -0.498 0.618 -0.045 0.027
cons 3.2495 0.033 98.208 0.000 3.185 3.314
==============================================================================
Omnibus: 4627.155 Durbin-Watson: 0.621
Prob(Omnibus): 0.000 Jarque-Bera (JB): 35107.433
Skew: -0.325 Prob(JB): 0.00
Kurtosis: 7.626 Cond. No. 7.81e+19
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.81e+19. This might indicate that there are
strong multicollinearity or other numerical problems.
I believe you might be having collinearity problems, in which case I believe the equivalence breaks. The two methods break ties among projection solutions in different ways (i.e. they regularize differently).
I also tried the experiment with some synthetic data and I am getting the same result, though I didn't use the method='qr' spec. I don't think that would be a problem.
Oh that method thing might actually be the problem. I believe the equivalence might still hold if you use pinv (i.e. minimum norm solution). So I would not specify the method attribute.
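To illustrate the tie-breaking point, here is a minimal NumPy sketch on made-up synthetic data (the variable names and the data-generating process are assumptions for illustration, not the data from this issue). With an exactly collinear design, the minimum-norm solution from pinv and another least-squares solution give identical fitted values but different coefficients, so the coefficient on the treatment column is not pinned down by the fit alone.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
t = rng.normal(size=n)  # hypothetical "treatment" column
w = rng.normal(size=n)  # hypothetical control column

# Third column is an exact linear combination of the first two,
# so the design matrix has rank 2 instead of 3.
X = np.column_stack([t, w, t + w])
y = 1.0 * t + 2.0 * w + rng.normal(scale=0.1, size=n)

# Minimum-norm least-squares solution (what a pseudo-inverse returns).
beta_pinv = np.linalg.pinv(X) @ y

# Another least-squares solution: move along the null-space direction
# (1, 1, -1), which X maps to zero.
null_dir = np.array([1.0, 1.0, -1.0])
beta_other = beta_pinv + 0.5 * null_dir

fit_pinv = X @ beta_pinv
fit_other = X @ beta_other

print("pinv coefficients: ", beta_pinv)
print("other coefficients:", beta_other)
print("same fitted values:", np.allclose(fit_pinv, fit_other))
```

Which of these solutions a given solver returns depends on how it handles the rank deficiency, which is why two routines can disagree on individual coefficients while agreeing on the fit.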
I tried without specifying the method and obtained the same result from LinearDML:
point_estimate stderr zstat pvalue ci_lower ci_upper
0 4.53 0.017 267.176 0.0 4.497 4.563
I forgot to mention that I obtain the following message after fit (don't know if it is helpful):
Co-variance matrix is undertermined. Inference will be invalid!
Yes, I have a multicollinearity issue. That's the reason why I used the 'qr' method in StatsModels. It lets me reproduce the output of regressions in Stata. Without specifying this method, I get "crazy" results with StatsModels.
Do you have any suggestions for dealing with this multicollinearity? The only one I am aware of is to drop variables with a large VIF. Otherwise, in my experience, using the QR decomposition gives reasonable coefficient estimates. That's why I used this method.
I would try using LinearDML with LassoCV (the default) as residualizers (or ElasticNetCV). That should take care of the multicollinearity. Also, definitely use cv=3 or 5.
The issue in LinearDML is most probably that your residuals are exactly zero due to overfitting, so the result is nonsense.
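A rough sketch of what cross-fitted residualization with a regularized first stage does, written by hand with scikit-learn on synthetic data. This is not EconML's internal implementation; the data, the true effect of 0.7, and all variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold, cross_val_predict

rng = np.random.default_rng(123)
n = 600

# Hypothetical controls, with one exactly collinear column appended.
W = rng.normal(size=(n, 5))
W = np.column_stack([W, W[:, 0] + W[:, 1]])

# Hypothetical treatment and outcome; the assumed true effect is 0.7.
T = 0.8 * W[:, 0] - 0.5 * W[:, 2] + rng.normal(size=n)
Y = 0.7 * T + W[:, 0] + 0.5 * W[:, 3] + rng.normal(size=n)

# Out-of-fold predictions: each observation is predicted by a model
# that was fit on the other folds (the analogue of cv=3).
folds = KFold(n_splits=3, shuffle=True, random_state=123)
Y_hat = cross_val_predict(LassoCV(cv=3), W, Y, cv=folds)
T_hat = cross_val_predict(LassoCV(cv=3), W, T, cv=folds)

# Final stage: regress outcome residuals on treatment residuals.
Y_res = Y - Y_hat
T_res = T - T_hat
theta = (T_res @ Y_res) / (T_res @ T_res)

print("residual-on-residual estimate:", theta)
print("std of treatment residuals:   ", T_res.std())
```

Because the residuals come from held-out folds and a penalized fit, the treatment residuals keep real variation even with the collinear column, and the final ratio stays well defined.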
I tried what you suggested and LinearDML indeed gives results I believe in.
What I want to do is to compare the outcome of a standard linear regression with the DML approach. As a consistency check, I wanted to be able to replicate the linear regression outcome with the DML approach (by adopting an appropriate parametrization). Following your comments, this does not seem to be possible unless I fix the multicollinearity issue in my data.
Many thanks for your help.
I would like to know if it is possible to recover the estimates of a simple linear regression. I would have thought that fitting a LinearDML algorithm with (1) LinearRegression() as the model_y and model_t and (2) setting cv = 1, would reproduce the OLS estimates. Is this correct? Thanks
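For the full-rank case, the equivalence asked about here is the Frisch-Waugh-Lovell result: regressing the residualized outcome on the residualized treatment gives the same coefficient as the treatment coefficient in the full OLS regression. A minimal NumPy sketch on synthetic data (names, coefficients, and the continuous treatment are assumptions for illustration; this mimics in-sample linear residualization by hand, it does not call EconML):

```python
import numpy as np

rng = np.random.default_rng(123)
n = 1000

# Hypothetical full-rank controls, including a constant column.
W = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
T = W @ np.array([0.5, 1.0, -1.0, 0.3]) + rng.normal(size=n)
Y = 0.7 * T + W @ np.array([3.0, 0.2, 0.4, -0.6]) + rng.normal(size=n)

# (a) Full OLS of Y on [T, W]; the first coefficient is the treatment effect.
coef_full, *_ = np.linalg.lstsq(np.column_stack([T, W]), Y, rcond=None)
theta_ols = coef_full[0]


def residualize(v, controls):
    """Return v minus its in-sample OLS projection on the controls."""
    beta, *_ = np.linalg.lstsq(controls, v, rcond=None)
    return v - controls @ beta


# (b) Partial out W from both Y and T (no sample splitting),
#     then regress residual on residual.
Y_res = residualize(Y, W)
T_res = residualize(T, W)
theta_fwl = (T_res @ Y_res) / (T_res @ T_res)

print("full OLS coefficient on T:", theta_ols)
print("residual-on-residual:     ", theta_fwl)
```

The two numbers agree up to floating-point error when the design has full column rank. With exact collinearity among the regressors, the first-stage projections are no longer uniquely parametrized and solver choices can start to matter.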