py-why / EconML

ALICE (Automated Learning and Intelligence for Causation and Economics) is a Microsoft Research project aimed at applying Artificial Intelligence concepts to economic decision making. One of its goals is to build a toolkit that combines state-of-the-art machine learning techniques with econometrics in order to bring automation to complex causal inference problems. To date, the ALICE Python SDK (econml) implements orthogonal machine learning algorithms such as the double machine learning work of Chernozhukov et al. This toolkit is designed to measure the causal effect of some treatment variable(s) t on an outcome variable y, controlling for a set of features x.
https://www.microsoft.com/en-us/research/project/alice/

Recover linear regression results #401

Open gcasamat opened 3 years ago

gcasamat commented 3 years ago

I would like to know whether it is possible to recover the estimates of a simple linear regression. I would have thought that fitting LinearDML with (1) LinearRegression() as model_y and model_t and (2) cv = 1 would reproduce the OLS estimates. Is this correct? Thanks.

vsyrgkanis commented 3 years ago

Yes, that is the expected result. Please let us know if you get something else.
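
The expected equivalence is the partialling-out (Frisch-Waugh-Lovell) result: residualizing Y and T on the controls in-sample and regressing one residual on the other gives the same treatment coefficient as a single OLS fit. A minimal sketch in plain NumPy follows. It is not EconML's implementation; the synthetic data and the `residualize` helper are assumptions made for illustration, and the design is deliberately well conditioned.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical synthetic data: three controls plus a constant column
W = np.column_stack([rng.normal(size=(n, 3)), np.ones(n)])
T = W @ np.array([0.5, -0.2, 0.1, 0.3]) + rng.normal(size=n)
Y = 2.0 * T + W @ np.array([1.0, 0.5, -1.0, 0.7]) + rng.normal(size=n)


def residualize(v, W):
    """Residual of v after an in-sample least-squares fit on W."""
    coef, *_ = np.linalg.lstsq(W, v, rcond=None)
    return v - W @ coef


# Single OLS fit of Y on [T, W]; the first coefficient belongs to T
ols_coef, *_ = np.linalg.lstsq(np.column_stack([T, W]), Y, rcond=None)
theta_ols = ols_coef[0]

# Partialling out: regress the Y residual on the T residual (no intercept)
ry, rt = residualize(Y, W), residualize(T, W)
theta_fwl = (rt @ ry) / (rt @ rt)

print(theta_ols, theta_fwl)
```

The two numbers agree up to floating-point error only when the design matrix has full column rank, which is the condition that matters later in this thread.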

gcasamat commented 3 years ago

Thank you. I will check carefully and tell you if I find some discrepancy.

gcasamat commented 3 years ago

I compared the LinearDML "strategy" described above with OLS results from StatsModels. The coefficient on my (binary) treatment variable (pro_rcs) is 4.53 with LinearDML, whereas it is 0.1172 with sm.OLS.

Here is the code:

from econml.dml import LinearDML
from econml.sklearn_extensions.linear_model import StatsModelsLinearRegression

# First stages fitted in-sample (cv = 1); the controls in W include a constant column 'cons'
est_linear = LinearDML(
    model_y=StatsModelsLinearRegression({'method': 'qr'}),
    model_t=StatsModelsLinearRegression({'method': 'qr'}),
    cv=1,
    discrete_treatment=False,
    fit_cate_intercept=True,
    linear_first_stages=False,
    random_state=123)
est_linear.fit(Y.values.ravel(), T.values.ravel(), X=None,
               W=data_for_reg[dum_varlist + cont_varlist + month_dum_list + ['cons']])
results = est_linear.const_marginal_effect_inference()
results.summary_frame(alpha=0.05, value=0, decimals=3)
   point_estimate  stderr    zstat  pvalue  ci_lower  ci_upper
0            4.53   0.017  267.176     0.0     4.497     4.563
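import pandas as pd
import statsmodels.api as sm

# OLS of Y on the treatment and the same controls, solved via QR decomposition
mod = sm.OLS(Y, pd.concat([T, data_for_reg[dum_varlist + cont_varlist + month_dum_list + ['cons']]], axis=1))
res = mod.fit(method='qr')
print(res.summary())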
                       OLS Regression Results                            
==============================================================================
Dep. Variable:               log_rate   R-squared:                       0.603
Model:                            OLS   Adj. R-squared:                  0.603
Method:                 Least Squares   F-statistic:                       nan
Date:                Sun, 07 Feb 2021   Prob (F-statistic):                nan
Time:                        16:42:41   Log-Likelihood:                -18385.
No. Observations:               38616   AIC:                         3.677e+04
Df Residuals:                   38615   BIC:                         3.678e+04
Df Model:                           0                                         
Covariance Type:            nonrobust                                         
======================================================================================================
                                         coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------------------
pro_rcs                                0.1172      0.006     19.414      0.000       0.105       0.129
PropertyType_Bed & Breakfast           0.2689      0.012     21.652      0.000       0.245       0.293
PropertyType_Bungalow                 -0.1216      0.019     -6.443      0.000      -0.159      -0.085
PropertyType_Chalet                   -0.1288      0.017     -7.500      0.000      -0.162      -0.095
PropertyType_Condominium              -0.0011      0.012     -0.089      0.929      -0.025       0.022
PropertyType_Guesthouse                0.2735      0.019     14.629      0.000       0.237       0.310
PropertyType_House                     0.0561      0.005     11.278      0.000       0.046       0.066
PropertyType_Other                    -0.0153      0.012     -1.326      0.185      -0.038       0.007
PropertyType_Vacation home             0.0689      0.020      3.425      0.001       0.029       0.108
PropertyType_Villa                     0.2174      0.008     26.974      0.000       0.202       0.233
ListingType_Private room              -0.2652      0.008    -33.330      0.000      -0.281      -0.250
ListingType_Shared room               -1.0105      0.042    -23.825      0.000      -1.094      -0.927
Superhost_Yes                          0.0112      0.007      1.627      0.104      -0.002       0.025
CancellationPolicy_Moderate           -0.0096      0.007     -1.325      0.185      -0.024       0.005
CancellationPolicy_Strict              0.0995      0.006     18.015      0.000       0.089       0.110
CancellationPolicy_Super strict 30     0.3391      0.028     12.124      0.000       0.284       0.394
CancellationPolicy_Super strict 60     0.2566      0.052      4.952      0.000       0.155       0.358
InstantbookEnabled_Yes                -0.0422      0.004     -9.585      0.000      -0.051      -0.034
listing_age                            0.0010      0.000      5.800      0.000       0.001       0.001
Bedrooms                               0.1387      0.003     40.296      0.000       0.132       0.145
Bathrooms                              0.2177      0.004     57.264      0.000       0.210       0.225
MaxGuests                              0.0241      0.002     13.538      0.000       0.021       0.028
NumberofPhotos                         0.0044      0.000     22.379      0.000       0.004       0.005
host_seniority                      5.476e-21   7.71e-22      7.099      0.000    3.96e-21    6.99e-21
ResponseTime                        5.849e-07   1.42e-07      4.130      0.000    3.07e-07    8.62e-07
ResponseRate                          -0.0004      0.000     -2.452      0.014      -0.001   -8.66e-05
OverallRating                          0.0942      0.005     18.196      0.000       0.084       0.104
MinimumStay                            0.0005      0.000      2.041      0.041    2.14e-05       0.001
month_2                               -0.0185      0.020     -0.940      0.347      -0.057       0.020
month_3                               -0.0159      0.018     -0.864      0.388      -0.052       0.020
month_4                                0.0655      0.015      4.277      0.000       0.036       0.096
month_5                                0.1287      0.015      8.629      0.000       0.099       0.158
month_6                                0.2573      0.015     17.689      0.000       0.229       0.286
month_7                                0.4868      0.014     33.991      0.000       0.459       0.515
month_8                                0.5283      0.014     36.958      0.000       0.500       0.556
month_9                                0.2185      0.015     14.780      0.000       0.189       0.247
month_10                               0.0588      0.016      3.772      0.000       0.028       0.089
month_11                              -0.0015      0.018     -0.085      0.932      -0.037       0.034
month_12                              -0.0092      0.018     -0.498      0.618      -0.045       0.027
cons                                   3.2495      0.033     98.208      0.000       3.185       3.314
==============================================================================
Omnibus:                     4627.155   Durbin-Watson:                   0.621
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            35107.433
Skew:                          -0.325   Prob(JB):                         0.00
Kurtosis:                       7.626   Cond. No.                     7.81e+19
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.81e+19. This might indicate that there are
strong multicollinearity or other numerical problems.

vsyrgkanis commented 3 years ago

I believe you might be having collinearity problems, in which case I believe the equivalence breaks. The two methods break ties among projection solutions in different ways (i.e., they regularize differently).

I also tried the experiment with some synthetic data and I am getting the same result, though I didn't use the method='qr' spec. I don't think that would be a problem.

vsyrgkanis commented 3 years ago

Oh that method thing might actually be the problem. I believe the equivalence might still hold if you use pinv (i.e. minimum norm solution). So I would not specify the method attribute.
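
To make the tie-breaking point concrete: when columns are exactly collinear, least squares has infinitely many solutions with identical fitted values, and a solver has to pick one. The sketch below, on hypothetical synthetic data, compares the minimum-norm solution from `np.linalg.pinv` with a second valid solution that pins the redundant coefficient to zero. It illustrates the general phenomenon only and does not reproduce what statsmodels' `method='qr'` or EconML do internally.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
a = rng.normal(size=n)
b = rng.normal(size=n)

# The third column is the sum of the first two: exact collinearity
X = np.column_stack([a, b, a + b])
y = 1.0 * a + 2.0 * b + rng.normal(scale=0.1, size=n)

# Minimum-norm least-squares solution (explicit cutoff for tiny singular values)
beta_pinv = np.linalg.pinv(X, rcond=1e-10) @ y

# Another valid least-squares solution: drop the redundant column,
# which is the same as fixing its coefficient at zero
beta_ab, *_ = np.linalg.lstsq(X[:, :2], y, rcond=None)
beta_drop = np.append(beta_ab, 0.0)

print("pinv :", beta_pinv)
print("drop :", beta_drop)
print("same fitted values:", np.allclose(X @ beta_pinv, X @ beta_drop))
```

Both coefficient vectors fit the data equally well, so neither is "wrong"; they simply resolve the ambiguity differently, and individual coefficients are not comparable across the two.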

gcasamat commented 3 years ago

I tried without specifying the method and obtained the same result from LinearDML:

   point_estimate  stderr    zstat  pvalue  ci_lower  ci_upper
0            4.53   0.017  267.176     0.0     4.497     4.563

I forgot to mention that I obtain the following message after fit (don't know if it is helpful): Co-variance matrix is undertermined. Inference will be invalid!

Yes, I have some multicollinearity issues. That is why I used the 'qr' method in StatsModels: it lets me reproduce the results of regressions in Stata. Without specifying this method, I get "crazy" results with StatsModels.

Do you have any suggestions for dealing with this multicollinearity? The only remedy I am aware of is to drop variables with a large VIF. Otherwise, in my experience, using the QR decomposition yields reasonable coefficient estimates. That is why I used this method.

vsyrgkanis commented 3 years ago

I would try using LinearDML with LassoCV (the default) as residualizers (or ElasticNetCV). That should take care of the multicollinearity. Also, definitely use cv=3 or 5.

The issue in LinearDML is most probably that your residuals are exactly zero due to overfitting, so the result is nonsense.
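
One way residuals can collapse to zero is sketched below: with as many controls as observations, an in-sample least-squares fit interpolates both Y and T, and the final-stage ratio then divides rounding noise by rounding noise. The data and dimensions are hypothetical and chosen as a worst case; whether this is what happened with the data in this issue is not confirmed.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 40, 40  # hypothetical worst case: as many controls as observations
W = rng.normal(size=(n, p))
T = rng.normal(size=n)
Y = 0.5 * T + rng.normal(size=n)


def in_sample_residual(v, W):
    """Residual of v after a least-squares fit on W, evaluated on the same rows."""
    coef, *_ = np.linalg.lstsq(W, v, rcond=None)
    return v - W @ coef


ry = in_sample_residual(Y, W)
rt = in_sample_residual(T, W)

# Both residual vectors are numerically zero, so this ratio carries no information
theta = (rt @ ry) / (rt @ rt)

print("||rt|| =", np.linalg.norm(rt))
print("||ry|| =", np.linalg.norm(ry))
print("theta  =", theta)
```

Fitting the first stages on held-out folds (cv=3 or 5) avoids this, because each residual is then computed from a model that never saw that row.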

gcasamat commented 3 years ago

I tried what you suggested, and LinearDML indeed gives results I believe in.

What I want to do is to compare the outcome of a standard linear regression with the dml approach. And as a consistency check, I wanted to be able to replicate the linear regression outcome with the dml approach (by adopting an appropriate parametrization). Following your comments, this does not seem to be possible, unless I fix the multicollinearity issue in my data.

Many thanks for your help.
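
For reference, the comparison discussed in this thread can be sketched on well-conditioned synthetic data: cross-fitted residuals from LassoCV, a residual-on-residual regression, and plain OLS as a benchmark. This is a hand-rolled illustration using scikit-learn, not EconML's LinearDML code; the data-generating process, fold count, and variable names are assumptions.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold, cross_val_predict

rng = np.random.default_rng(3)
n = 600

# Hypothetical data: five independent controls, true treatment effect 1.5
W = rng.normal(size=(n, 5))
T = W[:, 0] - 0.5 * W[:, 1] + rng.normal(size=n)
Y = 1.5 * T + W[:, 0] + 0.5 * W[:, 2] + rng.normal(size=n)

folds = KFold(n_splits=5, shuffle=True, random_state=123)

# Out-of-fold predictions: each row is predicted by a model that never saw it
y_hat = cross_val_predict(LassoCV(cv=3), W, Y, cv=folds)
t_hat = cross_val_predict(LassoCV(cv=3), W, T, cv=folds)
ry, rt = Y - y_hat, T - t_hat

# Final stage: regress the outcome residual on the treatment residual
theta_dml = (rt @ ry) / (rt @ rt)

# Benchmark: plain OLS of Y on [T, W, constant]
design = np.column_stack([T, W, np.ones(n)])
theta_ols = np.linalg.lstsq(design, Y, rcond=None)[0][0]

print(theta_dml, theta_ols)
```

With a full-rank design the two estimates should be close but not identical, since regularization and cross-fitting change the first-stage fits.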