py-why / EconML

ALICE (Automated Learning and Intelligence for Causation and Economics) is a Microsoft Research project aimed at applying Artificial Intelligence concepts to economic decision making. One of its goals is to build a toolkit that combines state-of-the-art machine learning techniques with econometrics in order to bring automation to complex causal inference problems. To date, the ALICE Python SDK (econml) implements orthogonal machine learning algorithms such as the double machine learning work of Chernozhukov et al. This toolkit is designed to measure the causal effect of some treatment variable(s) t on an outcome variable y, controlling for a set of features x.
https://www.microsoft.com/en-us/research/project/alice/

DMLCateEstimator with binary (logistic) outcome #204

Closed emaadmanzoor closed 4 years ago

emaadmanzoor commented 4 years ago

Hi, thanks for this useful package!

I am modeling a scenario with a binary outcome, using logistic regression for model_y. I get strange coefficients due to how the final-stage regression is set up in DMLCateEstimator.

Say \hat{Y} and \hat{T} are generated from model_y and model_t respectively. DMLCateEstimator runs a linear regression of Y - \hat{Y} on T - \hat{T} and reports the coefficient on T - \hat{T}.

When the outcome is modeled via logistic regression, wouldn't running logistic regression of Y on \hat{Y} + (T - \hat{T}) and reporting the coefficient of T - \hat{T} be more appropriate? Changing the final model to logistic regression does not fix things: it does not even run, since Y - \hat{Y} is not binary.

Below are some experiments that replicate this issue.

I generated synthetic data with a binary outcome, a continuous treatment x1, and one covariate x2; the true coefficient of interest on x1 is 0.05.
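
The data-generating code isn't included in the thread; a hypothetical generator consistent with that description might look roughly like the sketch below (all constants other than the 0.05 coefficient on x1 are illustrative guesses, not the original values):

import numpy as np

rng = np.random.default_rng(0)
n = 10901

x2 = rng.normal(0.0, 0.1, size=n)               # covariate / confounder
x1 = 2.0 * x2 + rng.normal(0.0, 1.0, size=n)    # continuous treatment, correlated with x2
p = 1.0 / (1.0 + np.exp(-(-0.5 + 0.05 * x1 + 5.0 * x2)))  # true x1 effect on the log-odds: 0.05
y = rng.binomial(1, p)

# Columns: outcome, treatment x1, covariate x2. The Y, D, X used in later
# snippets presumably refer to these columns; that mapping is an assumption.
simulated_data = np.column_stack([y, x1, x2])

The following is the statsmodels output of a logistic regression of the outcome on x1 and x2 for this data: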

import statsmodels.api as sm

y = simulated_data[:, 0]
X = sm.add_constant(simulated_data[:, 1:3])  # constant, x1, x2
model = sm.Logit(endog=y, exog=X)
res = model.fit(method="bfgs", maxiter=2000)
print(res.summary())

Optimization terminated successfully.
         Current function value: 0.580860
         Iterations: 15
         Function evaluations: 18
         Gradient evaluations: 18
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                      y   No. Observations:                10901
Model:                          Logit   Df Residuals:                    10898
Method:                           MLE   Df Model:                            2
Date:                Wed, 18 Dec 2019   Pseudo R-squ.:                  0.1586
Time:                        01:32:38   Log-Likelihood:                -6332.0
converged:                       True   LL-Null:                       -7525.8
Covariance Type:            nonrobust   LLR p-value:                     0.000
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.4981      0.021    -23.225      0.000      -0.540      -0.456
x1             0.0441      0.004     10.389      0.000       0.036       0.052
x2             5.2600      0.304     17.329      0.000       4.665       5.855
==============================================================================

I used the following code to fit a DML estimator on the same data. Note that I'm not using any regularization, so the resulting coefficient should be similar to the one on x1 above (it is not):

from econml.dml import DMLCateEstimator
from sklearn.linear_model import LinearRegression, LogisticRegression

est = DMLCateEstimator(model_y=LogisticRegression(penalty="none", solver="lbfgs"),
                       model_t=LinearRegression(),
                       linear_first_stages=False,
                       model_final=LinearRegression(),
                       n_splits=2)
est.fit(Y=Y.ravel(), T=D, W=X, inference="bootstrap")

te_pred = est.effect()
lb, ub = est.effect_interval(alpha=0.05)
print("Coef:", te_pred[0])
print("95% CI:", (lb[0], ub[0]))

Coef: 0.009016575090856738
95% CI: (0.007533168452956939, 0.011099341017661316)

The following code replicates the DMLCateEstimator naively and obtains a coefficient similar to the one above:

model_t = LinearRegression().fit(X=X, y=D)  # model_t
model_y = LogisticRegression(penalty="none", solver="lbfgs").fit(X=X, y=Y.ravel())  # model_y
yres = Y - model_y.predict(X)
tres = D - model_t.predict(X)

model = sm.OLS(endog=yres, exog=tres)
res = model.fit(method="pinv", maxiter=100)
print(res.summary())

                                 OLS Regression Results                                
=======================================================================================
Dep. Variable:                      y   R-squared (uncentered):                   0.006
Model:                            OLS   Adj. R-squared (uncentered):              0.006
Method:                 Least Squares   F-statistic:                              67.71
Date:                Fri, 20 Dec 2019   Prob (F-statistic):                    2.11e-16
Time:                        02:25:20   Log-Likelihood:                         -9482.5
No. Observations:               10901   AIC:                                  1.897e+04
Df Residuals:                   10900   BIC:                                  1.897e+04
Df Model:                           1                                                  
Covariance Type:            nonrobust                                                  
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1             0.0090      0.001      8.229      0.000       0.007       0.011
==============================================================================
Omnibus:                    84737.736   Durbin-Watson:                   1.320
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1745.940
Skew:                           0.668   Prob(JB):                         0.00
Kurtosis:                       1.566   Cond. No.                         1.00
==============================================================================

The following code modifies the final regression and obtains more sensible estimates:

model_t = LinearRegression().fit(X=X, y=D)  # model_t
model_y = LogisticRegression(penalty="none", solver="lbfgs").fit(X=X, y=Y.ravel())  # model_y
yres = Y - model_y.predict(X)
tres = D - model_t.predict(X)

model = sm.Logit(endog=Y, exog=np.hstack((model_y.predict(X).reshape(-1,1), tres.reshape(-1,1))))
res = model.fit(method="lbfgs", maxiter=100)
print(res.summary())

                           Logit Regression Results                           
==============================================================================
Dep. Variable:                      y   No. Observations:                10901
Model:                          Logit   Df Residuals:                    10899
Method:                           MLE   Df Model:                            1
Date:                Fri, 20 Dec 2019   Pseudo R-squ.:                  0.1241
Time:                        02:37:26   Log-Likelihood:                -6591.7
converged:                       True   LL-Null:                       -7525.8
Covariance Type:            nonrobust   LLR p-value:                     0.000
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
x1             4.8667      0.303     16.074      0.000       4.273       5.460
x2             0.0417      0.004     10.111      0.000       0.034       0.050
==============================================================================
realkenlee commented 4 years ago

Did you get an answer for this? I have a binary outcome problem as well and want to know how to calculate the residual.

emaadmanzoor commented 4 years ago

I ended up just implementing the partially linear double ML IV estimator from Chernozhukov et al. It does not assume any distributional form for the noise/error term, so I think it could be used as-is for binary outcomes. My simulations seem to confirm this.

realkenlee commented 4 years ago

Are you referring to this?

Y = D*theta_0 + g_0(X) + U,    E_P[U | X, Z] = 0,    (4.56)
Z = m_0(X) + V,                E_P[V | X] = 0

First of all, my problem doesn't have a Z. But even if it did, I would still need to estimate U, which requires some kind of Y - Y_hat subtraction. If Y is binary, we run into the trouble of {0, 1} - Prob. How did you get around it?


emaadmanzoor commented 4 years ago

Sorry, I confused this issue with a different one. I think econml implements this correctly for binary outcomes (no changes are required), and I was interpreting the results incorrectly.

The DMLCateEstimator estimates average marginal effects. I compared these with the logit coefficients, which are log odds ratios, not marginal effects.
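
To make that distinction concrete, here is a small sketch (mine, not from the thread) of how average marginal effects relate to the logit coefficients; logit_results stands for a fitted statsmodels Logit results object like the one in the simulation below:

import numpy as np

def logit_average_marginal_effects(logit_results):
    # For a logit with linear index x'beta, dP/dx_j = beta_j * p * (1 - p),
    # so the average marginal effect of x_j is beta_j times the sample mean
    # of p_i * (1 - p_i). This is what get_margeff() reports (it drops the
    # intercept entry) and the scale the DML estimates should be compared on.
    p = logit_results.predict()  # in-sample fitted probabilities
    return logit_results.params * np.mean(p * (1.0 - p))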

The following simulation confirms that DMLCateEstimator works as expected:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

from tqdm import tqdm

from econml.dml import DMLCateEstimator
from sklearn.linear_model import LinearRegression, LogisticRegression

def sigmoid(x):
    return 1.0/(1.0 + np.exp(-x))

logit_margeff_x1 = []
logit_margeff_x2 = []
logit_coef_x1 = []
logit_coef_x2 = []
dml_x1 = []
dml_x2 = []

for sim in tqdm(range(1000)):
    data = []
    for row in range(1000):
        x1 = np.random.normal(5.0, 1.0)
        x2 = np.random.normal(2.0, 1.0)

        Y_prob = sigmoid(-4.0 + 0.5 * x1 + 0.25 * x2)
        Y = np.random.choice([0, 1], p=[1-Y_prob, Y_prob])

        data.append([Y, x1, x2])

    data = pd.DataFrame(data, columns=["Y", "x1", "x2"])

    # logit average marginal effects
    logit_model = smf.logit("Y ~ x1 + x2", data=data)
    logit_results = logit_model.fit(disp=0)
    x1_coef, x2_coef= logit_results.params.x1, logit_results.params.x2
    x1_margeff, x2_margeff = logit_results.get_margeff().margeff

    logit_margeff_x1.append(x1_margeff)
    logit_margeff_x2.append(x2_margeff)
    logit_coef_x1.append(x1_coef)
    logit_coef_x2.append(x2_coef)

    # dml estimates
    est = DMLCateEstimator(model_y=LogisticRegression(penalty="none", solver="lbfgs"),
                           model_t=LinearRegression(),
                           linear_first_stages=False,
                           model_final=LinearRegression(fit_intercept=False),
                           fit_cate_intercept=True,
                           n_splits=4)
    est.fit(Y=data["Y"].values, T=data[["x1", "x2"]].values, W=None, inference=None)
    te_pred = est.const_marginal_effect()[0]
    dml_x1.append(te_pred[0])
    dml_x2.append(te_pred[1])

plt.hist(logit_coef_x1, bins=25)
plt.grid()
plt.title("logit coef x1")
plt.show()

plt.hist(logit_coef_x2, bins=25)
plt.grid()
plt.title("logit coef x2")
plt.show()

plt.hist(logit_margeff_x1, bins=25)
plt.grid()
plt.title("logit margeff x1")
plt.show()

plt.hist(logit_margeff_x2, bins=25)
plt.grid()
plt.title("logit margeff x2")
plt.show()

plt.hist(dml_x1, bins=25)
plt.grid()
plt.title("dml x1")
plt.show()

plt.hist(dml_x2, bins=25)
plt.grid()
plt.title("dml x2")
plt.show()

The distributions of the logit coefficients are as expected (figures: logit-coef-x1, logit-coef-x2).

The distributions of the logit average marginal effects are what DMLCateEstimator should converge to (figures: logit-margeff-x1, logit-margeff-x2).

The DMLCateEstimator estimates do indeed converge to the logit average marginal effects (figures: dml-x1, dml-x2).

vsyrgkanis commented 4 years ago

Thanks @emaadmanzoor and @realkenlee for the discussion!

1) The easiest way to handle a binary outcome, if you want to use a classifier for E[Y|X], is to build a regression wrapper and pass it as model_y. Some example code:

import numpy as np

from econml.dml import LinearDMLCateEstimator
from sklearn.linear_model import LogisticRegression

class RegWrapper:
    def __init__(self, classifier):
        self.classifier = classifier
    def fit(self, X, y, **kwargs):
        return self.classifier.fit(X, y, **kwargs)
    def predict(self, X):
        # expose P(Y = 1 | X) through a regression-style predict()
        return self.classifier.predict_proba(X)[:, 1]

est = LinearDMLCateEstimator(model_y=RegWrapper(LogisticRegression()),
                             model_t=LogisticRegression(),
                             discrete_treatment=True)
n = 1000
X = np.random.uniform(-1, 1, size=(n, 1))
D = np.random.binomial(1, .5 + .1*X[:, 0], size=(n,))
Y = np.random.binomial(1, .5 + .2*D + .1*X[:, 0], size=(n,))
est.fit(Y, D, W=X, inference='statsmodels')
est.summary()

[screenshot: est.summary() output]

If you want to get fancier and expose all the attributes of the classifier as attributes of the wrapper, you can use the version we implemented in our dmliv prototype: https://github.com/microsoft/EconML/blob/3606b0bcc7779b78e6df8991dbcd7b72ac3046ef/prototypes/dml_iv/utilities.py#L117 With this version you can access the coef_ of the underlying classifier as:

est = RegWrapper(LogisticRegression()).fit(X, y)
est.coef_

while with the simpler version above you would need to do:

est.classifier.coef_
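
For reference, a rough sketch (mine, not the prototype's actual code) of a wrapper that forwards attribute access to the underlying classifier could look like:

class DelegatingRegWrapper:
    def __init__(self, classifier):
        self.classifier = classifier
    def fit(self, X, y, **kwargs):
        self.classifier.fit(X, y, **kwargs)
        return self
    def predict(self, X):
        return self.classifier.predict_proba(X)[:, 1]
    def __getattr__(self, name):
        # only called when normal attribute lookup fails; forward to the
        # wrapped classifier so e.g. `wrapper.coef_` works after fitting
        return getattr(self.classifier, name)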

2) It's not clear when the above approach would be theoretically sound (though I suspect it might be ok in practice). Observe that the theory requires that the model is linear in D, i.e.:

Y = D*theta(X) + g(X) + epsilon
D = p(X) + eta

So

E[Y | X] = p(X) * theta(X) + g(X)

However, logistic regression assumes that:

E[Y | X] = logistic( <q, X>) 

It's not clear that there are natural primitive assumptions on p, theta and g such that you can write the former as the latter. For instance, if theta(X) = theta is a constant, then the above assumes that p(X)*theta + g(X) = logistic(<q, X>), which means that somehow the propensity p(X) is related to the confounding effect g(X) via that relationship. So from a theoretical perspective it might be better to just use a linear probability model for Y (i.e. use a lasso for model_y even if Y is binary).
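
For example, a rough sketch of that linear-probability-model route (mine, not from the thread; it reuses the LinearDMLCateEstimator API and the n, X, D, Y simulated in the snippet above) might look like:

from sklearn.linear_model import LassoCV, LogisticRegression
from econml.dml import LinearDMLCateEstimator

# Sketch: fit E[Y|X] with a cross-validated lasso, i.e. treat the binary Y as
# following a linear probability model, instead of wrapping a classifier.
est_lpm = LinearDMLCateEstimator(model_y=LassoCV(cv=5),
                                 model_t=LogisticRegression(),
                                 discrete_treatment=True)
est_lpm.fit(Y, D, W=X, inference='statsmodels')
est_lpm.summary()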

For a binary outcome, it is more reasonable to assume the model:

E[Y | D, X] = logistic( D* theta(X) + g(X) ) 
D = p(X) + eta

But this is a non-linear equation for Y and requires different theoretical arguments for orthogonalization. For instance, check out Section 2.1, p. 16 of https://arxiv.org/pdf/1806.04823.pdf, where we derive orthogonal moments for such a non-linear logistic Y model. This is not implemented in our library.

realkenlee commented 4 years ago

Thanks! So let me rewrite it this way:

E[Y | D, X] = logistic( D*theta(X) + g(X) + U )
D = p(X) + V

where E[U | X, D] = 0 and E[V | X] = 0.

Estimating p should be a simple regression problem. How should one estimate g? What is the right approach here? More concretely, how does one ultimately get to the "residual" U so we can manipulate it downstream?


vsyrgkanis commented 4 years ago

@realkenlee The orthogonal moment described in the paper I noted has a different structure and does not boil down to a residual-on-residual regression. So indeed there is no analogue of the residual of Y, as you note.

Since that method is not yet implemented and has not been stress-tested, I'm not sure it's the best option, but you can try it. You can check the code in the GitHub repo for that paper; in particular, this part implements the orthogonal estimation method for the logistic model that you described: https://github.com/vsyrgkanis/plugin_regularized_estimation/blob/1bcbad4803b2b7834477ab39051d40f7758c408b/logistic_te.py#L104

ghost commented 4 years ago

@vsyrgkanis

Sorry for re-opening this closed issue... two questions on this topic:

  1. You mentioned that you and the team derived a Double ML version for a binary outcome, as specified in this paper (https://arxiv.org/pdf/1806.04823.pdf). You also pointed to a script that implements that binary-outcome Double ML method: https://github.com/vsyrgkanis/plugin_regularized_estimation/blob/1bcbad4803b2b7834477ab39051d40f7758c408b/logistic_te.py#L104

However, you mentioned "Since that method is not yet implemented and has not been stress tested I'm not sure it's the best option but you can try."

Is this script still at a beta stage, i.e. are we not advised to use it? Or can we tweak the binary-outcome Double ML code you shared? E.g., would it be a good option for me to follow the repository for your logistic + double ML paper? https://github.com/vsyrgkanis/plugin_regularized_estimation/tree/1bcbad4803b2b7834477ab39051d40f7758c408b

In your repo's README, I looked at the section "ORTHOPY library", where it looks promising to use the class LogisticWithOffsetAndGradientCorrection(), which "is an estimator adhering to the fit and predict specification of sklearn that enables fitting an 'orthogonal' logistic regression". Can I then specify it in the Orthogonal/Double ML method like the following?


est = DMLCateEstimator(model_y=LogisticWithOffsetAndGradientCorrection(),
                       model_t=sklearn_classifier(),
                       model_final=sklearn_linear_regression())

If the above looks reasonable, how exactly would the last-stage "residual of Y ~ residual of treatment" regression look? Would it also be a logistic regression for the last-stage residual regression?

  2. Does the Doubly Robust estimator also have this concern? I.e., what's the best practice if we want to use econml.drlearner for a binary outcome? Would a logistic regression in the first-stage model_y violate the linearity assumption?

I found these tutorial slides on specifying a doubly robust estimator for logistic regression: https://www4.stat.ncsu.edu/~davidian/double.pdf. However, beyond those slides, most other papers I found about doubly robust estimation assume a continuous outcome variable Y, and in the EconML user guide for the Doubly Robust Learner I can't find anywhere to specify a binary outcome variable: https://econml.azurewebsites.net/spec/estimation/dr.html. So I would appreciate learning more about that.

(I know we can use the RegWrapper utility as you mentioned above, but I just want to know whether doing this would also be theoretically sound for the causal interpretation with the Doubly Robust learner.)

Thanks!

ghost commented 3 years ago

Hi! Following up on this: is there a newer Double ML estimator or flag that I should use in the latest 0.9.0b1 beta release (release notes)?

Or should I keep using the RegWrapper class provided by @vsyrgkanis in #204?

from econml.dml import LinearDMLCateEstimator
from sklearn.linear_model import LogisticRegression
class RegWrapper:
    def __init__(self, classifier):
        self.classifier = classifier
    def fit(self, X, y, **kwargs):
        return self.classifier.fit(X, y, **kwargs)
    def predict(self, X):
        return self.classifier.predict_proba(X)[:, 1]

est = LinearDMLCateEstimator(model_y=RegWrapper(LogisticRegression()),
                             model_t=LogisticRegression(),
                             discrete_treatment=True)
kbattocchi commented 3 years ago

@raylinz That is not something that we have addressed in this release, so that's probably still your best bet.

jorje1908 commented 3 years ago

Sorry for commenting again on this closed issue, but my understanding is that DML applies to a continuous outcome rather than a binary one (due to the structural-equation formulation). Am I misunderstanding something here? Logistic regression models a Bernoulli variable, while the structural equation models a variable whose distribution depends on epsilon.

vaxherra commented 3 years ago

Sorry for commenting again on this closed issue, but my understanding is that DML applies to a continuous outcome rather than a binary one (due to the structural-equation formulation). Am I misunderstanding something here? Logistic regression models a Bernoulli variable, while the structural equation models a variable whose distribution depends on epsilon.

I am a student of causal inference, so I'm still gaining momentum here. Can somebody point me in the direction of a modeling approach where a binary outcome is appropriate?

bbertoni commented 3 years ago

I'd also appreciate some insight here. It seems like there are theoretical (not practical) difficulties with binary outcomes in the DML and DR frameworks, but meta-learners seem like they would be okay.
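
For what it's worth, here is a rough sketch of the meta-learner route for a binary outcome (mine, not from the thread; it assumes econml's TLearner API and reuses the simulated data pattern from earlier in the thread). The T-learner only needs estimates of E[Y | X, T=t], which for a binary Y are just P(Y = 1 | X, T=t), so any regression learner fit on the 0/1 outcome works and the CATE is a difference of predicted probabilities:

import numpy as np
from econml.metalearners import TLearner
from sklearn.ensemble import GradientBoostingRegressor

n = 1000
X = np.random.uniform(-1, 1, size=(n, 1))
D = np.random.binomial(1, .5 + .1*X[:, 0], size=(n,))
Y = np.random.binomial(1, .5 + .2*D + .1*X[:, 0], size=(n,))

# Each arm's model regresses the binary Y on X, so its predictions are
# estimates of P(Y = 1 | X, T=t).
est = TLearner(models=GradientBoostingRegressor())
est.fit(Y, D, X=X)
te = est.effect(X)   # estimated P(Y=1 | X, D=1) - P(Y=1 | X, D=0)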