[Bug] fit_cate_intercept argument in econml.dml.DML does not add intercept correctly

julioasotodv commented 3 months ago

Hi!

I believe I have found a bug in the econml.dml.DML class (and potentially others that use the same mechanism).

In theory, when the DML class is instantiated with fit_cate_intercept=True, it should combine:

- The residuals from the treatment variables (T)
- The remaining features (namely, X)

into a single feature dataset used to train the final (linear) model, model_final.

To better leverage the interactions between the treatment residuals T and X, two-way interactions between the variables are computed (the cross-product between them).

Finally, with fit_cate_intercept=True, an additional constant feature of 1s should be added.

The issue is that fit_cate_intercept=True currently adds the feature of 1s to X first, and only then computes the cross-product between X and T. We therefore end up with one intercept-like column per treatment (each equal to that treatment's residual), and none of them is a constant feature of 1s. This leads to multicollinearity, and on top of that no true intercept is generated.

This can be seen here. This is the function used to generate the final feature set for model_final: https://github.com/py-why/EconML/blob/6219695cd1a6a0ff492a22a5585c15537d5d41a6/econml/dml/dml.py#L139-L154

self._featurizer in L142 will add the intercept feature to X if fit_cate_intercept=True, generating F. And then in L153 the cross product between F and T is computed.
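
The order of operations described above can be illustrated with a small pure-Python sketch. It does not call EconML: `add_ones_column` and `row_cross_product` are hypothetical helpers that only mimic the two steps (add a 1s column to X, then take the per-row cross product with the T residuals), and the numbers are made up.

```python
# Hypothetical stand-ins for the two steps discussed above; not EconML code.

def add_ones_column(X):
    """Prepend a constant 1.0 feature to every row of X."""
    return [[1.0] + list(row) for row in X]


def row_cross_product(F, T):
    """For each sample, return every product f_j * t_k as one flat row."""
    return [[f * t for t in t_row for f in f_row]
            for f_row, t_row in zip(F, T)]


# Made-up data: 3 samples, 2 features, 1 treatment residual per sample.
X = [[0.2, 1.5], [0.7, -0.3], [1.1, 0.4]]
T_res = [[0.5], [-1.0], [0.25]]

F = add_ones_column(X)                         # columns: 1, x1, x2
final_features = row_cross_product(F, T_res)   # columns: t, x1*t, x2*t

# The former 1s column is now just a copy of the T residual.
first_column = [row[0] for row in final_features]

# Check whether any column of the final feature set is constant 1s.
has_constant_column = any(
    all(row[j] == 1.0 for row in final_features)
    for j in range(len(final_features[0]))
)

print(first_column)
print(has_constant_column)
```

In this sketch the 1s column turns into a copy of the T residual, so what it contributes is a coefficient on T rather than a free-standing constant term.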

Thank you!

kbattocchi commented 3 months ago

Thanks for reaching out, but this behavior is by design. With a good first-stage model_y, the Y residuals should average to (approximately) zero (otherwise the model could be improved by adding that average), so there would be no point in adding an intercept that is not interacted with the T residuals. Likewise, we don't include the columns of X themselves in the final regression, since those should also have been handled by the first-stage Y model. We include only the interaction of the featurized Xs (plus the intercept, if enabled) with the T residuals, because we are assuming that the CATE can be expressed as a linear combination of those terms (see the DML section here).
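
To see why the interacted intercept is the term of interest, consider a toy final stage in pure Python. The assumptions are loud ones: a single treatment, a single feature x, noise-free residuals generated from a made-up CATE theta(x) = 2 + 3x, and a hypothetical helper `fit_two_feature_ols` that solves the 2x2 normal equations. None of this is EconML code.

```python
def fit_two_feature_ols(f1, f2, y):
    """Least squares for y ~ a*f1 + b*f2, with no separate constant term."""
    s11 = sum(u * u for u in f1)
    s12 = sum(u * v for u, v in zip(f1, f2))
    s22 = sum(v * v for v in f2)
    r1 = sum(u * w for u, w in zip(f1, y))
    r2 = sum(v * w for v, w in zip(f2, y))
    det = s11 * s22 - s12 * s12
    a = (r1 * s22 - r2 * s12) / det
    b = (s11 * r2 - s12 * r1) / det
    return a, b


# Made-up feature values and treatment residuals (the residuals average to zero).
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
t_res = [0.5, -1.0, 1.5, -0.5, 1.0, -1.5]

# Synthetic outcome residuals from an assumed CATE: theta(x) = 2 + 3*x.
y_res = [(2.0 + 3.0 * xi) * ti for xi, ti in zip(x, t_res)]

# Final-stage features: the "intercept" interacted with T, and x interacted with T.
intercept_times_t = list(t_res)                        # 1 * t_res
x_times_t = [xi * ti for xi, ti in zip(x, t_res)]      # x * t_res

theta0, theta1 = fit_two_feature_ols(intercept_times_t, x_times_t, y_res)
print(round(theta0, 6), round(theta1, 6))
```

Here the coefficient on the `1 * t_res` column recovers the constant part of the assumed CATE, which is the role the interacted intercept plays in the explanation above.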

julioasotodv commented 3 months ago

I see, very clear. Thank you!

julioasotodv commented 3 months ago

Actually, I just saw that the same is done for econml.dml.NonParamDML. Given that there is no restriction on how model_final works in this case, is it still required to compute the cross product between T and X? Since model_final can be a non-linear scikit-learn estimator (such as a Gradient Boosting Regressor), I believe this is done just to keep the API homogeneous, right?

Thank you!

kbattocchi commented 3 months ago

> Actually, I just saw that the same is done for econml.dml.NonParamDML.

What do you mean by this? NonParamDML doesn't have a fit_cate_intercept attribute at all, nor does it interact the features with the residuals - it just fits an arbitrary final model regressing the quotient of the residuals onto the featurized X.
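
One way to read "regressing the quotient of the residuals" is through the identity (y - theta*t)^2 = t^2 * (y/t - theta)^2, which suggests weighting each quotient by the squared T residual. The sketch below checks this for the simplest possible final model, a constant. The weighting is an assumption drawn from that identity rather than something stated in this thread, and the data are made up.

```python
# Made-up residuals for a single treatment; not produced by EconML.
t_res = [0.5, -1.0, 1.5, -0.5, 2.0]
y_res = [1.2, -1.9, 3.1, -0.8, 4.1]

# Target of the final model: the per-sample quotient of the residuals.
quotients = [y / t for y, t in zip(y_res, t_res)]

# Assumed sample weights: squared treatment residuals.
weights = [t * t for t in t_res]

# Constant "final model" fit on the quotients with those weights.
theta_from_quotients = (
    sum(w * q for w, q in zip(weights, quotients)) / sum(weights)
)

# Direct least-squares slope of y_res on t_res (no constant term).
theta_from_ols = (
    sum(y * t for y, t in zip(y_res, t_res)) / sum(t * t for t in t_res)
)

print(theta_from_quotients, theta_from_ols)
```

Under these assumptions the two estimates coincide, so no interaction between X and the T residuals is needed: the final model only ever sees X as its input.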

julioasotodv commented 3 months ago

Hi again,

I just checked NonParamDML again, and it does perform the cross product... However, it is done between X and a column of 1s, so you are right: it has no effect whatsoever.
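
A quick pure-Python check of that last point, using a hypothetical `row_cross_product` helper (not the EconML function) and made-up values:

```python
# Hypothetical helper that mimics a per-row cross product; not EconML code.
def row_cross_product(F, T):
    """For each sample, return every product f_j * t_k as one flat row."""
    return [[f * t for t in t_row for f in f_row]
            for f_row, t_row in zip(F, T)]


X = [[0.2, 1.5], [0.7, -0.3]]
ones = [[1.0], [1.0]]   # a single column of 1s in place of T residuals

result = row_cross_product(X, ones)
unchanged = result == X
print(unchanged)
```

Multiplying every feature by 1.0 returns the same values, so in this sketch the final model receives X exactly as it was.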

Thank you, and sorry for the inconvenience.

py-why / EconML #865