Best practices to handle NaN/missing values in W or X features?

ghost commented 3 years ago

Hi! Thanks for developing this powerful package. I noticed that the Orthogonal/Double Machine Learning estimators do not accept any missing/NaN values in my features (W & X), even when I specify my first-stage Y_model & T_model to be XGBoost classifiers/regressors imported from the xgboost package. On its own, the xgboost implementation accepts NaN feature values by sending them down a dedicated branch at each split.

If I do not impute and call the Double ML estimators, I run into the error shown in the attached screenshot. It looks like the error comes from a call into sklearn's validation.py, which does not accept missing values.

Based on your hands-on experience, what would be the best practice for imputing the NaNs in my W & X features? Take the median value of a feature, or 0, if it's a continuous variable? Or simply replace NaNs with the string value 'nan' so that the feature is treated as categorical when fitting XGBoost or other boosted-tree libraries such as CatBoost or LightGBM?
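For reference, the median option asked about above can be sketched in a few lines of NumPy. This is only an illustration, not an EconML recommendation: the helper name `impute_median_with_indicator` is hypothetical, and appending 0/1 missingness flags is an added assumption so that the downstream model can still see where values were missing.

```python
import numpy as np

def impute_median_with_indicator(features):
    """Fill NaNs with per-column medians and append 0/1 missingness flags."""
    features = np.asarray(features, dtype=float)
    missing = np.isnan(features)
    # nanmedian ignores NaNs when computing each column's median.
    medians = np.nanmedian(features, axis=0)
    filled = np.where(missing, medians, features)
    # Only add indicator columns for features that actually had missing values.
    flags = missing[:, missing.any(axis=0)].astype(float)
    return np.hstack([filled, flags])

# Toy feature matrix with one missing value in each column.
X = np.array([[1.0, np.nan],
              [2.0, 5.0],
              [np.nan, 7.0],
              [4.0, 9.0]])
X_imputed = impute_median_with_indicator(X)
```

In a real workflow the medians would be computed on the training rows only and then reused for held-out rows; the sketch skips that step for brevity.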

Or can you help point me to the source code where this can be modified so that the Double ML estimators can accept NaN by default?

Thank you!

[Attached screenshot: Screen Shot 2021-02-19 at 12 41 05 PM]

morelandjs commented 3 years ago

Also running into the same issue. XGBoost can compute the propensity scores and/or regress the mean response variable, even if some of the confounder values are null. This check seems to be blocking the use of XGBoost with null values unnecessarily.

mbessier commented 3 years ago

Also running into the same issue using LightGBM. This forces me to impute missing values unnecessarily, and bad imputations may even hurt the performance of my models.

esbraun commented 2 years ago

I agree it's important for econml to accept missing values in order to support algorithms that handle missing values directly (most notably xgboost, lightgbm and catboost). Forcing the imputation of missing values is suboptimal in many circumstances. causalml already supports this functionality.

dsteinberg commented 2 years ago

Agreed with @esbraun. Even just being able to use scikit-learn pipeline estimators with imputation stages would help greatly. You could imagine a form of multiple imputation using this strategy, with more MC steps in the various metalearners. All that would need to change is to allow NaN to pass through to the input estimators; there is no need to call the scikit-learn checks as early as here:

https://github.com/microsoft/EconML/blob/7dd7683c987018511a07318b6f4b165018373aad/econml/utilities.py#L544

The underlying estimator could handle the inputs when appropriate.
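A minimal sketch of the kind of pipeline estimator meant here, using only scikit-learn and synthetic data. It shows that an imputation stage inside the estimator copes with NaN by itself; whether EconML lets the NaNs reach such an estimator depends on the input checks discussed in this thread, so this is not a confirmed workaround. The data, the coefficients, and the choice of `LinearRegression` are illustrative assumptions.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Synthetic controls W and outcome y (illustrative only).
rng = np.random.default_rng(0)
W = rng.normal(size=(200, 3))
y = W @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Knock out roughly 10% of the entries to simulate missing confounders.
W_missing = W.copy()
W_missing[rng.random(W.shape) < 0.1] = np.nan

# The imputer lives inside the estimator, so it is refit together with the
# regressor every time the pipeline is cloned and fitted on a new fold.
first_stage = make_pipeline(
    SimpleImputer(strategy="median", add_indicator=True),
    LinearRegression(),
)
first_stage.fit(W_missing, y)
predictions = first_stage.predict(W_missing)
```

Because the imputation is part of the estimator, no statistics from held-out folds leak into the fit when the pipeline is used in cross-fitting.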

mshijie commented 2 years ago

Any update on this issue? I'm using CatBoost as the underlying model, which also supports NaN feature values.

dsteinberg commented 2 years ago

Thinking about this more, I'm concerned that not allowing the propagation of NaNs can actually lead to bias/overconfidence. See issue #664.

olamagnusandersson commented 1 year ago

Hi @moprescu @kbattocchi

Would it be possible to get a solution to this problem? I would very much appreciate it. Maybe just raise a warning, or add a guide to https://econml.azurewebsites.net/spec/estimation/dml.html?

esbraun commented 1 year ago

Seconding @olamagnusandersson. I’d still rather see a warning than an error thrown for the reasons below.

I agree it's important for econml to accept missing values in order to support algorithms that handle missing values directly (most notably xgboost, lightgbm and catboost). Forcing the imputation of missing values is suboptimal in many circumstances. causalml already supports this functionality.

vs759 commented 1 year ago

Hi, just wondering whether there has been any solution to this problem yet? I am facing the same issue with NaN in X.

fverac commented 1 year ago

Thanks all for your feedback. We currently have a PR in progress to enable missing values for W. Enabling missing values in X is less clear-cut; see this message from a discussion on our Discord:

I agree that it would be reasonable for us to address this, but note that for many of our estimators this would really only work for W and not for X (e.g. for LinearDML, our second stage model is running a regression of Y_res on (T_res cross X), so any NaNs in X will be a problem even if the first stage model handles them without issue).
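The point about the second stage can be illustrated with a small NumPy sketch. It assumes the final-stage features are built as T_res multiplied by [1, X], which is a simplification of what LinearDML does internally; the residuals here are random stand-ins, not output from real first-stage models.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
X = rng.normal(size=(n, 2))
X[2, 1] = np.nan                  # a single missing value in X
T_res = rng.normal(size=(n, 1))   # stand-in for treatment residuals
Y_res = rng.normal(size=n)        # stand-in for outcome residuals

# Final-stage design matrix: each row is T_res_i * [1, X_i].
features = T_res * np.hstack([np.ones((n, 1)), X])

# The NaN survives the product, so the normal-equation matrix is contaminated
# no matter how well the first-stage models handled missing values.
gram = features.T @ features
has_nan_features = bool(np.isnan(features).any())
has_nan_gram = bool(np.isnan(gram).any())

# Restricting to complete rows gives a solvable least-squares problem again.
complete = ~np.isnan(X).any(axis=1)
coef, *_ = np.linalg.lstsq(features[complete], Y_res[complete], rcond=None)
```

Dropping incomplete rows is shown only to make the contrast visible; it can introduce bias when values are not missing at random, so it should not be read as the recommended fix.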

py-why / EconML, issue #414