py-why / EconML

ALICE (Automated Learning and Intelligence for Causation and Economics) is a Microsoft Research project aimed at applying Artificial Intelligence concepts to economic decision making. One of its goals is to build a toolkit that combines state-of-the-art machine learning techniques with econometrics in order to bring automation to complex causal inference problems. To date, the ALICE Python SDK (econml) implements orthogonal machine learning algorithms such as the double machine learning work of Chernozhukov et al. This toolkit is designed to measure the causal effect of some treatment variable(s) t on an outcome variable y, controlling for a set of features x.
https://www.microsoft.com/en-us/research/project/alice/

Featurizer for ForestDrLearner, and in nuisance models #291

Open lauracosgrove opened 3 years ago

lauracosgrove commented 3 years ago

Hi, thank you for all the work you do. I have some conceptual questions, if contributors would like to weigh in:

  1. Is there intuition for why a featurizer is not implemented for the ForestDRLearner (but is for the other DRLearners)? Is that because featurizers were introduced to add flexible interactions for the linear methods, and so were thought to be redundant for the more flexible forest methods?

  2. I ask this question because I want to bring some models into production, and so am hoping to move some data processing and feature engineering into a scikit-learn Pipeline (median imputation, one-hot encoding). If transitioning the nuisance models to a Pipeline, would it be at all feasible to pass, e.g., features with null values as W?

  3. Relatedly, is there thought to be a best practice for handling imputation in X (treatment effect modifiers) - beyond removing features that require imputation?

kbattocchi commented 3 years ago
  1. Yes, the reason is that the forest is not constrained to learning a linear model so explicit featurization should not be necessary.
  2. Could you expand on what you mean? Internally, we're roughly just using np.hstack([X, W]) as the input to the first-stage models (assuming neither X nor W is None). So you can use a pipeline as a nuisance model, but its input is going to consist of the entries in both X and W; if you want to do W- or X-specific preprocessing then that would need to happen outside of the model.
  3. Likewise, I would recommend modifying X outside of the estimator as needed and passing the resulting array in its place, particularly because you want to make sure that your two nuisance models and the final model are seeing a consistent view of the features.

Hope that helps.
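To illustrate the point about stacked inputs: a Pipeline can serve as a nuisance model, but it receives the concatenation of X and W as a single array, so any NaN-aware step (like imputation) has to work positionally rather than by column name. A minimal sketch with made-up toy data (the array and model names are hypothetical, not part of the EconML API):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: X (effect modifiers) and W (controls), with NaNs in W.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
W = rng.normal(size=(100, 3))
W[rng.random(W.shape) < 0.1] = np.nan
t = rng.integers(0, 2, size=100)  # binary treatment

# A nuisance model wrapped in a Pipeline. It will be fed something like
# np.hstack([X, W]), so column-name-based transforms are unavailable, but
# positional steps like imputation and scaling still work.
propensity_model = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
    ('clf', LogisticRegression()),
])

XW = np.hstack([X, W])  # roughly what the estimator passes internally
propensity_model.fit(XW, t)
```

Any W- or X-specific preprocessing (e.g., imputing only the W columns by name) would have to happen before the arrays are handed to the estimator, as noted above.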

lauracosgrove commented 3 years ago

Hi, thanks, that's helpful!

  1. I was hoping to do some column-name- or column-type-based transformations, as in the Pipeline code below. The reason I hoped to do this within a pipeline for the nuisance models is that the transformations include imputation for confounder variables, so I wanted to avoid leakage when selecting the nuisance model with grid search. However, after pressing on, I see that the code below with the custom selectors won't work in the nuisance models, since they rely on pandas datatypes and EconML converts to a np array, as you said.

  2. That makes sense. I'll transition to modifying X and W outside of the model fit and save the Pipeline object along with the ForestDRLearner model for consistent imputation.

```python
from sklearn import compose
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler
import numpy as np
import pandas as pd

ColumnTransformer(transformers=[
    ('ccc_transformer', Pipeline(steps=[
        ('num_imputer', SimpleImputer(add_indicator=True, copy=False, strategy='median')),
        ('num_scaler', StandardScaler()),
    ]), compose.make_column_selector(pattern='(^aaa_)|(ccc)')),
    ('numeric_transformer1', Pipeline(steps=[
        ('num_imputer', SimpleImputer(strategy='constant', fill_value=0, copy=False)),
        ('num_scaler', StandardScaler()),
    ]), compose.make_column_selector(dtype_include=np.number)),
    ('numeric_transformer2', Pipeline(steps=[
        ('num_imputer', SimpleImputer(strategy='constant', fill_value=0, copy=False)),
        ('num_log', FunctionTransformer(np.log1p, validate=False)),
        ('num_scaler', StandardScaler()),
    ]), compose.make_column_selector(dtype_include=['float64'])),
    ('categorical_transformer', Pipeline(steps=[
        ('cat_imputer', SimpleImputer(strategy='constant', fill_value='NA', copy=False)),
        ('cat_onehot', OneHotEncoder(sparse=False, handle_unknown='ignore')),
    ]), compose.make_column_selector(dtype_include=pd.CategoricalDtype)),
], n_jobs=1, remainder='passthrough')
```
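Following up on point 2 above: one way to keep imputation consistent between training and scoring is to fit the ColumnTransformer separately, pass its output array to the estimator as X (or W), and persist the fitted transformer alongside the model. A rough sketch of that workflow with a made-up data frame and file name (not from the original thread):

```python
import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Hypothetical frame with a numeric column (with a missing value) and a
# categorical column.
df = pd.DataFrame({
    'aaa_x': [1.0, np.nan, 3.0, 4.0],
    'cat': pd.Categorical(['a', 'b', None, 'a']),
})

# Fit the preprocessing outside the estimator so the nuisance models and the
# final model all see the same imputed view of the features.
pre = ColumnTransformer(transformers=[
    ('num', Pipeline(steps=[('imp', SimpleImputer(strategy='median'))]),
     make_column_selector(dtype_include=np.number)),
], remainder='drop')

X_clean = pre.fit_transform(df)  # a plain ndarray, safe to pass as X or W

# Persist the fitted transformer next to the model so scoring-time data is
# imputed the same way.
joblib.dump(pre, 'preprocessor.joblib')
pre_loaded = joblib.load('preprocessor.joblib')
X_new = pre_loaded.transform(df)
```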