py-why / EconML

ALICE (Automated Learning and Intelligence for Causation and Economics) is a Microsoft Research project aimed at applying Artificial Intelligence concepts to economic decision making. One of its goals is to build a toolkit that combines state-of-the-art machine learning techniques with econometrics in order to bring automation to complex causal inference problems. To date, the ALICE Python SDK (econml) implements orthogonal machine learning algorithms such as the double machine learning work of Chernozhukov et al. This toolkit is designed to measure the causal effect of some treatment variable(s) t on an outcome variable y, controlling for a set of features x.
https://www.microsoft.com/en-us/research/project/alice/

Featurizer for ForestDrLearner, and in nuisance models #291

Open lauracosgrove opened 3 years ago

lauracosgrove commented 3 years ago

Hi, thank you for all the work you do. I have some conceptual questions, if contributors would like to weigh in:

  1. Is there intuition for why a featurizer is not implemented for the ForestDRLearner (but is for the other DRLearners)? Is that because featurizers were introduced to add flexible interactions for the linear methods, and so were thought to be redundant for the more flexible forest methods?

  2. I ask this question because I want to bring some models into production, and so am hoping to move some data processing and feature engineering into a scikit-learn Pipeline (median imputation, one-hot encoding). If transitioning the nuisance models to a Pipeline, would it be at all feasible to pass, e.g., features with null values as W?

  3. Relatedly, is there thought to be a best practice for handling imputation in X (treatment effect modifiers) - beyond removing features that require imputation?

kbattocchi commented 3 years ago
  1. Yes, the reason is that the forest is not constrained to learning a linear model so explicit featurization should not be necessary.
  2. Could you expand on what you mean? Internally, we're roughly just using np.hstack([X, W]) as the input to the first-stage models (assuming neither X nor W is None). So you can use a pipeline as a nuisance model, but its input is going to consist of the entries in both X and W; if you want to do W- or X-specific preprocessing then that would need to happen outside of the model.
  3. Likewise, I would recommend modifying X outside of the estimator as needed and passing the resulting array in its place, particularly because you want to make sure that your two nuisance models and the final model are seeing a consistent view of the features.

Hope that helps.
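To illustrate the point about stacked inputs: a Pipeline can serve as a nuisance model, but it receives the concatenation of X and W as a single array, so any NaN-aware step (like imputation) has to work positionally rather than by column name. A minimal sketch with made-up toy data (the array and model names are hypothetical, not part of the EconML API):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: X (effect modifiers) and W (controls), with NaNs in W.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
W = rng.normal(size=(100, 3))
W[rng.random(W.shape) < 0.1] = np.nan
t = rng.integers(0, 2, size=100)  # binary treatment

# A nuisance model wrapped in a Pipeline. It will be fed something like
# np.hstack([X, W]), so column-name-based transforms are unavailable, but
# positional steps like imputation and scaling still work.
propensity_model = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
    ('clf', LogisticRegression()),
])

XW = np.hstack([X, W])  # roughly what the estimator passes internally
propensity_model.fit(XW, t)
```

Any W- or X-specific preprocessing (e.g., imputing only the W columns by name) would have to happen before the arrays are handed to the estimator, as noted above.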

lauracosgrove commented 3 years ago

Hi, thanks, that's helpful!

  1. I was hoping to do some column-name- or column-type-based transformations, as in the Pipeline code below. The reason I hoped to do this within a pipeline for the nuisance models is that the transformations include imputation for confounder variables, so I wanted to avoid leakage when selecting the nuisance model with grid search. However, after pressing on, I see that the code below with the custom selectors won't work in the nuisance models, since they rely on pandas datatypes and EconML converts to a np array, as you said.

  2. That makes sense. I'll transition to modifying X and W outside of the model fit and save the Pipeline object along with the ForestDRLearner model for consistent imputation.

```python
from sklearn import compose
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler
import numpy as np
import pandas as pd

ColumnTransformer(transformers=[
    ('ccc_transformer', Pipeline(steps=[
        ('num_imputer', SimpleImputer(add_indicator=True, copy=False, strategy='median')),
        ('num_scaler', StandardScaler()),
    ]), compose.make_column_selector(pattern='(^aaa_)|(ccc)')),
    ('numeric_transformer1', Pipeline(steps=[
        ('num_imputer', SimpleImputer(strategy='constant', fill_value=0, copy=False)),
        ('num_scaler', StandardScaler()),
    ]), compose.make_column_selector(dtype_include=np.number)),
    ('numeric_transformer2', Pipeline(steps=[
        ('num_imputer', SimpleImputer(strategy='constant', fill_value=0, copy=False)),
        ('num_log', FunctionTransformer(np.log1p, validate=False)),
        ('num_scaler', StandardScaler()),
    ]), compose.make_column_selector(dtype_include=['float64'])),
    ('categorical_transformer', Pipeline(steps=[
        ('cat_imputer', SimpleImputer(strategy='constant', fill_value='NA', copy=False)),
        ('cat_onehot', OneHotEncoder(sparse=False, handle_unknown='ignore')),
    ]), compose.make_column_selector(dtype_include=pd.CategoricalDtype)),
], n_jobs=1, remainder='passthrough')
```
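Following up on point 2 above: one way to keep imputation consistent between training and scoring is to fit the ColumnTransformer separately, pass its output array to the estimator as X (or W), and persist the fitted transformer alongside the model. A rough sketch of that workflow with a made-up data frame and file name (not from the original thread):

```python
import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Hypothetical frame with a numeric column (with a missing value) and a
# categorical column.
df = pd.DataFrame({
    'aaa_x': [1.0, np.nan, 3.0, 4.0],
    'cat': pd.Categorical(['a', 'b', None, 'a']),
})

# Fit the preprocessing outside the estimator so the nuisance models and the
# final model all see the same imputed view of the features.
pre = ColumnTransformer(transformers=[
    ('num', Pipeline(steps=[('imp', SimpleImputer(strategy='median'))]),
     make_column_selector(dtype_include=np.number)),
], remainder='drop')

X_clean = pre.fit_transform(df)  # a plain ndarray, safe to pass as X or W

# Persist the fitted transformer next to the model so scoring-time data is
# imputed the same way.
joblib.dump(pre, 'preprocessor.joblib')
pre_loaded = joblib.load('preprocessor.joblib')
X_new = pre_loaded.transform(df)
```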