Open lauracosgrove opened 3 years ago
The first-stage nuisance models receive np.hstack([X, W]) as their input (assuming neither X nor W is None). So you can use a pipeline as a nuisance model, but its input is going to consist of the entries in both X and W; if you want to do W- or X-specific preprocessing, that would need to happen outside of the model. Hope that helps.
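To make the point above concrete, here is a small sketch (the shapes and column indices are hypothetical): because the nuisance input is the positional array np.hstack([X, W]), any per-block preprocessing inside a nuisance pipeline has to use integer column indices rather than pandas column names or dtypes.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Hypothetical shapes: 2 X columns followed by 3 W columns, mimicking
# the np.hstack([X, W]) array the first-stage models receive.
X = np.random.default_rng(0).normal(size=(100, 2))
W = np.random.default_rng(1).normal(size=(100, 3))
XW = np.hstack([X, W])  # shape (100, 5): X columns first, then W

# The array is positional, so W-specific preprocessing must select
# the W block by integer index (columns 2-4 here), not by name.
nuisance_model = Pipeline([
    ('impute_w', ColumnTransformer(
        [('w_imputer', SimpleImputer(strategy='median'), [2, 3, 4])],
        remainder='passthrough')),
    ('reg', LinearRegression()),
])
nuisance_model.fit(XW, XW @ np.ones(5))
```

This only works if you know the column layout of X and W in advance, which is part of why name- or dtype-based selectors won't carry over.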
Hi, thanks, that's helpful!
I was hoping to do some column-name- or column-type-based transformations, as in the Pipeline code below. The reason I hoped to do this within a pipeline for the nuisance models is that the transformations include imputation for confounder variables, so I wanted to avoid leakage when selecting the nuisance model with grid search. However, after pressing on, I see that the code below with the custom selectors won't work in the nuisance models, since they expect pandas data types and EconML converts the inputs to NumPy arrays, as you said.
That makes sense. I'll transition to modifying X and W outside of the model fit and save the Pipeline object along with the DRForest model for consistent imputation.
import numpy as np
import pandas as pd
from sklearn import compose
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

ColumnTransformer(transformers=[
    # median-impute (with a missingness indicator) and scale columns matching the pattern
    ('ccc_transformer', Pipeline(steps=[
        ('num_imputer', SimpleImputer(add_indicator=True, copy=False, strategy='median')),
        ('num_scaler', StandardScaler()),
    ]), compose.make_column_selector(pattern='(^aaa_)|(ccc)')),
    # zero-impute and scale all numeric columns
    ('numeric_transformer1', Pipeline(steps=[
        ('num_imputer', SimpleImputer(strategy='constant', fill_value=0, copy=False)),
        ('num_scaler', StandardScaler()),
    ]), compose.make_column_selector(dtype_include=np.number)),
    # zero-impute, log1p-transform, and scale float columns
    ('numeric_transformer2', Pipeline(steps=[
        ('num_imputer', SimpleImputer(strategy='constant', fill_value=0, copy=False)),
        ('num_log', FunctionTransformer(np.log1p, validate=False)),
        ('num_scaler', StandardScaler()),
    ]), compose.make_column_selector(dtype_include=['float64'])),
    # impute missing categories with the sentinel 'NA' and one-hot encode
    ('categorical_transformer', Pipeline(steps=[
        ('cat_imputer', SimpleImputer(strategy='constant', fill_value='NA', copy=False)),
        ('cat_onehot', OneHotEncoder(sparse=False, handle_unknown='ignore')),
    ]), compose.make_column_selector(dtype_include=pd.CategoricalDtype)),
], n_jobs=1, remainder='passthrough')
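The fit-outside-then-persist approach could look like the following sketch. The column names and the minimal ColumnTransformer here are hypothetical stand-ins; the point is to fit the preprocessor once on training data, transform W (and X) before handing them to the EconML estimator, and save the fitted preprocessor alongside the model so scoring-time imputation matches training.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical minimal stand-in for the larger ColumnTransformer above.
preprocessor = ColumnTransformer(
    [('impute', SimpleImputer(strategy='median'), ['w1', 'w2'])],
    remainder='passthrough')

W_train = pd.DataFrame({'w1': [1.0, np.nan, 3.0], 'w2': [np.nan, 2.0, 4.0]})
W_clean = preprocessor.fit_transform(W_train)  # no NaNs reach the nuisance models

# Persist the fitted preprocessor next to the fitted EconML model
# (e.g. joblib.dump(est, ...)) and load both at scoring time.
path = os.path.join(tempfile.mkdtemp(), 'preprocessor.joblib')
joblib.dump(preprocessor, path)
restored = joblib.load(path)
W_new = restored.transform(pd.DataFrame({'w1': [np.nan], 'w2': [np.nan]}))
```

Loading the saved preprocessor at scoring time reuses the training medians, so new rows are imputed consistently with what the model saw during fit.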
Hi, thank you for all the work you do. I have some conceptual questions, if contributors would like to weigh in:
Is there intuition for why featurizer is not implemented for the ForestDRLearner (but is for the other DRLearners)? Is that because featurizers were introduced to add flexible interactions for the linear methods, and so were thought to be redundant for the more flexible forest methods?
I ask because I want to bring some models into production, so I'm hoping to move some data processing and feature engineering (median imputation, one-hot encoding) into a scikit-learn Pipeline. If transitioning the nuisance models to a Pipeline, would it be at all feasible to pass, e.g., features with null values as W?
Relatedly, is there a best practice for handling imputation in X (the treatment effect modifiers), beyond removing features that require imputation?
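On the featurizer question above, a small illustration of the idea (using plain scikit-learn rather than EconML, so this is only a sketch of the intuition, not a statement about why the API was designed this way): a featurizer such as PolynomialFeatures expands X into a richer basis so that a linear final model can represent interactions, whereas a forest final model can learn such interactions on its own.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# A linear CATE model fit on X alone cannot represent an x1*x2 interaction;
# expanding X with a featurizer adds that term to the basis.
X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
phi = PolynomialFeatures(degree=2, include_bias=False)
X_feat = phi.fit_transform(X)  # columns: x1, x2, x1^2, x1*x2, x2^2
```

A tree-based final stage partitions on the raw x1 and x2 directly, which is consistent with the guess that the featurizer option is redundant there.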