DRtester error either with CausalForestDML or with pandas dataframes (or both?) ?

divyadev-vinted commented 4 months ago

Hi! I am trying to use the DRtester from econML to understand whether there actually is heterogeneity in the CATE that I am estimating. As I understand, the DRtester was built as an equivalent function to Athey's test_calibration in R. I followed the example in the notebook you provided (thanks a lot for that btw). While I am able to download and run the linked notebook example without any problems, I am unable to do the same with my own dataframes and/or model specifications.

I am running a CausalForest model with the following setup:

X_train = pandas dataframe of user features with 36 columns and ~200k rows
Y_train = outcome variable with 1 column and ~200k rows
T_train = binary treatment variable with 1 column and ~200k rows

Similarly, I have 3 validation dataframes, Xval, Yval and Tval, which have the same column structure as the training data but fewer rows, as they make up 20% of the data while the training data makes up 80%.

If I run the DRtester either

- using the same DML model as in the example notebook (that is, exactly the same code from the notebook, changing only the input data to my dataframes rather than numpy arrays), or
- using both my data and the CausalForestDML specification I am working with,

I get the same error (detailed below) in both cases.

Could you please help me understand why? Or how could I fix it?

And thank you so much for all the improvements and additions you've been making - I greatly appreciate it!

I am estimating the following CausalForest model:

est = CausalForestDML(
    criterion='het',
    n_estimators=200,  # 100
    min_samples_leaf=0.05,
    max_depth=5,  # with 4 also the noise disappears
    max_samples=0.5,
    discrete_treatment=True,
    honest=True,
    inference=True,
    cv=5,
    model_t=LogisticRegression(),
    model_y=LassoCV(),
)

est.fit(Y_train, T_train, X=X_train, W=None)

Then I tried to use the following in the DRtester:

dml_tester = DRtester(
    model_regression=LassoCV(),
    model_propensity=LogisticRegression(),
    cate=est
).fit_nuisance(Xval, Tval, Yval, X_test, T_test, Y_test)
res_dml = dml_tester.evaluate_all(X_test, X)
res_dml.summary()

This results in the following error:

KeyError                                  Traceback (most recent call last)
Cell In[167], line 2
      1 # Initialize DRtester and fit/predict nuisance models
----> 2 dml_tester = DRtester(
      3     model_regression=LassoCV(),
      4     model_propensity=LogisticRegression(),
      5     cate=est
      6 ).fit_nuisance(X_test, T_test, Y_test, X, T, Y)

File /tmp/jupyter_python_user_libs_divya.dev_7976ffa9-b5b7-4463-bf90-158accfbcf7c/econml/validate/drtester.py:219, in DRtester.fit_nuisance(self, Xval, Dval, yval, Xtrain, Dtrain, ytrain)
    215 self.fit_on_train = (Xtrain is not None) and (Dtrain is not None) and (ytrain is not None)
    217 if self.fit_on_train:
    218     # Get DR outcomes in training sample
--> 219     reg_preds_train, prop_preds_train = self.fit_nuisance_cv(Xtrain, Dtrain, ytrain)
    220     self.drtrain = calculate_dr_outcomes(Dtrain, ytrain, reg_preds_train, prop_preds_train)
    222 # Get DR outcomes in validation sample

File /tmp/jupyter_python_user_libs_divya.dev_7976ffa9-b5b7-4463-bf90-158accfbcf7c/econml/validate/drtester.py:314, in DRtester.fit_nuisance_cv(self, X, D, y)
    312 for k in range(self.n_treat + 1):
    313     for train, test in splits:
--> 314         model_regression_fitted = self.model_regression.fit(X[train][D[train] == self.treatments[k]],
    315                                                             y[train][D[train] == self.treatments[k]])
    316         reg_preds[test, k] = model_regression_fitted.predict(X[test])
    318 return reg_preds, prop_preds

File /opt/jupyterhub/kernel_venvs/python38/lib64/python3.8/site-packages/pandas/core/frame.py:3767, in DataFrame.__getitem__(self, key)
   3765     if is_iterator(key):
   3766         key = list(key)
-> 3767     indexer = self.columns._get_indexer_strict(key, "columns")[1]
   3769 # take() does not accept boolean indexers
   3770 if getattr(indexer, "dtype", None) == bool:

File /opt/jupyterhub/kernel_venvs/python38/lib64/python3.8/site-packages/pandas/core/indexes/base.py:5877, in Index._get_indexer_strict(self, key, axis_name)
   5874 else:
   5875     keyarr, indexer, new_indexer = self._reindex_non_unique(keyarr)
-> 5877 self._raise_if_missing(keyarr, indexer, axis_name)
   5879 keyarr = self.take(indexer)
   5880 if isinstance(key, Index):
   5881     # GH 42790 - Preserve name from an Index

File /opt/jupyterhub/kernel_venvs/python38/lib64/python3.8/site-packages/pandas/core/indexes/base.py:5938, in Index._raise_if_missing(self, key, indexer, axis_name)
   5936     if use_interval_msg:
   5937         key = list(key)
-> 5938     raise KeyError(f"None of [{key}] are in the [{axis_name}]")
   5940 not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
   5941 raise KeyError(f"{not_found} not in index")

KeyError: "None of [Index([ 0, 2, 3, 4, 5, 6, 7, 8, 9,\n 11,\n ...\n 195088, 195089, 195091, 195092, 195096, 195097, 195099, 195101, 195102,\n 195103],\n dtype='int64', length=156083)] are in the [columns]"

fverac commented 4 months ago

This definitely seems to be a bug.

It seems we are indexing arrays in a way that is compatible with numpy but not compatible with pandas.
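To illustrate the mismatch, here is a minimal sketch that does not use EconML at all. The toy data, the column names, and the `train` index array are made up for illustration; `train` stands in for the row positions a cross-validation splitter would return. Indexing with `X[train]` selects rows on a NumPy array, but on a DataFrame it is treated as a column lookup, which appears to be what produces the KeyError in the traceback above.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): 6 rows, 3 features.
rng = np.random.default_rng(0)
X_arr = rng.normal(size=(6, 3))
X_df = pd.DataFrame(X_arr, columns=["a", "b", "c"])

# Integer row positions, standing in for the output of a CV splitter.
train = np.array([0, 2, 3, 5])

# NumPy: an integer array inside [] selects rows.
rows_from_array = X_arr[train]

# pandas: the same expression looks for columns labelled 0, 2, 3, 5.
try:
    X_df[train]
    df_error = None
except KeyError as exc:
    df_error = exc

# Positional row selection on a DataFrame needs .iloc instead.
rows_from_frame = X_df.iloc[train]
```

With `.iloc`, the DataFrame returns the same rows that plain indexing returns on the array.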

I think for now you may just have to convert your pandas dataframes to NumPy arrays before passing them to DRtester, via the .values attribute, e.g. X_test.values
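A rough sketch of that conversion, using toy data in place of the real dataframes. The `as_numpy` helper and its `flatten` option are hypothetical and not part of EconML; flattening single-column inputs to 1-D is an assumption about what downstream code prefers, and plain `.values` may be enough.

```python
import numpy as np
import pandas as pd

def as_numpy(data, flatten=False):
    """Hypothetical helper: convert a pandas object or array-like to a NumPy array."""
    if isinstance(data, (pd.DataFrame, pd.Series)):
        arr = data.to_numpy()  # equivalent to .values for this purpose
    else:
        arr = np.asarray(data)
    # Optionally collapse a single column of shape (n, 1) to shape (n,).
    return arr.ravel() if flatten else arr

# Toy stand-ins for the dataframes in the post.
X_test = pd.DataFrame({"f1": [0.1, 0.2, 0.3], "f2": [1.0, 0.0, 1.0]})
T_test = pd.Series([0, 1, 0], name="treated")
Y_test = pd.DataFrame({"y": [2.0, 3.5, 1.0]})

X_np = as_numpy(X_test)
T_np = as_numpy(T_test, flatten=True)
Y_np = as_numpy(Y_test, flatten=True)

# The arrays would then replace the dataframes in the call from the post,
# e.g. DRtester(...).fit_nuisance(X_np, T_np, Y_np, ...)
```

The same conversion would apply to the training inputs passed as the remaining arguments of `fit_nuisance`.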

divyadev-vinted commented 4 months ago

This works, thank you!!

py-why / EconML

DRtester error either with CausalForestDML or with pandas dataframes (or both?) ? #852
