py-why / EconML

ALICE (Automated Learning and Intelligence for Causation and Economics) is a Microsoft Research project aimed at applying Artificial Intelligence concepts to economic decision making. One of its goals is to build a toolkit that combines state-of-the-art machine learning techniques with econometrics in order to bring automation to complex causal inference problems. To date, the ALICE Python SDK (econml) implements orthogonal machine learning algorithms such as the double machine learning work of Chernozhukov et al. This toolkit is designed to measure the causal effect of some treatment variable(s) t on an outcome variable y, controlling for a set of features x.
https://www.microsoft.com/en-us/research/project/alice/

Stability of ATE estimates #931

Open ankur-tutlani opened 5 days ago

ankur-tutlani commented 5 days ago

I have a situation where I am getting different ATE estimates with the same input dataset and the same random seed. If I run today, the average ATE is around 1; if I run again after a few hours, it increases to 4 or even more. This is for the same treatment (T1) and control (T0) value combination. What could potentially be wrong here? I have one confounder, one treatment and one outcome column, all continuous. I tried manually passing folds via the cv argument, but the estimates are still not stable. I have also tried passing the input dataset in a fixed order, but again the results are not the same. I observed this with both DoubleML and Kernel DML. What should be changed here to get more stable ATE estimates?

import xgboost as xgb
from econml.dml import KernelDML

model_y = xgb.XGBRegressor(random_state=578, max_depth=3, n_estimators=100)
model_t = xgb.XGBRegressor(random_state=578, max_depth=3, n_estimators=100)

data1=data1.sort_values(['Y','X']).reset_index(drop=True)
def get_folds(data1, n_splits=10):
    # Calculate the size of each fold
    fold_size = len(data1) // n_splits

    # Create a list to store the indices for each fold
    folds = []

    # Generate the folds
    for i in range(n_splits):
        start_index = i * fold_size
        if i == n_splits - 1:  # Last fold takes the remaining data
            end_index = len(data1)
        else:
            end_index = (i + 1) * fold_size
        test_indices = list(range(start_index, end_index))
        train_indices = list(range(0, start_index)) + list(range(end_index, len(data1)))
        folds.append((train_indices, test_indices))

    return folds

class CustomCV:
    def __init__(self, folds):
        self.folds = folds

    def __iter__(self):
        return iter(self.folds)
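Since the cv argument accepts an iterable of (train, test) index pairs, these contiguous, unshuffled folds should match what sklearn's KFold produces with shuffle=False (a sketch; note that KFold spreads any remainder over the first folds rather than giving it all to the last one):

from sklearn.model_selection import KFold

# Deterministic contiguous splits: no shuffling, so the same (sorted)
# dataframe should yield identical folds on every run.
kf = KFold(n_splits=10, shuffle=False)
folds = list(kf.split(data1))  # list of (train_indices, test_indices) pairs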

'common_causes2' contains one continuous variable.

for treatment_var in treatment_vars:
    for i in range(10):
        median_dict=median_dicts[i]
        results = []

        for seed in [256344,196334,256154,190331,783475,206736,114695,414272,468332,147567]:

            np.random.seed(seed)
            data1_shuffled = data1.sample(frac=1, random_state=seed).reset_index(drop=True).copy()
            custom_cv = CustomCV(get_folds(data1_shuffled, n_splits=10))

            dml = KernelDML(model_y=model_y, model_t=model_t, discrete_treatment=False,
                            random_state=seed, cv=custom_cv, mc_iters=10)
            dml.fit(Y=data1_shuffled['Y'], T=data1_shuffled[[treatment_var]],
                    X=data1_shuffled[common_causes2])
            dmlestimate = dml.ate(X=data1_shuffled[common_causes2],
                                  T0=median_dict[treatment_var]['control_value'],
                                  T1=median_dict[treatment_var]['treatment_value'])
            results.append(dmlestimate)
        average_result = np.mean(results)

        foodb=pd.DataFrame({'estimatevalue':average_result},index=[0])
        foodb['treatment_var'] = treatment_var
        foodb['control_value'] = median_dict[treatment_var]['control_value']
        foodb['treatment_value'] = median_dict[treatment_var]['treatment_value']
        db_to_fill=pd.concat([db_to_fill,foodb],ignore_index=True)
        db_to_fill

For DoubleML, I am using the following final model.

from econml.sklearn_extensions.linear_model import StatsModelsRLM
model_final = StatsModelsRLM(fit_intercept=True)
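Presumably this gets plugged in along the lines of the following (a sketch mirroring the KernelDML setup above; the DoubleML variant in econml is the DML class, which takes a model_final argument):

from econml.dml import DML

# Same first-stage models and custom folds as the KernelDML run, with the
# robust linear model as the final stage.
dml = DML(model_y=model_y, model_t=model_t, model_final=model_final,
          discrete_treatment=False, random_state=seed, cv=custom_cv, mc_iters=10)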

The average value differs a lot across runs, although there is not much variation within "results" itself. E.g., sometimes I get "results" in the range of 1 to 3; other times it increases to 8 to 9, which drives the "average_result" value to differ significantly between runs. This is for the same treatment (T1) and control (T0) value combination: at one point with T0=20 and T1=25 the average value was 2, while a few hours later, with the same T0 and T1 values of 20 and 25 respectively, it was 10. I am running this on a Databricks cluster. Is there anything wrong in the arguments specified above?
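One sanity check (a sketch reusing the names from the loop above; hash_pandas_object is a pandas utility) is to fingerprint the input and fit twice within the same session; with a fixed seed and fixed folds, the two ATEs should agree exactly if the pipeline is deterministic:

import pandas as pd

# Fingerprint the input so that runs hours apart can be compared: if this
# hash differs between runs, the data (or its row order) changed.
print(pd.util.hash_pandas_object(data1_shuffled).sum())

# Fit twice back to back; with a fixed seed and fixed folds the two
# printed ATEs should be identical.
for _ in range(2):
    est = KernelDML(model_y=model_y, model_t=model_t, discrete_treatment=False,
                    random_state=seed, cv=custom_cv, mc_iters=10)
    est.fit(Y=data1_shuffled['Y'], T=data1_shuffled[[treatment_var]],
            X=data1_shuffled[common_causes2])
    print(est.ate(X=data1_shuffled[common_causes2],
                  T0=median_dict[treatment_var]['control_value'],
                  T1=median_dict[treatment_var]['treatment_value']))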

econml-0.15.1

kbattocchi commented 4 days ago

I don't immediately see anything wrong; I'm not super familiar with Databricks, but I wonder if perhaps they don't guarantee that rows are returned in the same order, or if it's possible that additional rows are added over time?

ankur-tutlani commented 4 days ago

Thanks for the response. I sorted the dataframe to ensure the order remains consistent before running DML or KernelDML.

data1=data1.sort_values(['Y','X']).reset_index(drop=True)

I am not sure what you mean by additional rows being added over time; can you clarify? data1 is a pandas DataFrame.

One thing I observed is that nuisance_scores_y and nuisance_scores_t are not consistent. That is, if I run 10-fold cross-validation along with 10 mc_iters, I expect the output (nuisance_scores_y and nuisance_scores_t) to be a list of length 10, but sometimes it is 10 and sometimes less than 10, like 3 or 7. Also, the values in the list elements (nuisance_scores_y and nuisance_scores_t) vary significantly across different runs for the same seed and the same treatment and control combination. What can explain this behavior?

kbattocchi commented 2 hours ago

That behavior is very strange: the nuisance scores should always be a nested list, where the length of the outer list is mc_iters and the length of each element is the number of folds, and the logic that creates those lists is straightforward (and covered by our tests).
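For example, with the settings above I'd expect checks like these to pass after fitting (a sketch, assuming est is the fitted KernelDML instance and that both nuisance models expose a score method, which XGBRegressor does):

# Expected shape of the nuisance scores after est.fit(...):
# one outer entry per Monte Carlo iteration, one inner entry per CV fold.
assert len(est.nuisance_scores_y) == 10                            # mc_iters
assert all(len(scores) == 10 for scores in est.nuisance_scores_y)  # folds
assert len(est.nuisance_scores_t) == 10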