uber / causalml

Uplift modeling and causal inference with machine learning algorithms

Has anyone done HPO? #429

Open · mrpega opened this issue 2 years ago

mrpega commented 2 years ago

I have been looking at the code and thinking of using sklearn's RandomizedSearchCV, but it looks like the fit function doesn't allow additional parameters to be passed in (https://github.com/uber/causalml/blob/84ac51953fa892719a43b15cd9ca1735b13dd114/causalml/inference/meta/tlearner.py#L64).

Has anyone tried to do HPO? Are there any examples around?

Thanks guys~

jeongyoonlee commented 2 years ago

Hi @mrpega, we cannot use sklearn's RandomizedSearchCV or GridSearchCV because, as you said, they expect an sklearn estimator whose fit takes only X and y as input arguments. It's still possible to use other, more flexible HPO libraries such as Optuna or Hyperopt. We don't have an example yet. I think it'd be a good addition to the package, and I will discuss a plan to add it with the dev team.

Also, we'd love to have your contribution if you're interested as well!
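
(For illustration only, since there is no official example yet: below is a rough sketch of the Optuna approach. It tunes the base learner of a T-learner and scores each trial by AUUC on a held-out split. The variables X, treatment, and y, the 'control' label, the RandomForestClassifier base learner, and the search ranges are all assumptions made for this sketch, and a single treatment arm is assumed.)

import numpy as np
import pandas as pd
import optuna
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from causalml.inference.meta import BaseTClassifier
from causalml.metrics import auuc_score

# Assumed to exist: X (features), treatment (labels, 'control' for the control
# group), and a binary outcome y; a single treatment arm is assumed as well.
X_tr, X_val, w_tr, w_val, y_tr, y_val = train_test_split(
    X, treatment, y, test_size=0.3, stratify=treatment, random_state=42)

def objective(trial):
    learner = RandomForestClassifier(
        n_estimators=trial.suggest_int('n_estimators', 100, 1000, step=100),
        max_depth=trial.suggest_int('max_depth', 3, 15),
        min_samples_leaf=trial.suggest_int('min_samples_leaf', 10, 200),
        n_jobs=-1,
    )
    model = BaseTClassifier(learner=learner, control_name='control')
    model.fit(X=X_tr, treatment=w_tr, y=y_tr)
    tau_hat = model.predict(X_val).flatten()  # (n,) CATE for the single treatment

    df_eval = pd.DataFrame({
        'y': np.asarray(y_val),
        'w': (np.asarray(w_val) != 'control').astype(int),
        'tau_hat': tau_hat,
    })
    # auuc_score returns a Series indexed by the prediction column (plus 'Random')
    return auuc_score(df_eval, outcome_col='y', treatment_col='w',
                      normalize=True)['tau_hat']

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)
print(study.best_params)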

paullo0106 commented 2 years ago

@jeongyoonlee I think this may be relevant to #413. I found that EconML supports hyperparameter tuning methods such as GridSearchCV (there is an example in its usage FAQs) as well as additional kwargs for fit() in some of its functions and wrapper classes. We can discuss more, and I might be able to help with it :)

jeongyoonlee commented 2 years ago

Thanks @paullo0106 for the pointer. Yes, that could be one way to do it: using RandomizedSearchCV for each of the base learners, which are sklearn estimators. I was thinking of using Optuna or Hyperopt to optimize a CausalML estimator directly. Let's continue the discussion this Friday.
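
(A rough sketch of that first option, assuming a continuous outcome y, a treatment array with 'control' as the control label, and a feature matrix X, none of which come from this thread: because the base learners are ordinary sklearn estimators, a RandomizedSearchCV object can itself be passed in as the learner, so every internal fit(X, y) call runs its own search.)

from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
from causalml.inference.meta import BaseTRegressor

# The RandomForestRegressor base learner and the search space below are
# placeholders for the sketch, not a recommendation.
param_distributions = {
    'n_estimators': randint(100, 1000),
    'max_depth': randint(3, 15),
    'min_samples_leaf': randint(10, 200),
}
base_learner = RandomizedSearchCV(
    RandomForestRegressor(n_jobs=-1),
    param_distributions=param_distributions,
    n_iter=20,
    cv=3,
    scoring='neg_mean_squared_error',
    random_state=42,
)

# The search object behaves like a regular sklearn estimator, so the
# meta-learner's internal fit(X, y) calls trigger the hyperparameter search
# separately for the control and treatment outcome models.
t_learner = BaseTRegressor(learner=base_learner, control_name='control')
t_learner.fit(X=X, treatment=treatment, y=y)
cate = t_learner.predict(X)

Note that this tunes the base learners for outcome prediction rather than for an uplift metric such as AUUC, which is one reason optimizing the whole CausalML estimator directly with Optuna or Hyperopt can still be attractive.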

vferraz commented 1 year ago

If anyone is looking into this, I am trying to use Optuna combined with cross-validation, but AUUC might be difficult to optimize. In any case, here is my code example:

import numpy as np
import pandas as pd
import optuna
from sklearn.model_selection import StratifiedKFold
from causalml.inference.tree import UpliftRandomForestClassifier
from causalml.metrics import auuc_score

# df (a DataFrame with the covariates plus the 'treat_string' and 'delegation_bin'
# columns) and covariates (the list of feature column names) are defined elsewhere.

def objective(trial):
    n_estimators = trial.suggest_int('n_estimators', 30, 500, step=5)
    # evaluation_function = trial.suggest_categorical('evaluation_function', ["KL", "ED"])
    max_depth = trial.suggest_int('max_depth', 1, 30, step=1)
    max_features = trial.suggest_int('max_features', 2, len(covariates), step=1)
    min_samples_leaf = trial.suggest_int('min_samples_leaf', 75, 500, step=10)
    min_samples_treatment = trial.suggest_int('min_samples_treatment', 50, 500, step=5)
    n_reg = trial.suggest_int('n_reg', 1, 20, step=1)
    # normalization = trial.suggest_categorical('normalization', [True, False])

    uplift_model = UpliftRandomForestClassifier(n_estimators=n_estimators,
                                                evaluationFunction='KL',
                                                max_features=max_features,
                                                max_depth=max_depth,
                                                min_samples_leaf=min_samples_leaf,
                                                min_samples_treatment=min_samples_treatment,
                                                n_reg=n_reg,
                                                normalization=True,
                                                control_name='Baseline',
                                                n_jobs=-1)

    # Cross-validate the AUUC: fit on each training fold, score on the held-out fold.
    skf = StratifiedKFold(n_splits=7)
    auucs = []
    for train_index, test_index in skf.split(df[covariates].values, df['delegation_bin'].values):
        df_train = df.iloc[train_index]
        df_test = df.iloc[test_index]

        uplift_model.fit(X=df_train[covariates].values,
                         treatment=df_train['treat_string'].values,
                         y=df_train['delegation_bin'].values)

        # Predicted uplift of each treatment arm vs. the control ('Baseline')
        y_pred = uplift_model.predict(df_test[covariates].values)
        result = pd.DataFrame(y_pred, columns=uplift_model.classes_[1:])

        # Recommend the control when every predicted uplift is negative, otherwise
        # the arm with the largest predicted uplift ('Baseline' matches control_name).
        best_treatment = np.where((result < 0).all(axis=1),
                                  'Baseline',
                                  result.idxmax(axis=1))

        # Keep only units whose actual assignment is either the recommended arm
        # or the control, i.e. the "synthetic" evaluation population.
        actual_is_best = np.where(df_test['treat_string'] == best_treatment, 1, 0)
        actual_is_control = np.where(df_test['treat_string'] == 'Baseline', 1, 0)

        synthetic = (actual_is_best == 1) | (actual_is_control == 1)
        synth = result[synthetic]

        auuc_metrics = (synth.assign(is_treated=1 - actual_is_control[synthetic],
                                     delegation_bin=df_test.loc[synthetic, 'delegation_bin'].values,
                                     uplift_tree=synth.max(axis=1))
                             .drop(columns=list(uplift_model.classes_[1:])))

        auuc = auuc_score(auuc_metrics, outcome_col='delegation_bin', treatment_col='is_treated')
        auucs.append(auuc['uplift_tree'])

    return np.mean(auucs)

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100, n_jobs=-1)
print(study.best_params)

I am using this "synthetic" control group approach, as suggested in some of the examples here. In any case, suggestions or comments are appreciated.