h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

Guidance on How Target Encoding Should Be Applied with AutoML #8150

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

Notes from our discussion:

Due to the time-constrained nature of AutoML (meaning that our goal is to get the best model within a fixed time budget rather than with unlimited time), the current best way to use Target Encoding with AutoML is:

1. Split the data into four parts using split frame (70/10/10/10): `train`, `valid`, `blend`, `test`.

2. Train a TE model on `train`.

3. Apply the TE model to `train`, `valid`, `blend`, and `test` to get the extended frames `train_te`, `valid_te`, `blend_te`, `test_te`.

4. Run vanilla AutoML with `training_frame = train`, `validation_frame = valid`, `blending_frame = blend`, and `leaderboard_frame = test`. Also make sure to set `nfolds = 0` to turn off CV. Look at the leaderboard metrics.

5. Run TE AutoML with `training_frame = train_te`, `validation_frame = valid_te`, `blending_frame = blend_te`, and `leaderboard_frame = test_te`. Again set `nfolds = 0` to turn off CV. Look at the leaderboard metrics.

6. Compare the leaderboard metrics of the vanilla and TE AutoML runs, and hopefully the latter shows better metrics. A Python sketch of these steps follows.
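A minimal Python sketch of the steps above, assuming the current h2o Python API (`H2OTargetEncoderEstimator`, `H2OAutoML`) and a hypothetical dataset with a categorical column `cat_col` and response `target`; the file path, column names, and runtime budget are placeholders:

```python
import h2o
from h2o.automl import H2OAutoML
from h2o.estimators import H2OTargetEncoderEstimator

h2o.init()
df = h2o.import_file("data.csv")            # hypothetical dataset
df["target"] = df["target"].asfactor()

# 1. Split into train / valid / blend / test (70/10/10/10)
train, valid, blend, test = df.split_frame(ratios=[0.7, 0.1, 0.1], seed=42)

# 2. Fit a target-encoding model on train only
te = H2OTargetEncoderEstimator(blending=True)
te.train(x=["cat_col"], y="target", training_frame=train)

# 3. Apply the TE model to all four frames to get the extended frames
train_te = te.transform(frame=train)
valid_te = te.transform(frame=valid)
blend_te = te.transform(frame=blend)
test_te  = te.transform(frame=test)

# 4. Vanilla AutoML: nfolds=0 turns off CV so the explicit frames are used
aml = H2OAutoML(max_runtime_secs=600, nfolds=0, seed=42)
aml.train(y="target", training_frame=train, validation_frame=valid,
          blending_frame=blend, leaderboard_frame=test)

# 5. AutoML on the target-encoded frames
aml_te = H2OAutoML(max_runtime_secs=600, nfolds=0, seed=42)
aml_te.train(y="target", training_frame=train_te, validation_frame=valid_te,
             blending_frame=blend_te, leaderboard_frame=test_te)

# 6. Compare leaderboard metrics of the two runs
print(aml.leaderboard)
print(aml_te.leaderboard)
```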

If time and compute resources are not an issue, the safest way to do Target Encoding is within nested CV (not currently supported in AutoML):

  1. Start your normal CV loop (let's say 4 folds): split your data into train and cv.
  2. Suppose you have a categorical feature in your data. Now run a second cross-validation on the train part only; let's say this is also 4 folds.
  3. For the first inner fold we have train.train1 and train.cv1. We compute averages of the target based on train.train1 and apply them to train.cv1 for the categorical feature.
  4. We do this 4 times until we have scored ALL train.cvx parts. That way we have generated a variable that uses the mean of the target to transform the whole train dataset.
  5. Now we can use the train set (from the 1st fold) to transform the cv for fold 1.
  6. Now we move to fold 2, and so forth. A sketch of this scheme follows below.
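For illustration only, here is a sketch of this nested-CV scheme in plain pandas + scikit-learn (not H2O, which does not support this inside AutoML today); the column names `cat_col` and `target`, the fold counts, and the global-mean fallback for unseen categories are assumptions, not H2O's implementation:

```python
import pandas as pd
from sklearn.model_selection import KFold

def nested_cv_target_encode(df, cat_col, target, n_outer=4, n_inner=4, seed=42):
    """Yield (train_fold, cv_fold) pairs with a leakage-free TE column added."""
    te_col = f"{cat_col}_te"
    outer = KFold(n_splits=n_outer, shuffle=True, random_state=seed)
    for tr_idx, cv_idx in outer.split(df):                    # step 1: outer split
        train = df.iloc[tr_idx].copy()
        cv = df.iloc[cv_idx].copy()

        # Steps 2-4: encode the outer-train part via an inner CV, so that each
        # row's encoding never uses its own target value.
        train[te_col] = float("nan")
        inner = KFold(n_splits=n_inner, shuffle=True, random_state=seed)
        for in_tr_idx, in_cv_idx in inner.split(train):
            means = train.iloc[in_tr_idx].groupby(cat_col)[target].mean()
            enc = train.iloc[in_cv_idx][cat_col].map(means)
            train.iloc[in_cv_idx, train.columns.get_loc(te_col)] = enc.values
        # Fallback to the outer-train global mean for categories unseen in an inner fold
        train[te_col] = train[te_col].fillna(train[target].mean())

        # Step 5: encode the outer cv fold using the *whole* outer-train part
        full_means = train.groupby(cat_col)[target].mean()
        cv[te_col] = cv[cat_col].map(full_means).fillna(train[target].mean())

        yield train, cv                                        # step 6: next outer fold
```

Each yielded `(train, cv)` pair then carries an encoded column built without target leakage for that outer fold, ready for model training and evaluation.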
exalate-issue-sync[bot] commented 1 year ago

Megan Kurka commented: [~accountid:557058:afd6e9a4-1891-4845-98ea-b5d34a2bc42c] [~accountid:5b153fb1b0d76456f36daced]

h2o-ops commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-7488
Assignee: Sebastien Poirier
Reporter: Megan Kurka
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A