h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

Guidance on How Target Encoding Should Be Applied with AutoML #8150

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

Notes from our discussion:

Due to the time-constrained nature of AutoML (meaning that our goal is to get the best model within a fixed time budget rather than with unlimited time), the current best way to use Target Encoding with AutoML is:

1. Split the data into four parts using split frame (70/10/10/10): `train`, `valid`, `blend`, `test`.

2. Train a TE model on `train`.

3. Apply the TE model to `train`, `valid`, `blend`, and `test` to get the extended frames `train_te`, `valid_te`, `blend_te`, `test_te`.

4. Run vanilla AutoML with `training_frame = train`, `validation_frame = valid`, `blending_frame = blend`, and `leaderboard_frame = test`. Also make sure to set `nfolds = 0` to turn off CV. Look at the leaderboard metrics.

5. Run TE AutoML with `training_frame = train_te`, `validation_frame = valid_te`, `blending_frame = blend_te`, and `leaderboard_frame = test_te`. Again set `nfolds = 0` to turn off CV. Look at the leaderboard metrics.

6. Compare the leaderboard metrics of the vanilla and TE AutoML runs, and hopefully the latter shows better metrics. A Python sketch of these steps follows.
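A minimal Python sketch of the steps above, assuming the current h2o Python API (`H2OTargetEncoderEstimator`, `H2OAutoML`) and a hypothetical dataset with a categorical column `cat_col` and response `target`; the file path, column names, and runtime budget are placeholders:

```python
import h2o
from h2o.automl import H2OAutoML
from h2o.estimators import H2OTargetEncoderEstimator

h2o.init()
df = h2o.import_file("data.csv")            # hypothetical dataset
df["target"] = df["target"].asfactor()

# 1. Split into train / valid / blend / test (70/10/10/10)
train, valid, blend, test = df.split_frame(ratios=[0.7, 0.1, 0.1], seed=42)

# 2. Fit a target-encoding model on train only
te = H2OTargetEncoderEstimator(blending=True)
te.train(x=["cat_col"], y="target", training_frame=train)

# 3. Apply the TE model to all four frames to get the extended frames
train_te = te.transform(frame=train)
valid_te = te.transform(frame=valid)
blend_te = te.transform(frame=blend)
test_te  = te.transform(frame=test)

# 4. Vanilla AutoML: nfolds=0 turns off CV so the explicit frames are used
aml = H2OAutoML(max_runtime_secs=600, nfolds=0, seed=42)
aml.train(y="target", training_frame=train, validation_frame=valid,
          blending_frame=blend, leaderboard_frame=test)

# 5. AutoML on the target-encoded frames
aml_te = H2OAutoML(max_runtime_secs=600, nfolds=0, seed=42)
aml_te.train(y="target", training_frame=train_te, validation_frame=valid_te,
             blending_frame=blend_te, leaderboard_frame=test_te)

# 6. Compare leaderboard metrics of the two runs
print(aml.leaderboard)
print(aml_te.leaderboard)
```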

If time and compute resources are not an issue, the safest way to do Target Encoding is within nested CV (not currently supported in AutoML):

  1. Start your normal CV loop (let's say 4 folds): split your data into train and cv.
  2. Suppose you have a categorical feature in your data. Now run a second cross-validation on the train part only; let's say this is also 4 folds.
  3. For the first inner fold we have train.train1 and train.cv1. We compute averages of the target based on train.train1 and apply them to train.cv1 for the categorical feature.
  4. We do this 4 times until we have scored ALL train.cvx parts. That way we have generated a variable that uses the mean of the target to transform the whole train dataset.
  5. Now we can use the train set (from the 1st fold) to transform the cv for fold 1.
  6. Now we move to fold 2, and so forth. A sketch of this scheme follows below.
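For illustration only, here is a sketch of this nested-CV scheme in plain pandas + scikit-learn (not H2O, which does not support this inside AutoML today); the column names `cat_col` and `target`, the fold counts, and the global-mean fallback for unseen categories are assumptions, not H2O's implementation:

```python
import pandas as pd
from sklearn.model_selection import KFold

def nested_cv_target_encode(df, cat_col, target, n_outer=4, n_inner=4, seed=42):
    """Yield (train_fold, cv_fold) pairs with a leakage-free TE column added."""
    te_col = f"{cat_col}_te"
    outer = KFold(n_splits=n_outer, shuffle=True, random_state=seed)
    for tr_idx, cv_idx in outer.split(df):                    # step 1: outer split
        train = df.iloc[tr_idx].copy()
        cv = df.iloc[cv_idx].copy()

        # Steps 2-4: encode the outer-train part via an inner CV, so that each
        # row's encoding never uses its own target value.
        train[te_col] = float("nan")
        inner = KFold(n_splits=n_inner, shuffle=True, random_state=seed)
        for in_tr_idx, in_cv_idx in inner.split(train):
            means = train.iloc[in_tr_idx].groupby(cat_col)[target].mean()
            enc = train.iloc[in_cv_idx][cat_col].map(means)
            train.iloc[in_cv_idx, train.columns.get_loc(te_col)] = enc.values
        # Fallback to the outer-train global mean for categories unseen in an inner fold
        train[te_col] = train[te_col].fillna(train[target].mean())

        # Step 5: encode the outer cv fold using the *whole* outer-train part
        full_means = train.groupby(cat_col)[target].mean()
        cv[te_col] = cv[cat_col].map(full_means).fillna(train[target].mean())

        yield train, cv                                        # step 6: next outer fold
```

Each yielded `(train, cv)` pair then carries an encoded column built without target leakage for that outer fold, ready for model training and evaluation.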
exalate-issue-sync[bot] commented 1 year ago

Megan Kurka commented: [~accountid:557058:afd6e9a4-1891-4845-98ea-b5d34a2bc42c] [~accountid:5b153fb1b0d76456f36daced]

h2o-ops commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-7488
Assignee: Sebastien Poirier
Reporter: Megan Kurka
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A