Hello @bradleyboehmke and @bgreenwell. I'm reading through this book and it is fantastic!
I have a small question about Chapter 3, where the book introduces a feature engineering workflow using the recipes package and emphasizes that we should create a preprocessing blueprint but apply it later, within each resample. This workflow can be embedded directly in caret, as mentioned in section 3.8.3:
Consequently, the goal is to develop our blueprint, then within each resample iteration we want to apply prep() and bake() to our resample training and validation data. Luckily, the caret package simplifies this process. We only need to specify the blueprint and caret will automatically prepare and bake within each resample.
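For concreteness, this is roughly how I understand the caret workflow described there; the knn model, 5-fold CV, and the simplified one-step blueprint below are just my own illustration, not code from the book:

library(caret)
library(recipes)
library(rsample)
library(AmesHousing)

# Training split roughly as in the book's prerequisites (simplified, no stratification)
set.seed(123)
ames_split <- initial_split(make_ames(), prop = 0.7)
ames_train <- training(ames_split)

# A simplified one-step blueprint (Chapter 3 uses more steps)
blueprint <- recipe(Sale_Price ~ ., data = ames_train) %>%
  step_other(all_nominal(), threshold = 0.005)

# Hand the *unprepped* blueprint to train(); caret then prep()s and bake()s it
# inside every CV fold instead of once up front
cv <- trainControl(method = "cv", number = 5)
knn_fit <- train(
  blueprint,
  data = ames_train,
  method = "knn",
  trControl = cv
)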
My question is whether this principle is also implemented by other machine learning packages such as h2o, because in Chapter 15 (Stacked Models), section 15.1 Prerequisites, the training and test sets are prepared before running the h2o training process, not within each resample:
# Make sure we have consistent categorical levels
blueprint <- recipe(Sale_Price ~ ., data = ames_train) %>%
  step_other(all_nominal(), threshold = 0.005)

# Create training & test sets for h2o
train_h2o <- prep(blueprint, training = ames_train, retain = TRUE) %>%
  juice() %>%
  as.h2o()

test_h2o <- prep(blueprint, training = ames_train) %>%
  bake(new_data = ames_test) %>%
  as.h2o()
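For contrast, this is roughly what I understand "apply the blueprint within each resample" to look like when done by hand; the rsample loop below is just my own illustration (reusing blueprint and ames_train from above), not code from the book:

library(rsample)

# My illustration only: estimate the blueprint on each fold's analysis set,
# then bake both the analysis and assessment sets with those estimates
set.seed(123)
folds <- vfold_cv(ames_train, v = 5)

fold_data <- lapply(folds$splits, function(split) {
  fold_train <- analysis(split)      # training portion of this fold
  fold_valid <- assessment(split)    # validation portion of this fold

  prepped     <- prep(blueprint, training = fold_train)
  baked_train <- bake(prepped, new_data = fold_train)
  baked_valid <- bake(prepped, new_data = fold_valid)

  # ...fit a model on baked_train and evaluate on baked_valid...
  list(train = baked_train, valid = baked_valid)
})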
I was wondering: does this violate the principle that we should do the preprocessing within each resample? Do packages other than caret seldom implement this principle? Also, Chapter 3 introduces many feature engineering steps. Does h2o's AutoML handle these steps automatically? Your kind guidance would be much appreciated!