tidymodels / recipes

Pipeable steps for feature engineering and data preprocessing to prepare for modeling
https://recipes.tidymodels.org

tidyups: input / output information #1158

Open EmilHvitfeldt opened 1 year ago

EmilHvitfeldt commented 1 year ago

Tidyup: variable input/output information in {recipes}

Champion: Emil

Co-Champion: Max

Status: Draft

Abstract

The recipes package provides a pipeable and flexible way of processing data. Each operation is done sequentially, using tidyselect and recipes-specific selectors such as all_numeric_predictors() and all_outcomes().

We want a robust record of which variables are passed into each step, which are passed out, and how they relate. Having this information will be valuable for the user, and it would allow us to determine the minimum set of required inputs, calculate feature importance, generate graphs, and more.

Motivation

It is not known beforehand which variables are used within each step. Take the recipe below:

library(recipes)

data(ames, package = "modeldata")

rec <- recipe(~., data = ames) |>
  step_dummy(all_nominal_predictors()) |>
  step_nzv(all_predictors()) |> # remove Near Zero Variance columns
  step_pca(all_predictors())

We would have to run the code to determine which variables are selected by step_pca(). Beyond the variables in ames, some are created by step_dummy() and some are removed by step_nzv(), including ones created by step_dummy().
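For reference, this is roughly how the selections can be inspected today, after the fact. It is a minimal sketch that assumes the recipes and modeldata packages are installed and relies only on prep() and tidy(); the object names (prepped, removed, pca_inputs) are made up for illustration.

```r
library(recipes)

data(ames, package = "modeldata")

rec <- recipe(~., data = ames) |>
  step_dummy(all_nominal_predictors()) |>
  step_nzv(all_predictors()) |>
  step_pca(all_predictors())

# Training the recipe is currently the only way to resolve the selectors
prepped <- prep(rec)

# Columns flagged for removal by step_nzv() (the 2nd step)
removed <- tidy(prepped, number = 2)$terms

# Columns that were fed into step_pca() (the 3rd step)
pca_inputs <- unique(tidy(prepped, number = 3)$terms)
```

Note that the pieces have to be collected step by step and stitched together by hand, which is the gap this tidyup is about.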

Knowing the input and output of each step, such as step_dummy() and step_nzv(), lets the user know which variables are created and which are removed.

Using the same recipe, we are given the variables "PC1", "PC2", "PC3", "PC4", "PC5", which are not that useful for the end user if they are interested in variable importance. If we had the input/output information, we would be able to work backward, seeing which dummy variables were created and how much each of them contributes to each component.

Lastly, depending on which variables were removed by step_nzv(), we might be able to deduce that some variables aren't necessary as input, as they are fully removed.
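To make the "minimum set of required input" idea concrete, here is a small base R sketch. Everything in it is hypothetical: the toy columns a, b, c, the edge table, and the helper trace_back() are illustrations only, not anything recipes provides.

```r
# Hypothetical input/output edges for a toy three-step recipe.
# Each row says: variable `from` was used to create variable `to`.
# a_y (a dummy of a) and c are dropped along the way, so nothing points
# from them to a final column.
edges <- data.frame(
  from = c("a",   "a",   "a_x", "b",   "a_x", "b"),
  to   = c("a_x", "a_y", "PC1", "PC1", "PC2", "PC2")
)
original <- c("a", "b", "c")

# Walk the edges backward from `targets` until no new variables appear
trace_back <- function(edges, targets) {
  seen <- character()
  frontier <- targets
  while (length(frontier) > 0) {
    seen <- union(seen, frontier)
    parents <- edges$from[edges$to %in% frontier]
    frontier <- setdiff(parents, seen)
  }
  seen
}

# Original columns that the final components actually depend on
required <- intersect(original, trace_back(edges, c("PC1", "PC2")))
required
```

In this toy case c never reaches a final column, so it would not be needed as input.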

On the dev side, we will be able to use this information to refactor some of the selecting and name-creating code that happens in many steps.

Solution

Each recipe step is essentially a list of information. Adding this information would be done as another field.
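As a rough illustration of "another field", the sketch below uses a plain list as a stand-in for a step. The field name var_io and the toy column names are assumptions for the sake of the example; real steps carry more fields and a class.

```r
# A plain list standing in for a trained step object (sketch only)
toy_step <- list(
  terms   = "all_nominal_predictors()",
  trained = TRUE,
  id      = "dummy_toy"
)

# Hypothetical new field: one entry per output column, holding its inputs
toy_step$var_io <- list(
  a_x = "a",
  a_y = "a"
)

# A step created before the field existed just returns NULL for it
old_step <- list(terms = "all_predictors()", trained = TRUE, id = "nzv_toy")
has_io <- !is.null(old_step$var_io)
```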

Implementation

This is where we need help!

I'm thinking that this information could be represented as a list of character vectors or as a sparse matrix. Keep in mind that you will need one for each step and that we will want to "combine" these to get an inference of what happens to each variable.
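One possible way to picture both representations and the "combine" operation, as a base R sketch under stated assumptions: the helper io_matrix(), the toy columns, and the per-step lists are all hypothetical. Dense matrices are used to keep it dependency-free; a sparse matrix class (for example from the Matrix package) could presumably be swapped in.

```r
# Build a 0/1 matrix with inputs as rows and outputs as columns from a
# named list of character vectors (names = outputs, values = inputs).
io_matrix <- function(io, inputs) {
  m <- matrix(0L, nrow = length(inputs), ncol = length(io),
              dimnames = list(inputs, names(io)))
  for (out in names(io)) {
    m[io[[out]], out] <- 1L
  }
  m
}

# Hypothetical per-step information for a toy recipe on columns a, b, c
dummy_io <- list(a_x = "a", a_y = "a", b = "b", c = "c")
nzv_io   <- list(a_x = "a_x", b = "b")  # a_y and c are dropped
pca_io   <- list(PC1 = c("a_x", "b"), PC2 = c("a_x", "b"))

m1 <- io_matrix(dummy_io, c("a", "b", "c"))
m2 <- io_matrix(nzv_io, colnames(m1))
m3 <- io_matrix(pca_io, colnames(m2))

# Chaining the steps is a matrix product; a nonzero entry means the
# original column (row) feeds into the final column
combined <- (m1 %*% m2 %*% m3) > 0
combined
```

Each step's outputs become the next step's inputs, so the column names of one matrix line up with the row names of the next.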

These are tasks I'm not very well versed in, and I wouldn't be surprised if there was an igraph function that would do what we need with ease.

Backwards compatibility

Should be trivial, as we are "just" adding another field for each step.

Types of steps

In the definitions below, "many" means any non-negative number of variables.

One to one

One to many

Many to many

Removing

Adding

None to none

All of the above
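A sketch of what these types could look like as input/output pairs, using made-up column names. From the recipe in the motivation, step_dummy() would be one to many, step_pca() many to many, and step_nzv() removing; the structure itself (a list with inputs and outputs) is only an assumption for illustration. A step of the "all of the above" kind would mix several of these patterns.

```r
# Each entry lists the columns a step consumes and the columns it emits
step_types <- list(
  one_to_one   = list(inputs = "x",         outputs = "x"),
  one_to_many  = list(inputs = "a",         outputs = c("a_x", "a_y")),
  many_to_many = list(inputs = c("x", "y"), outputs = c("PC1", "PC2")),
  removing     = list(inputs = "z",         outputs = character()),
  adding       = list(inputs = character(), outputs = "new"),
  none_to_none = list(inputs = character(), outputs = character())
)

# Summarise how many columns go in and come out for each type
shape <- data.frame(
  type  = names(step_types),
  n_in  = vapply(step_types, function(s) length(s$inputs), integer(1)),
  n_out = vapply(step_types, function(s) length(s$outputs), integer(1)),
  row.names = NULL
)
shape
```

Keeping inputs and outputs as separate vectors is what lets "removing" (something in, nothing out) be told apart from "none to none".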

EmilHvitfeldt commented 1 year ago

This problem can be solved much much nicer with this https://github.com/tidymodels/recipes/pull/1199

EmilHvitfeldt commented 1 year ago

This is related to https://github.com/tidymodels/recipes/issues/1137 as well

EmilHvitfeldt commented 5 months ago

Having ptype information is going to make this issue much nicer https://github.com/tidymodels/recipes/pull/1329