Open EmilHvitfeldt opened 1 year ago
This problem can be solved much much nicer with this https://github.com/tidymodels/recipes/pull/1199
This is related to https://github.com/tidymodels/recipes/issues/1137 as well
Having ptype information is going to make this issue much nicer https://github.com/tidymodels/recipes/pull/1329
Tidyup: variable input/output information in {recipes}
Champion: Emil
Co-Champion: Max
Status: Draft
Abstract
The recipes package provides a pipe-able and flexible way of processing data. Each operation is done sequentially, using tidyselect and recipes-specific selectors such as all_numeric_predictors() and all_outcomes().
We want to have a robust record of which variables are passed into each step, which are passed out, and how they relate. Having this information will be valuable for the user, allowing us to determine the minimum set of required input, calculate feature importance, generate graphs, and more.
Motivation
It is not known beforehand which variables are used within each step. Take a recipe on the ames data that applies step_dummy(), step_nzv(), and step_pca() in that order. We would have to run the code to determine which variables are selected by step_pca(): beyond the variables in ames, some are created by step_dummy() and some are removed by step_nzv(), including those created by step_dummy().
Knowing the input and output of step_nzv() lets the user know which variables were created and which were removed.
Using the same recipe, we are given the variables "PC1", "PC2", "PC3", "PC4", "PC5", which are not that useful for the end user if they are interested in variable importance. If we had the input/output information, we would be able to deduce backward, seeing which dummy variables are created and how much each of those contributes to each component.
Lastly, depending on which variables were removed with step_nzv(), we might be able to deduce that some variables won't be necessary as input, as they are fully removed.
On the dev side, we will be able to use this information to refactor some of the selecting and name-creating code that happens in many steps.
Solution
Each recipe step is essentially a list of information. Adding this information would be done as another field.
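To make the "another field" idea concrete, here is a minimal base R sketch. It uses a toy stand-in for a step, not the real internals of step_dummy(). The field name var_info and the column names are assumptions made up for illustration.

```r
# A toy stand-in for a trained step: a plain list with a class attribute.
# The fields shown are illustrative, not the actual fields recipes stores.
dummy_step <- structure(
  list(terms = "Neighborhood", trained = TRUE, id = "dummy_example"),
  class = c("step_dummy", "step")
)

# Hypothetical new field: for every output column, the input column(s)
# it was built from.
dummy_step$var_info <- list(
  Neighborhood_Edwards = "Neighborhood",
  Neighborhood_Gilbert = "Neighborhood"
)

outputs <- names(dummy_step$var_info)                        # columns produced
inputs <- unique(unlist(dummy_step$var_info, use.names = FALSE))  # columns consumed
```

Because the step is still an ordinary list, code that ignores the extra field keeps working unchanged.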
Implementation
This is where we need help!
I'm thinking that this information could be represented as a list of character vectors or as a sparse matrix. Keep in mind that you will need one for each step and that we will want to "combine" these to get an inference of what happens to each variable.
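As a rough sketch of the "list of character vectors" option, each step could hold a named list mapping every output column to the inputs it came from, and consecutive steps could be folded together. The helper compose_maps() and all column names below are hypothetical. The three maps only imitate what a dummy, nzv, and pca step might record.

```r
# Each map: output column -> input columns. Untouched columns map to
# themselves; columns removed by a step are simply absent from its map.
dummy_map <- list(
  Lot_Area = "Lot_Area",
  Gr_Liv_Area = "Gr_Liv_Area",
  Street_Pave = "Street",
  Alley_Paved = "Alley"
)
nzv_map <- list(
  Lot_Area = "Lot_Area",
  Gr_Liv_Area = "Gr_Liv_Area",
  Alley_Paved = "Alley_Paved"
)
pca_map <- list(
  PC1 = c("Lot_Area", "Gr_Liv_Area", "Alley_Paved"),
  PC2 = c("Lot_Area", "Gr_Liv_Area", "Alley_Paved")
)

# Compose two consecutive steps: replace each input of `later` with the
# inputs that produced it in `earlier`.
compose_maps <- function(earlier, later) {
  lapply(later, function(inputs) {
    unique(unlist(earlier[inputs], use.names = FALSE))
  })
}

# Fold the per-step maps into one map from final columns to original columns.
lineage <- Reduce(compose_maps, list(dummy_map, nzv_map, pca_map))

# Original columns that still matter after all steps have run.
required_inputs <- unique(unlist(lineage, use.names = FALSE))
```

In this toy example Street never reaches a component, because its only dummy column is dropped by the nzv map. So it would not be part of the minimum required input.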
These are topics I'm not very well versed in, and I wouldn't be surprised if there were an igraph function that would do what we need with ease.
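The matrix option can be read as graph reachability: one 0/1 incidence matrix per step, multiplied together, tells us which original columns reach which final columns. The sketch below uses dense base R matrices for brevity (a sparse class would be the practical choice) and does not rely on any igraph function. The helper incidence() and the column names are assumptions for illustration.

```r
# Rows are a step's input columns, columns are its output columns;
# a 1 means "this input feeds this output".
incidence <- function(inputs, outputs, edges) {
  m <- matrix(0, nrow = length(inputs), ncol = length(outputs),
              dimnames = list(inputs, outputs))
  m[edges] <- 1  # two-column character matrix indexes by dimnames
  m
}

dummy <- incidence(
  inputs  = c("Lot_Area", "Street", "Alley"),
  outputs = c("Lot_Area", "Street_Pave", "Alley_Paved"),
  edges   = rbind(c("Lot_Area", "Lot_Area"),
                  c("Street", "Street_Pave"),
                  c("Alley", "Alley_Paved"))
)
nzv <- incidence(
  inputs  = c("Lot_Area", "Street_Pave", "Alley_Paved"),
  outputs = c("Lot_Area", "Alley_Paved"),
  edges   = rbind(c("Lot_Area", "Lot_Area"),
                  c("Alley_Paved", "Alley_Paved"))
)
# Every remaining column contributes to every component.
pca <- matrix(1, nrow = 2, ncol = 2,
              dimnames = list(c("Lot_Area", "Alley_Paved"), c("PC1", "PC2")))

# A positive entry means at least one path from original column to final column.
reach <- (dummy %*% nzv %*% pca) > 0

needed <- rownames(reach)[rowSums(reach) > 0]
droppable <- rownames(reach)[rowSums(reach) == 0]
```

Replacing the 1s in the pca matrix with loadings would carry weights through the same product. That is one possible route to the backward deduction of variable contributions described in the motivation.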
Backwards compatibility
Should be trivial, as we are "just" adding another field for each step.
Types of steps
In the above definition, "many" means non-negative.
One to one
One to many
Many to many
Removing
Adding
None to none
All of the above