Closed ClaytonJY closed 2 years ago
That's a good idea. It would be better to add it on a per-step function but that would also mean big changes to the api.
This would be behind the devel list a bit so I'll leave it open (and hopefully get some discussion) and mark it as not-short-term.
Hi, any plan on this? That's a real blocker for me :(
For example, grouped versions of the mode/median/mean imputation steps (`step_impute_mode()`, `step_impute_median()`, `step_impute_mean()`) would be very useful.
We don't have this on a near-term plan for development work but are interested in hearing about your use case and needs in order to prioritize this idea.
It's somewhat surprising that this issue has been open for 3 years and it's still not on a "near-term" plan. Isn't a group_by operation an obvious feature for a data-processing package?
Adding a third to @ClaytonJY's original 2 ideas for the implementation: a `groupwise()` wrapper whose `...` would expand to individual steps. Thus one gets a nested arrangement of steps:

```r
data %>%
  step_select(x, y, z) %>%
  groupwise(by = "z",
            step_lag(x),
            step_center(y))
```
We don't have this on a near-term plan for development work but are interested in hearing about your use case and needs in order to prioritize this idea.
There are many use cases, most relevant I have now is for example you miss longitude/latitude for some rows of your dataset but you have country info. Imputing per country would make much more sense than globally...
I'll third this issue. An example I'm facing is imputing the finish time of a race which is very track dependent. Imputing with a group mean for the track would make much more sense.
I've recently started trying out the tidymodels packages and was surprised that this was not an option. By the way, is there any advantage to using recipes over just transforming the data in dplyr or something? That is what I wound up doing.
@jestover the advantage to using recipes is protection from data leakage, since operations are done within the resampling loop. If you do those transformations across the entire training set (i.e., outside resampling), this introduces bias into your model's performance estimates.
Thanks @joeycouse! Just to make sure I'm understanding correctly let me ask about a simple example. If I demean a variable using dplyr it will obviously use the sample mean from the entire dataset. If I use recipes and then split my data into a training set and a testing set then recipes will demean the training set using the sample mean from the training set and demean the testing set using the sample mean from the testing set?
@jestover, along those lines but slightly different. Any information from the testing set, including the mean, should never be used as part of the model tuning process. The sample mean estimated from the training set should be applied to the testing set, if you intend to subtract the mean value. The same principle applies to cross-validation: in 10-fold CV, you estimate the mean from the 9 analysis folds and subtract that estimated mean from the values in the single hold-out fold. recipes handles this nuance automatically and protects users from the common pitfalls associated with feature engineering.
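As a minimal sketch of that behavior (using the built-in `mtcars` data with an arbitrary row split; the split itself is just for illustration): `prep()` learns the centering mean from the training rows only, and `bake()` applies that same mean to the test rows.

```r
library(recipes)

# Arbitrary illustrative split
train <- mtcars[1:24, ]
test  <- mtcars[25:32, ]

rec <- recipe(mpg ~ disp, data = train) %>%
  step_center(disp) %>%
  prep(training = train)

# The centering mean was estimated from `train` only; bake() subtracts
# that same mean from the new data rather than re-estimating it
baked_test <- bake(rec, new_data = test)
```

Adding `mean(train$disp)` back to `baked_test$disp` recovers the original test values, confirming the training-set mean was the one applied.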
Ok, that makes a lot of sense. I appreciate the clarification @joeycouse
This would be great for use in a dataset with multiple time series when building a global model of all time series (panel data). Right now you would have to build a custom recipe step to apply groupwise transformations. Please consider working on this feature!
This should've been created a long time ago. Looks like I'll have to create a custom step for my project. Bummer.
I have another use case. Let's say I am trying to predict which vehicles will need repair, using the make and model name. I can concatenate the make and model in my training set and use the repair percentage for that make/model as a numeric predictor (i.e., target encoding): group by make and model and take the mean of the 1s and 0s to get the repair percentage. I will most likely have unseen levels, so I would use this in combination with step_novel. Thanks.
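A rough sketch of that computation by hand with dplyr (the `train` data frame and the `make`, `model`, and `repaired` columns are hypothetical, with `repaired` coded 0/1); note that doing this outside the resampling loop has the leakage problem discussed earlier in this thread:

```r
library(dplyr)

train_encoded <- train %>%
  mutate(make_model = paste(make, model, sep = "_")) %>%
  group_by(make_model) %>%
  mutate(repair_pct = mean(repaired)) %>%  # share of 1s per make/model
  ungroup()
```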
Regarding a global method to make all steps capable of running computations per group: IMO that's not a great design choice, so we won't be doing it.
However, there are a lot of cases where this might make sense. Panel data may be one. For me, I'd love to make a set of chemometrics steps for processing spectra or high-content screening data. It would make sense, in those instances, to be able to group by a set of columns.
There are a few requests in the thread here that can already be accomplished without additional (overall) changes.

For imputing by groups, this can be done using `step_impute_linear()` or `step_impute_bag()` by passing the grouping variable to the `impute_with` argument. That will get you mean imputation by group.
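To make the `impute_with` route concrete, here is a minimal sketch (the data frame `df` and the `latitude`/`country` columns are hypothetical): because the grouping factor is the only predictor of the imputation model, rows in the same group receive (approximately) the per-group mean.

```r
library(recipes)

# Sketch: impute missing latitude values per country, assuming `df`
# has a numeric `latitude` (with NAs) and a `country` factor
rec <- recipe(~ ., data = df) %>%
  step_impute_linear(latitude, impute_with = imp_vars(country)) %>%
  prep()

imputed <- bake(rec, new_data = NULL)
```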
For concatenating data prior to `step_novel()`, this can be done via `step_mutate()`.
Effect encoding can already be done with any of the `step_lencode_*()` functions in the embed package.
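For instance, the vehicle-repair example above could be sketched with embed (the `train` data and the `repaired`/`make_model` columns are hypothetical; `repaired` is a factor outcome):

```r
library(embed)  # provides the step_lencode_*() family of steps

# Sketch: likelihood (effect) encoding of a high-cardinality factor
rec <- recipe(repaired ~ make_model, data = train) %>%
  step_lencode_glm(make_model, outcome = vars(repaired))
```

Because the encoding is a recipe step, the per-level estimates are re-computed inside each resample, avoiding the leakage issue discussed earlier.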
My suggestion: for specific features, start new GH issues with the motivation and a worked, reproducible example of what you want. Responding at the end of an issue that proposed general changes isn't a good way to go about creating new features.
There are going to be times when we decline. That's disappointing but OK; custom steps can be created. textrecipes is a great example of this. We didn't have a structure to work with tokenized data, and the author made one that works really well for the particular problem he is trying to solve. We didn't add this universally since that didn't make sense.
I'm going to close this and hope that more specific issues can be posted with more details. @ClaytonJY it would be great to outline what exactly you would want for panel data in particular.
This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex https://reprex.tidyverse.org) and link to this issue.
Is there a way to apply one or more steps on a per-group basis? Suppose we have panel data (surprise, surprise) and want to apply `step_lag` on each group; is there a trick to getting this to work now?

If not, I've got two ideas:

1. a `group` or `strata` argument on each step that handles grouped application in one spot
2. a `step_group()` or `step_stratify()` which acts like `dplyr::group_by()` in that following operations are applied per group. Would presumably then need `step_ungroup()`/`step_unstratify()` to revert to normal application. Might also be able to create a `group_by.recipe()` method, a `grouped_recipe` class, etc. to avoid new steps and use `group_by` very naturally?

From a UX perspective I'd lean towards the second; steps already have an awful lot of args, and mirroring `dplyr` would make this very easy on users.
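For illustration, the second idea might look like this in use (purely hypothetical; none of these steps exist in recipes, and `panel_df`/`series_id` are made-up names):

```r
# Hypothetical API sketch -- step_group()/step_ungroup() do not exist
recipe(~ ., data = panel_df) %>%
  step_group(series_id) %>%  # analogous to dplyr::group_by()
  step_lag(x) %>%            # lag computed within each series
  step_ungroup()             # subsequent steps apply to the whole data again
```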