tidymodels / recipes

Pipeable steps for feature engineering and data preprocessing to prepare for modeling
https://recipes.tidymodels.org

Helpers for groups/panel data #166

Closed ClaytonJY closed 2 years ago

ClaytonJY commented 6 years ago

Is there a way to apply one or more steps on a per-group basis? Suppose we have panel data (surprise, surprise) and want to apply step_lag to each group. Is there a trick to getting this to work now?

If not, I've got two ideas:

From a UX perspective I'd lean towards the second; steps already have an awful lot of args, and mirroring dplyr would make this very easy on users.
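For concreteness, the per-group lag asked about above could be sketched as follows. This is a plain-Python illustration of the idea, not the recipes API; `grouped_lag` and its arguments are hypothetical names, and the sketch assumes rows are already sorted by time within each group.

```python
def grouped_lag(rows, group_key, value_key, lag=1):
    """Add a lagged copy of value_key, computed separately within each group.

    Assumes rows are ordered by time within each group and lag >= 1.
    """
    history = {}  # group -> values seen so far, in order
    out = []
    for row in rows:
        past = history.setdefault(row[group_key], [])
        # The lagged value comes only from earlier rows of the same group.
        lagged = past[-lag] if len(past) >= lag else None
        out.append({**row, f"lag_{lag}_{value_key}": lagged})
        past.append(row[value_key])
    return out


panel = [
    {"id": "a", "y": 1},
    {"id": "b", "y": 10},
    {"id": "a", "y": 2},
    {"id": "b", "y": 20},
    {"id": "a", "y": 3},
]
lagged = grouped_lag(panel, "id", "y", lag=1)
lag_values = [row["lag_1_y"] for row in lagged]
```

An ungrouped lag over the same rows would carry values from group "b" into group "a", which is exactly what a per-group version avoids.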

topepo commented 6 years ago

That's a good idea. It would be better to add it on a per-step basis, but that would also mean big changes to the API.

This would be behind the devel list a bit so I'll leave it open (and hopefully get some discussion) and mark it as not-short-term.

BenoitLondon commented 3 years ago

Hi, any plans for this? It's a real blocker for me :(

For example, step_mode/median/meanImpute would be very useful with a grouped version.

juliasilge commented 3 years ago

We don't have this on a near-term plan for development work but are interested in hearing about your use case and needs in order to prioritize this idea.

vspinu commented 3 years ago

It's somewhat surprising that this issue has been open for 3 years and it's still not on a "near-term" plan. Isn't a group_by operation an obvious feature for a data processing package?

Adding to @ClaytonJY's original two ideas for the implementation:

BenoitLondon commented 3 years ago

> We don't have this on a near-term plan for development work but are interested in hearing about your use case and needs in order to prioritize this idea.

There are many use cases. The most relevant one I have right now: longitude/latitude is missing for some rows of the dataset, but country info is available. Imputing per country would make much more sense than imputing globally.

joeycouse commented 3 years ago

I'll third this issue. An example I'm facing is imputing the finish time of a race which is very track dependent. Imputing with a group mean for the track would make much more sense.
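A minimal sketch of the group-wise imputation described in the last two comments, in plain Python rather than the recipes API. The function names, the column names, and the fallback to the overall mean for groups not seen in training are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def fit_group_means(rows, group_key, value_key):
    """Learn per-group means (and an overall mean) from training rows only."""
    by_group = defaultdict(list)
    for row in rows:
        if row[value_key] is not None:
            by_group[row[group_key]].append(row[value_key])
    group_means = {g: mean(vals) for g, vals in by_group.items()}
    overall = mean([v for vals in by_group.values() for v in vals])
    return group_means, overall


def impute(rows, group_key, value_key, group_means, overall):
    """Fill missing values with the learned group mean.

    Assumption: groups absent from training fall back to the overall mean.
    """
    out = []
    for row in rows:
        row = dict(row)  # copy so the input rows are not modified
        if row[value_key] is None:
            row[value_key] = group_means.get(row[group_key], overall)
        out.append(row)
    return out


train = [
    {"track": "A", "finish_time": 60.0},
    {"track": "A", "finish_time": 64.0},
    {"track": "B", "finish_time": 90.0},
    {"track": "B", "finish_time": None},
]
new = [
    {"track": "A", "finish_time": None},
    {"track": "C", "finish_time": None},  # track not seen in training
]

group_means, overall = fit_group_means(train, "track", "finish_time")
train_imputed = impute(train, "track", "finish_time", group_means, overall)
new_imputed = impute(new, "track", "finish_time", group_means, overall)
```

Separating the fit from the apply step means the group means always come from the training rows, even when the values being filled are in new data.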

jestover commented 2 years ago

I've recently started trying out the tidymodels packages and was surprised that this was not an option. By the way, is there any advantage to using recipes over just transforming the data in dplyr or something similar? That is what I wound up doing.

joeycouse commented 2 years ago

@jestover the advantage of using recipes is protection from data leakage, as operations are done within the resampling loop. If you do those transformations across the entire training set (i.e. outside resampling), you introduce bias into your model's performance estimates.

jestover commented 2 years ago

Thanks @joeycouse! Just to make sure I'm understanding correctly let me ask about a simple example. If I demean a variable using dplyr it will obviously use the sample mean from the entire dataset. If I use recipes and then split my data into a training set and a testing set then recipes will demean the training set using the sample mean from the training set and demean the testing set using the sample mean from the testing set?

joeycouse commented 2 years ago

@jestover, along those lines but slightly different. Any information from the testing set, including the mean, should never be used as part of the model tuning process. If you intend to subtract the mean value, the sample mean estimated from the training set should be applied to the testing set. The same principle applies to cross-validation. In the case of 10-fold CV, you would estimate the mean from the 9 analysis folds, and that estimated mean would be subtracted from the values in the single hold-out fold. recipes handles this nuance automatically and protects users from the common pitfalls associated with feature engineering.
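The principle can be sketched in a few lines of plain Python. This illustrates the idea only and is not how recipes is implemented; the helper names and the simple interleaved fold assignment are assumptions.

```python
from statistics import mean


def fit_center(train_values):
    """Estimate the centering statistic from the training data only."""
    return mean(train_values)


def apply_center(values, center):
    """Subtract a previously estimated mean from any set of values."""
    return [v - center for v in values]


def centered_holdouts(values, k):
    """For each of k folds, center the hold-out fold with the analysis-fold mean."""
    folds = [values[i::k] for i in range(k)]  # assumed: simple interleaved folds
    results = []
    for i, holdout in enumerate(folds):
        analysis = [v for j, fold in enumerate(folds) if j != i for v in fold]
        results.append(apply_center(holdout, fit_center(analysis)))
    return results


train = [1.0, 2.0, 3.0, 4.0]
test = [10.0, 20.0]

center = fit_center(train)                  # uses training values only
train_centered = apply_center(train, center)
test_centered = apply_center(test, center)  # the test mean is never computed

cv_centered = centered_holdouts(train, k=2)
```

The test values are shifted by the training mean (2.5), so they do not end up centered at zero; that is expected and is the point of the exercise.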

jestover commented 2 years ago

Ok, that makes a lot of sense. I appreciate the clarification @joeycouse

mitokic commented 2 years ago

This would be great for use in a dataset with multiple time series when building a global model of all time series (panel data). Right now you would have to build a custom recipe step to apply groupwise transformations. Please consider working on this feature!

tedmoorman commented 2 years ago

This should've been created a long time ago. Looks like I'll have to create a custom step for my project. Bummer.

StatsMatt commented 2 years ago

I have another use case. Let's say I am trying to predict which vehicles will repair using the make and model name. I can concatenate the make and model of my training set and use the repair percentage of that make and model as a numeric predictor (i.e. target encoding). So, I can group by make and model and calculate the mean of 1's and 0's for the repair percentage. I will most likely have unseen levels, so I will use this in combo with step_novel. Thanks.
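A rough sketch of that encoding in plain Python, not the recipes API. The function names, the example data, and the choice to give unseen make/model combinations the overall repair rate are assumptions made for illustration; the comment itself only mentions handling unseen levels with step_novel.

```python
from collections import defaultdict


def fit_target_encoding(rows, keys, target):
    """Learn the mean of a 0/1 target per combination of key columns."""
    totals = defaultdict(lambda: [0, 0])  # level -> [sum of target, row count]
    for row in rows:
        level = tuple(row[k] for k in keys)
        totals[level][0] += row[target]
        totals[level][1] += 1
    rates = {level: s / n for level, (s, n) in totals.items()}
    overall = sum(s for s, _ in totals.values()) / sum(n for _, n in totals.values())
    return rates, overall


def encode(rows, keys, rates, overall):
    """Map each row to its learned rate.

    Assumption: unseen levels receive the overall training rate.
    """
    return [rates.get(tuple(row[k] for k in keys), overall) for row in rows]


train = [
    {"make": "Ford", "model": "Focus", "repaired": 1},
    {"make": "Ford", "model": "Focus", "repaired": 0},
    {"make": "Toyota", "model": "Corolla", "repaired": 1},
    {"make": "Toyota", "model": "Corolla", "repaired": 1},
]
new = [
    {"make": "Ford", "model": "Focus"},
    {"make": "Honda", "model": "Civic"},  # combination not seen in training
]

rates, overall = fit_target_encoding(train, ["make", "model"], "repaired")
encoded_new = encode(new, ["make", "model"], rates, overall)
```

As with the mean-centering discussion earlier in the thread, the rates should be estimated from the analysis data only and then applied to held-out rows.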

topepo commented 2 years ago

Regarding a global method to make all steps capable of running computations per group, IMO that's not a great design choice so we won't be doing it.

However, there are a lot of cases where this might make sense. Panel data may be one. For me, I'd love to make a set of chemometrics steps for processing spectra or high-content screening data. It would make sense, in those instances, to be able to group by a set of columns.

There are a few requests in the thread here that can already be accomplished without additional (overall) changes.

My suggestion: for specific features, start new GH issues with the motivation and a worked, reproducible example of what you want. Responding at the end of an issue that proposed general changes isn't a good way to go about creating new features.

There are going to be times where we decline. That's disappointing but ok; there are custom steps that can be created. textrecipes is a great example of this. We didn't have a structure to work with tokenized data and the author made one that works really well for the particular problem that he is trying to solve. We didn't add this universally since that didn't make sense.

I'm going to close this and hope that more specific issues can be posted with more details. @ClaytonJY it would be great to outline what exactly you would want for panel data in particular.

github-actions[bot] commented 2 years ago

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex https://reprex.tidyverse.org) and link to this issue.