tidymodels / recipes

Pipeable steps for feature engineering and data preprocessing to prepare for modeling
https://recipes.tidymodels.org

Truncate a variable at quantiles of the variable distribution #850

Closed mattwarkentin closed 3 years ago

mattwarkentin commented 3 years ago

Feature

I don't think a step like this currently exists, so I wanted to nominate step_truncate (could also be called step_clip or step_clamp maybe??):

The step would truncate numeric variables based on percentiles of the variable distribution. For example, if you wanted to truncate x at the 1st and 99th percentile, this would assign the 1st and 99th percentile value to observations that are below and above these values, respectively.
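To make the idea concrete, here is a minimal sketch in Python using only the standard library. It is not the recipes API: the names `quantile` and `QuantileClipper` and the `fit`/`transform` split are all hypothetical, chosen only to show the behaviour described above. The cutoffs are estimated once from the training data and then reused on new data, which is an assumption about how such a step would behave, by analogy with how a recipe is prepped and then baked.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


def quantile(values: Sequence[float], p: float) -> float:
    """Quantile by linear interpolation between order statistics."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    ordered = sorted(values)
    pos = p * (len(ordered) - 1)          # fractional index into the sorted data
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] + frac * (ordered[hi] - ordered[lo])


@dataclass
class QuantileClipper:
    """Hypothetical helper: truncate values at quantiles learned from training data."""

    lower_p: float = 0.01
    upper_p: float = 0.99
    lower_: Optional[float] = None        # learned lower cutoff, set by fit()
    upper_: Optional[float] = None        # learned upper cutoff, set by fit()

    def fit(self, values: Sequence[float]) -> "QuantileClipper":
        # Estimate the cutoffs from the training data only.
        self.lower_ = quantile(values, self.lower_p)
        self.upper_ = quantile(values, self.upper_p)
        return self

    def transform(self, values: Sequence[float]) -> List[float]:
        if self.lower_ is None or self.upper_ is None:
            raise RuntimeError("call fit() before transform()")
        # Values below the lower cutoff become the cutoff; same for the upper one.
        return [min(max(v, self.lower_), self.upper_) for v in values]
```

For example, fitting `QuantileClipper(lower_p=0.25, upper_p=0.75)` on `[1.0, 2.0, 3.0, 4.0, 100.0]` learns cutoffs of 2.0 and 4.0, so transforming `[0.0, 3.0, 50.0]` gives `[2.0, 3.0, 4.0]`.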

Perhaps a step already exists but I didn't see it. If there is interest, I can contribute a PR.

juliasilge commented 3 years ago

In tidymodels/recipes#484 there has been some discussion of handling outliers, perhaps with Tukey's rule or something else. Later in that issue we raised the possibility that this might be better suited to an entirely separate recipes extension package for outlier feature engineering, in the same way that themis handles class imbalance and subsampling.

It does seem like there are quite a lot of approaches and it might make sense to have them all together in one package.

mattwarkentin commented 3 years ago

Ahh okay, yes I didn't come across that issue in my searches. It probably does make sense to have outlier preprocessing contained in an adjacent package.

topepo commented 3 years ago

Would you like to make one? If so, let us know if you need any help.

mattwarkentin commented 3 years ago

I am interested. Under the tidymodels umbrella?

topepo commented 2 years ago

We would be fine with something that lives in your repo as well as something that sits in the tidymodels org (maintained by you either way). In the latter case, it would be good to keep it in scope with tidymodels (e.g. not unrelated model functions).

There isn't much difference based on where it lives. infer and Emil's packages were developed by people outside our group, so you can always ping them to get advice (FYI, Emil is joining our team in a few weeks).
