Feature suggestion: step_outliers

tomazweiss commented 4 years ago

Even though step_spatialsign(), step_BoxCox() and step_YeoJohnson() can take care of outliers, it could be useful to have a step that handles them more directly. For example with Tukey's rule (Q1 − 1.5 IQR, Q3 + 1.5 IQR). The user could then have an option to remove them or replace the value with the cutoff or with NA.
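
For illustration, Tukey's rule with the three proposed options could be sketched in base R as follows. The helper name `tukey_outliers()` and its arguments are hypothetical and are not part of recipes; this is a sketch under those assumptions, not a proposed implementation.

```r
# Hypothetical helper (not a recipes function): handle outliers by Tukey's rule.
# `k` is the IQR multiplier (1.5 in the classic rule).
tukey_outliers <- function(x, action = c("cap", "na", "remove"), k = 1.5) {
  action <- match.arg(action)
  q <- stats::quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  lower <- q[1] - k * iqr
  upper <- q[2] + k * iqr
  # Missing values are never flagged as outliers
  is_out <- !is.na(x) & (x < lower | x > upper)
  switch(action,
    cap    = pmin(pmax(x, lower), upper), # replace with the nearest cutoff
    na     = replace(x, is_out, NA),      # replace with NA
    remove = x[!is_out]                   # drop the outlying values
  )
}

x <- c(1, 2, 3, 4, 100)          # Q1 = 2, Q3 = 4, so the fences are -1 and 7
tukey_outliers(x, "cap")         # 1 2 3 4 7
tukey_outliers(x, "na")          # 1 2 3 4 NA
tukey_outliers(x, "remove")      # 1 2 3 4
```

Note that the "remove" option changes the length of the vector, so inside a recipe it would have to drop whole rows rather than single values.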

earlev4 commented 4 years ago

Hi Tidymodels/Recipes! Thanks for all efforts and hard work. I just recently started using Recipes and I am enjoying it very much. I agree with @tomazweiss, a step_outliers feature as described would be a really nice addition. The ability to winsorize (cap) would be a very helpful option. Thanks so much!

vadimus202 commented 3 years ago

I think step_range() could be enhanced to accomplish this. Currently, hard-coding the min and max arguments is the only option. But it would be nice if the step could calculate those limits dynamically, based on specified quantiles from the training data.

Of course, a brand new step_winsorize() or step_trim() would work too.
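
To make the idea concrete, here is a rough base R sketch of limits that are estimated from training data and then reused on new data. The function names and the 5%/95% defaults are assumptions for illustration only; neither function exists in recipes.

```r
# Hypothetical helpers (not recipes functions).
# Estimate the lower and upper limits once, from the training data.
fit_winsor_limits <- function(train, probs = c(0.05, 0.95)) {
  stats::quantile(train, probs = probs, na.rm = TRUE, names = FALSE)
}

# Cap any vector, training or new, at the previously estimated limits.
apply_winsor_limits <- function(x, limits) {
  pmin(pmax(x, limits[1]), limits[2])
}

train  <- 1:101
limits <- fit_winsor_limits(train)           # 6 and 96 for this training vector
apply_winsor_limits(c(-10, 50, 500), limits) # 6 50 96
```

Keeping the estimation and the application separate mirrors the way a recipe step learns its parameters during prep() and applies them during bake().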

brunocarlin commented 3 years ago

Hi, I have created a new package to handle this issue. I believe that, like themis, this should live outside of recipes. Check it out: https://github.com/brunocarlin/tidy.outliers

BrisbanePom commented 3 years ago

Interested as to why you think this should live outside of recipes? For me, replacing outliers is a logical feature engineering step that sits well within other recipe transformation steps.

brunocarlin commented 3 years ago

I think it follows the same principle as the themis package, where an advanced feature should live in its own package, but we can consult with the maintainers for guidance. I have paused development since there was very little interest generated and the two models that I wanted to test ran successfully on both CI/CD and prod, but if necessary, adding more models is trivial at this point.

BrisbanePom commented 3 years ago

I see now - so long as the step can be included as part of the recipe (in the same way as the themis steps can) - that's fine for the recipe I am looking to write. At the moment, I need to handle outliers prior to the recipe in my input data - e.g. by manually setting outliers to NA and then using a step_impute in the recipe. Would be nice to handle it all in a step of the recipe.

You say you have paused development - are there any plans to resume / push what is there to CRAN?

brunocarlin commented 3 years ago

I have just started a new job, so probably not for the foreseeable future. What you can already do with this package is use the lower-level functions step_outliers_maha or step_outliers_lookout to create a column with any name, and then apply a simple mutate on that column with an if_else statement, for example step_mutate(to_replace = if_else(named_col > .95,NA,to_replace)), and then use a step filter to take the created column out of the df.

BrisbanePom commented 3 years ago

Thanks Bruno - I'm new to the package and hadn't noticed step_mutate. I was able to use that to identify outliers in the column of interest and replace them with NA values and then use an imputation method (step_impute_knn) to replace them. Now all my feature engineering is contained in the single recipe as desired.
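
A minimal sketch of that kind of recipe, assuming the built-in mtcars data, the hp column as the column of interest, and Tukey's rule as the outlier definition (none of these come from the thread). The fences are computed from the training data beforehand and spliced into step_mutate() with `!!`, so the same cutoffs are reused when the recipe is applied to new data.

```r
library(recipes)
library(dplyr)

# Tukey fences for hp, computed once from the training data (mtcars here)
q     <- quantile(mtcars$hp, probs = c(0.25, 0.75), names = FALSE)
iqr   <- q[2] - q[1]
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

rec <- recipe(mpg ~ ., data = mtcars) %>%
  # Flag outliers in hp by setting them to NA
  step_mutate(hp = if_else(hp < !!lower | hp > !!upper, NA_real_, hp)) %>%
  # Fill the NA values in from the nearest neighbours
  step_impute_knn(hp)

prepped <- prep(rec, training = mtcars)
baked   <- bake(prepped, new_data = NULL)

anyNA(baked$hp) # FALSE once the imputation step has run
```

`NA_real_` is used instead of a bare `NA` so that both branches of if_else() have the same type, which older dplyr versions require.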

topepo commented 3 years ago

I agree that it would be better to be in a side-package and would encourage you to do so. Let us know if you need a hand with anything.

brunocarlin commented 2 years ago

Hey @topepo, @juliasilge, @BrisbanePom, and @mattwarkentin, I have updated the package to version 0.2.0. It now has 5 different functions to detect outliers, including a very flexible user-defined function approach called univariate that implements what Matt was asking for. I also changed the naming to use scores instead of probabilities, since some methods don't return estimates of probabilities.

If you could give some feedback on what I need to do to integrate the package into the tidymodels ecosystem, that would be great! Thanks for the amazing framework; it was quite easy to extend. Now I plan to manually add some tune parameters, like controlling the number of trees in some forest-based methods.

check it out at https://github.com/brunocarlin/tidy.outliers

topepo commented 2 years ago

I'll be able to look at it before the end of the week.

EmilHvitfeldt commented 2 years ago

Progress is going well in https://github.com/brunocarlin/tidy.outliers. I'll close this issue and encourage further discussion regarding outlier steps to happen in that repository.

github-actions[bot] commented 2 years ago

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex https://reprex.tidyverse.org) and link to this issue.

tidymodels / recipes

Feature suggestion: step_outliers #484