Closed tomazweiss closed 2 years ago
Hi Tidymodels/Recipes! Thanks for all your efforts and hard work. I just recently started using Recipes and I am enjoying it very much. I agree with @tomazweiss: a step_outliers feature as described would be a really nice addition. The ability to winsorize (cap) would be a very helpful option. Thanks so much!
I think step_range() could be enhanced to accomplish this. Currently, hard-coding the min and max arguments is the only option. But it would be nice if the step could calculate those limits dynamically, based on specified quantiles from the training data.
Of course, a brand new step_winsorize() or step_trim() would work too.
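To make the quantile idea concrete, here is a minimal base R sketch of learning the limits from training data and reusing them on new data. The helper names fit_winsor_limits() and apply_winsor() are hypothetical; they are not part of recipes, and this is not how step_range() works today.

```r
# Hypothetical helpers, not part of recipes: learn capping limits from the
# training data at the requested quantiles, then reuse them on any new data.
fit_winsor_limits <- function(x, probs = c(0.05, 0.95)) {
  stopifnot(is.numeric(x), length(probs) == 2, probs[1] < probs[2])
  quantile(x, probs = probs, na.rm = TRUE, names = FALSE)
}

# Cap (winsorize) values at previously learned limits; NAs pass through.
apply_winsor <- function(x, limits) {
  pmin(pmax(x, limits[1]), limits[2])
}

train <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)
test  <- c(-50, 5, 500)

limits <- fit_winsor_limits(train, probs = c(0.1, 0.9))
capped <- apply_winsor(test, limits)
```

Keeping the fit and apply stages separate mirrors the prep/bake split in recipes: the limits come only from the training data and are applied unchanged to new data.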
Hi, I have created a new package to handle this issue. I believe that, like themis, this should live outside of recipes. Check it out: https://github.com/brunocarlin/tidy.outliers
Interested as to why you think this should live outside of recipes? For me, replacing outliers is a logical feature engineering step that sits well within other recipe transformation steps.
I think it follows the same principle as the themis package, where an advanced feature should live in its own package, but we can consult with the maintainers for guidance. I have paused development since there was very little interest generated, and the two models that I wanted to test ran successfully on both CI/CD and prod, but if necessary, adding more models is trivial at this point.
I see now - so long as the step can be included as part of the recipe (in the same way as the themis steps can) - that's fine for the recipe I am looking to write. At the moment, I need to handle outliers prior to the recipe in my input data - e.g. by manually setting outliers to NA and then using a step_impute in the recipe. Would be nice to handle it all in a step of the recipe.
You say you have paused development - are there any plans to resume / push what is there to CRAN?
Right now I have started a new job, so probably not for the foreseeable future. What you can already do with this package is use the lower-level functions step_outliers_maha or step_outliers_lookout to create a column with any name, and then apply a simple mutate to that column with an if_else statement, for example step_mutate(to_replace = if_else(named_col > .95, NA, to_replace)), and then use a step filter to take the created column out of the df.
Thanks Bruno - I'm new to the package and hadn't noticed step_mutate. I was able to use that to identify outliers in the column of interest and replace them with NA values and then use an imputation method (step_impute_knn) to replace them. Now all my feature engineering is contained in the single recipe as desired.
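A workflow along these lines might look like the sketch below. It is only an illustration under stated assumptions: the data, the column names x, z and y, and the use of Tukey fences as the outlier rule are all made up here, since the comment does not say which rule was used. The fences are computed once from the training data and injected with !! so that the same cutoffs are applied when the recipe is baked on new data.

```r
library(recipes)

# Toy training data (assumed for illustration) with one obvious outlier in x.
set.seed(1)
train <- data.frame(
  x = c(rnorm(50), 25),
  z = c(rnorm(50), 0),
  y = rnorm(51)
)

# Tukey fences computed once from the training data.
q <- quantile(train$x, c(0.25, 0.75), names = FALSE)
lower <- q[1] - 1.5 * (q[2] - q[1])
upper <- q[2] + 1.5 * (q[2] - q[1])

rec <- recipe(y ~ ., data = train) %>%
  # Flag values outside the fences as missing; !! embeds the fixed cutoffs.
  step_mutate(x = ifelse(x < !!lower | x > !!upper, NA_real_, x)) %>%
  # Fill the flagged values from the nearest neighbours.
  step_impute_knn(x)

baked <- bake(prep(rec), new_data = NULL)
```

Computing the cutoffs inside step_mutate() with quantile() directly would recompute them on whatever data is being baked, which is why they are fixed up front in this sketch.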
I agree that it would be better to be in a side-package and would encourage you to do so. Let us know if you need a hand with anything.
Hey @topepo, @juliasilge, @BrisbanePom, and @mattwarkentin, I have updated the package to version 0.2.0. It now has 5 different functions to detect outliers, including a very flexible user-defined function approach called univariate that implements what Matt was asking for. I also changed the naming to use scores instead of probabilities, since some methods don't return estimates of probabilities.
If you could give some feedback on what I need to integrate the package into the tidymodels ecosystem, that would be great! Thanks for the amazing framework; it was quite easy to extend. Now I plan to manually add some tune parameters, like controlling the number of trees in some forest-based methods.
Check it out at https://github.com/brunocarlin/tidy.outliers
I'll be able to look at it before the end of the week.
Progress is going well in https://github.com/brunocarlin/tidy.outliers. I'll close this issue and encourage further discussion regarding outlier steps to happen in that repository.
Even though step_spatialsign(), step_BoxCox() and step_YeoJohnson() can take care of outliers, it could be useful to have a step that handles them more directly, for example with Tukey's rule (Q1 − 1.5 IQR, Q3 + 1.5 IQR). The user could then have an option to remove them or replace the value with the cutoff or with NA.
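As a rough base R sketch of what such a step could do, the hypothetical helpers below compute Tukey's fences and then apply one of the three options mentioned above. The names tukey_fences() and handle_outliers() and the action argument are assumptions for illustration, not an existing recipes API.

```r
# Hypothetical helper: Tukey fences (Q1 - k * IQR, Q3 + k * IQR).
tukey_fences <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  c(lower = q[1] - k * iqr, upper = q[2] + k * iqr)
}

# Hypothetical helper: cap outliers at the fence, set them to NA, or drop them.
handle_outliers <- function(x, fences, action = c("cap", "na", "remove")) {
  action <- match.arg(action)
  is_out <- !is.na(x) & (x < fences[["lower"]] | x > fences[["upper"]])
  switch(action,
    cap    = pmin(pmax(x, fences[["lower"]]), fences[["upper"]]),
    na     = replace(x, is_out, NA),
    remove = x[!is_out]
  )
}

x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)
fences <- tukey_fences(x)

capped  <- handle_outliers(x, fences, "cap")
blanked <- handle_outliers(x, fences, "na")
dropped <- handle_outliers(x, fences, "remove")
```

Note that the "remove" option changes the length of the vector, so in a recipe it would correspond to dropping rows rather than transforming a column.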