ropensci / skimr

A frictionless, pipeable approach to dealing with summary statistics
https://docs.ropensci.org/skimr

Feature Request: Use results for further wrangling #557

Closed RobertMyles closed 4 years ago

RobertMyles commented 4 years ago

Apologies if this is a) beyond the scope of skimr, or b) already available somewhere in the package (I haven't found it). Basically, I think it would be really useful to be able to use the results for further data wrangling. The use case that prompted this is a large dataframe with 150-odd columns, 50 or so of which have a complete_rate of 0. What I'd like to be able to do is use these results, for example with dplyr::filter(). It could work something like this:

df %>% 
  skim() %>% 
  filter_if(.complete_rate < 0.25)

The result would be the original dataframe without any skimr information, just filtered based on skimr's analytics. Is there a way to do this with skimr?

I know that there are other ways to do this in R, but it would be useful to do it with skimr as part of a more general data exploration workflow.

Thanks!

elinw commented 4 years ago

One of the basic models of skimr use is to use it iteratively for exploratory data wrangling, with the wrangling kept reproducible. I don't think this model would be a single long chain of piped-together functions; it would have several steps. This is also more in line with R as a functional language.

So, you can use a pipe chain that starts with skim() to get a vector of variable names where the variables meet some criterion, such as the one you have given. Then, in a second step, you can subset the original data frame to just those variables (with either subset notation [] or dplyr::select()).

I think you would have to do it as a two-step process: use skimr to identify the variables you want to keep, then return to the original data to select those columns.

library(skimr)
library(magrittr)
library(dplyr)

x <- c(rnorm(9), NA)
y <- c(rnorm(5), rep(NA, 5))
z <- c(rep("red", 3), rep("blue", 5), rep(NA, 2))
df <- data.frame(x, y, z)

# Step 1: use skim() to find variables with a complete rate above 0.5
var_list <- df %>%
  skim() %>%
  dplyr::filter(complete_rate > 0.5) %>%
  select(skim_variable)

# Step 2: subset the original data frame to just those variables
df %>% select(all_of(var_list$skim_variable))

You could potentially make that one chain but I'm not convinced it would be better, more maintainable code.
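For what it's worth, a single-chain version is possible by nesting the skim-based lookup inside select(). This is just a sketch of that alternative, using the same toy data frame as above; whether it is clearer than the two-step version is debatable:

```r
library(skimr)
library(dplyr)

x <- c(rnorm(9), NA)
y <- c(rnorm(5), rep(NA, 5))
z <- c(rep("red", 3), rep("blue", 5), rep(NA, 2))
df <- data.frame(x, y, z)

# One chain: the inner pipe computes the variable names to keep,
# the outer select() applies them to the original data frame.
kept <- df %>%
  select(all_of(
    df %>%
      skim() %>%
      filter(complete_rate > 0.5) %>%
      pull(skim_variable)
  ))

names(kept)  # x (0.9 complete) and z (0.8 complete) survive; y (0.5) does not
```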

RobertMyles commented 4 years ago

Thanks @elinw. That's the way I'm using skimr at the moment -- extracting a vector of variable names where the complete rate is under a certain proportion and then using it to filter the dataframe, in two steps, as you describe. I see your point, perhaps this wouldn't be a 'better' code workflow, but it certainly appeals to me, I have to say. Perhaps there would be significant work/overhead in trying to make dataframes/tibbles etc. aware of this extra skimr information, I'm not sure.

michaelquinn32 commented 4 years ago

There's a lot to love about skimr, but this is a case where we're doing a lot more than you'd probably want. You could still use our complete rate function, but dplyr handles all of this in a much nicer way.

iris %>%
  dplyr::select_if(~skimr::complete_rate(.x) > .5)