Closed RobertMyles closed 4 years ago
One of the basic models of skimr use is iterative exploratory data wrangling, with the wrangling kept reproducible. I don't think that model calls for a single long chain of piped-together calls; it would have several steps. This is also more in line with R as a functional language.
So, you can use a pipe chain that starts with skim() to get a vector of variable names where the variables meet some criterion, such as the one you have given. Then, in a second step, you can subset the original data frame (with either subset notation [] or dplyr::select()) to just those variables.
I think you would have to do it in a two-step process: use skimr to identify the variables you want to keep, and then return to the original data to select the columns.
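library(skimr)
library(magrittr)
library(dplyr)

# Example data with differing amounts of missingness
x <- c(rnorm(9), NA)
y <- c(rnorm(5), rep(NA, 5))
z <- c(rep("red", 3), rep("blue", 5), rep(NA, 2))
df <- data.frame(x, y, z)

# Step 1: use skim() to find the variables that are more than 50% complete
var_list <- df %>%
  skim() %>%
  dplyr::filter(complete_rate > .5) %>%
  select(skim_variable)

# Step 2: go back to the original data and keep only those columns
df %>% select(all_of(var_list$skim_variable))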
You could potentially make that one chain, but I'm not convinced it would be better, more maintainable code.
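If the two steps come up repeatedly, one option is to wrap them in a small function rather than forcing them into one chain. The following is only a sketch: keep_complete_columns() and its min_rate argument are hypothetical names, not part of skimr, and the example data is made up for illustration.

```r
library(skimr)
library(dplyr)

# Hypothetical helper (not part of skimr): keep only the columns whose
# complete_rate, as reported by skim(), exceeds `min_rate`.
keep_complete_columns <- function(data, min_rate = 0.5) {
  keep <- data %>%
    skim() %>%
    dplyr::filter(complete_rate > min_rate) %>%
    dplyr::pull(skim_variable)
  # skim() groups its rows by variable type, so intersect() is used here
  # to restore the column order of the original data.
  dplyr::select(data, dplyr::all_of(intersect(names(data), keep)))
}

# Made-up example data: x and z are 75% complete, y is 25% complete.
df <- data.frame(
  x = c(1, 2, 3, NA),
  y = c(1, NA, NA, NA),
  z = c("red", "blue", NA, "red")
)

result <- keep_complete_columns(df)
names(result)
```

The result is a plain data frame with no skimr information attached, and the threshold stays visible as an argument at the call site.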
Thanks @elinw. That's the way I'm using skimr at the moment -- extracting a vector of variable names where the complete rate is under a certain proportion and then using it to filter the dataframe, in two steps, as you describe. I see your point that this perhaps wouldn't be a 'better' code workflow, but it certainly appeals to me, I have to say. Perhaps there would be significant work/overhead in trying to make dataframes/tibbles etc. aware of this extra skimr information; I'm not sure.
There's a lot to love about skimr, but this is a case where we're doing a lot more than you'd probably want. You could still use our complete rate function, but dplyr handles all of this in a much nicer way.
iris %>%
  dplyr::select_if(~ skimr::complete_rate(.x) > .5)
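Since dplyr 1.0.0, the scoped verbs such as select_if() are superseded in favour of select() combined with where(). A sketch of the same idea in that style follows. It assumes dplyr 1.0.0 or later, and the small data frame is made up so that some columns actually contain missing values (iris has none).

```r
library(dplyr)

# Made-up example data: x and z are 75% complete, y is 25% complete.
df <- data.frame(
  x = c(1, 2, NA, 4),
  y = c(NA, NA, NA, 4),
  z = c("a", "b", "c", NA)
)

# where() applies the predicate to each column and keeps the columns
# for which it returns TRUE.
kept <- df %>%
  select(where(~ skimr::complete_rate(.x) > 0.5))

names(kept)
```

Because the predicate is evaluated column by column on the original data, the column order is preserved and no intermediate vector of names is needed.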
Apologies if this is a) beyond the scope of skimr, or b) already available somewhere in the package (I haven't found it). Basically, I think it would be really useful to be able to use the results for further data wrangling. The use case that prompted this is a large dataframe with 150-odd columns, 50 or so of which have a complete_rate of 0. What I'd like to be able to do is use these results, for example with dplyr::filter(). So it could work something like this:
The result would be the original dataframe without any skimr information, just filtered based on skimr's analytics. Is there a way to do this with skimr?
I know that there are other ways to do this in R, but it would be useful to do it with skimr as part of a more general data exploration workflow.
Thanks!