Open kylebutts opened 4 years ago
I think this is a good idea. Some comments:
partition(). We could extend that function with an argument by_group, with a default value of FALSE.
bind() variant to reverse that operation.
"# get group_names to name list after group_split" is a pmap_chr operation.
Otherwise, we can go into more details during a PR. @elinw How does this sound to you?
Hi, I think it is an interesting concept for sure, and it's not so far from what skim does that it would violate the do-one-thing-well concept.
You might want to also look here:
https://github.com/elinw/skimrExtra
https://github.com/elinw/skimrextra/blob/master/R/skim_with.R#L8
The idea being to be able to create a variable table like that in many articles.
I also wonder if looking at the codebook package, which uses skimr extensively, might be a way to go, keeping things more modular.
One interesting aspect of this is that it includes a separate summary for each group. I have been thinking a lot about summary and why, to keep it from showing in the print, you currently have to call a print function explicitly. This is a broad conundrum in R that piping doesn't really play well with at the moment.
I'm thinking that this output could be achieved more simply using our current vocabulary. It approaches the skim-by-group problem from a different angle, one that was discussed initially at the #unconf and that @GShotwell had a lot of ideas about. At the time the whole model was different, but now that we have the one big data frame I do think we should revisit the option of taking the grouped output and returning it in list format by group, and then maybe giving it a nice print method instead of just using list print.
Please keep in mind that mtcars is a terrible sample data set for skimr because it has only one type of variable. Use something like CO2 instead to really understand what an output would look like.
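To make the point about sample data concrete, here is a quick base R check of the column classes in both data sets. This is only an illustration; the helper name first_class is made up for this sketch.

```r
# Hypothetical helper: report the first class of each column.
first_class <- function(df) {
  vapply(df, function(col) class(col)[1], character(1))
}

mtcars_types <- first_class(mtcars)
co2_types <- first_class(CO2)

# mtcars contains only numeric columns ...
table(mtcars_types)
# ... while CO2 mixes factor, ordered factor, and numeric columns.
table(co2_types)
```

Since skim output is organised by variable type, a data set with several types shows what the split tables would really look like.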
Are you definitely committed to having the summary by group? Since we already support calculating the statistics on groups, and having played with this a bit, I think it is actually not hard to take the skim object for grouped data and pull out the subtables for the top n grouping levels (and leave any other grouping in place for those). Also remember that we already store the group names as an attribute.
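As a rough sketch of the "pull out the subtables" idea with what exists today, assuming skimr v2 and dplyr are installed: skim the grouped data, then split the long result on the grouping column. Dropping to a plain data frame first is a simplification made for this sketch (it discards the skim_df class), and the names plain and by_group are illustrative only.

```r
library(skimr)

# skim() on grouped data returns one long skim_df that carries the
# grouping column (here Species) alongside the statistics.
skimmed <- skim(dplyr::group_by(iris, Species))

# Drop to a plain data frame, then split into one subtable per group level.
plain <- as.data.frame(skimmed)
by_group <- split(plain, plain$Species)

names(by_group)
by_group$virginica
```

This loses the skim print method for each element, which is part of why a dedicated list class with its own print method might be worth considering.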
This
#' @export
grouped_skims <- function(data, depth = 1) {
  assert_is_skim_df(data)
  groups <- group_names(data)
  if (length(groups) < 1) {
    stop("Must have at least one group to use grouped_skims")
  }
  # The first `depth` grouping variables define the subtables.
  group_vars <- as.character(groups[seq_len(depth)])
  # Any remaining grouping variables stay in place within each subtable.
  groups <- if (depth < length(groups)) {
    groups[(depth + 1):length(groups)]
  } else {
    NULL
  }
  # Build one key per row from the selected grouping columns.
  data$selected_groups <- do.call(paste, c(data[group_vars], sep = "_"))
  data <- dplyr::select(data, -dplyr::all_of(group_vars))
  base <- base_skimmers(data)
  data_as_list <- split(data, data$selected_groups)
  skimmers <- reconcile_skimmers(data, groups, base)
  reduced <- purrr::imap(data_as_list, simplify_skimdf, skimmers, groups, base)
  data_as_list <- purrr::map(data_as_list, dplyr::select, -selected_groups)
  # Note: the result of this call is not assigned; data_as_list is returned.
  reassign_skim_attrs(
    reduced,
    data,
    class = "skim_list",
    skimmers_used = skimmers
  )
  data_as_list
}
Gets you what you are getting now ... and I think that's a start. We'd need @michaelquinn32 to come up with a good verb for this. The one issue is that the nrows for the summary is wrong, but do we want summary here? I think we would really want to consider the print: right now it is doing skim print on the skim object in each element of the list, but I think it would be cleaner to return the tibble/data frame as e.g. results$virginica$skim_df and then results$virginica$summary. People will be confused by the print here. On the other hand, maybe just make a nicer print function that doesn't look like a list.
Updated with cleaner code.
@michaelquinn32 and I discussed this and he proposed more of a tidyverse-style semantic, where the skimmed object is piped to a function that identifies the groups to be used for creating the subtables and then it is piped to a variation of partition() based on those groups. I think this could work; we'd have to add an attribute for these groups and probably modify the groups attribute and the main skim data frame object.
Then maybe we can make a convenience function that combines all the steps.
I like the idea of including it in partition() as well. The basic reason I wanted the request was to make it easier to make a table that had group names as the columns and the stats as the rows, rather than the tidy format of everything as a row. How can I help with this code?
Wait, group names as columns? Can you show what you mean? Just make an example table/rough sketch.
I think I'm pretty far along in thinking about how to do one model of this that is more like partition (still rows but broken out by groups).
Yes, sorry I was unclear! I was trying to end up with a table like this and figured the easiest way would be to split by group into a list and then lapply to make each column. So we are thinking of the same idea (still rows but broken out by groups), but I would then do an extra step to get a table like below
Way back when we started skimr we played around with this a lot, but because skimr uses data frames and each data frame column must be a single data type it is quite complex, which is why there are so many packages for making nice tables. I think what we could do is concentrate on producing data that would be processed by those.
Researchers often want to create summary statistics for different groups. In tables, each column tends to be a group and each row the summary stat for that group/var. It would be nice if skimr made it easier to reshape the stats for each group into a list.
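For illustration only, the target layout (groups as columns, statistics as rows) can be sketched in base R for a single variable. This is not the split_by_group function and does not use skimr; summarise_one and wide are names invented for this sketch.

```r
# Hypothetical helper: the statistics to report for one group.
summarise_one <- function(x) {
  c(n = length(x), mean = mean(x), sd = sd(x))
}

# Split one variable by group, then compute the statistics per group.
# sapply() binds the per-group vectors into a matrix with one column per
# group and one row per statistic.
uptake_by_type <- split(CO2$uptake, CO2$Type)
wide <- sapply(uptake_by_type, summarise_one)

wide
```

Doing the same across several variables and types is where a list of per-group skim tables would help.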
I had an initial go at this function, but am not 100% confident I'm not missing something. Here's a reprex with the split_by_group function:
```
library(skimr)
library(tidyverse)
```
This outputs:
```
$"4 cyl - 3 gears"
── Data Summary ────────────────────────
                           Values
Name                       Piped data
Number of rows             32
Number of columns          11
```