ropensci / skimr

A frictionless, pipeable approach to dealing with summary statistics
https://docs.ropensci.org/skimr

Split skim_df by group feature for reshape.R #576

Open kylebutts opened 4 years ago

kylebutts commented 4 years ago

Researchers often want to create summary statistics for different groups. In tables, each column tends to be a group and each row a summary stat for that group/variable. It would be nice if skimr made it easier to reshape the stats for each group into a list.

I had an initial go at this function, but I'm not fully confident that I haven't missed something. Here's a reprex with the split_by_group function:

`
library(skimr)
library(tidyverse)

reconcile_skimmers <- function(data, groups, base) {
    all_columns <- names(data)
    skimmers_used <- skimmers_used(data)
    with_base_columns <- c(
        "skim_variable",
        "skim_type",
        base,
        collapse_skimmers(skimmers_used)
    )
    extra_cols <- dplyr::setdiff(all_columns, with_base_columns)
    if (length(extra_cols) > 0) {
        grouped <- dplyr::group_by(data, .data$skim_type)
        complete_by_type <- dplyr::summarize_at(
            grouped,
            dplyr::vars(extra_cols),
            ~ !all(is.na(.x))
        )
        complete_cols <- purrr::pmap(
            complete_by_type,
            get_complete_columns,
            names = extra_cols
        )
        new_cols_by_type <- rlang::set_names(
            complete_cols,
            complete_by_type$skim_type
        )
        skimmers_used <- purrr::list_merge(skimmers_used, !!!new_cols_by_type)
    }

    skimmers_used
}

collapse_skimmers <- function(skimmers_used) {
    with_type <- purrr::imap(skimmers_used, ~ paste(.y, .x, sep = "."))
    purrr::flatten_chr(with_type)
}

get_complete_columns <- function(skim_type, ..., names) {
    names[c(...)]
}

split_by_group <- function(data){
    assert_is_skim_df(data)

    groups <- group_names(data)
    base <- base_skimmers(data)

    skimmers <- reconcile_skimmers(data, groups, base)

    # get group_names to name list after group_split
    group_name <- data %>% 
        dplyr::group_keys(!!! groups) %>%
        unite(group_name, sep = " - ") %>%
        .[["group_name"]]   

    # Name list by group_names
    data_by_group <- data %>% dplyr::group_split(!!! groups) %>% setNames(group_name)

    # Make each data frame a skim_df
    data_by_group <- lapply(data_by_group, function(x) { 
        attr(x, "class") <- c("skim_df", class(x)) 
        return(x)
    } )

    # Make list a skim_lists
    attr(data_by_group, "class") <- c("skim_list", "list")

    # Return skim_list split by groups
    data_by_group

}

data <- mtcars %>% 
    mutate(cyl = factor(.$cyl, levels = c(4,6,8), labels = c("4 cyl", "6 cyl", "8 cyl")),
           gear = factor(.$gear, levels = c(3,4,5), labels = c("3 gears", "4 gears", "5 gears"))) %>%
    group_by(cyl, gear) %>% 
    skim()

split_by_group(data)

`

This outputs:

`
$"4 cyl - 3 gears"
── Data Summary ────────────────────────
Values
Name                       Piped data
Number of rows             32
Number of columns          11
_______________________
Column type frequency:               
    numeric                  9         
________________________             
Group variables            None      

── Variable type: numeric ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
skim_variable n_missing complete_rate   mean    sd     p0    p25    p50    p75   p100 hist  cyl   gear   
1 mpg                   0             1  21.5     NA  21.5   21.5   21.5   21.5   21.5  ▁▁▇▁▁ 4 cyl 3 gears
2 disp                  0             1 120.      NA 120.   120.   120.   120.   120.   ▁▁▇▁▁ 4 cyl 3 gears
3 hp                    0             1  97       NA  97     97     97     97     97    ▁▁▇▁▁ 4 cyl 3 gears
4 drat                  0             1   3.7     NA   3.7    3.7    3.7    3.7    3.7  ▁▁▇▁▁ 4 cyl 3 gears
5 wt                    0             1   2.46    NA   2.46   2.46   2.46   2.46   2.46 ▁▁▇▁▁ 4 cyl 3 gears
6 qsec                  0             1  20.0     NA  20.0   20.0   20.0   20.0   20.0  ▁▁▇▁▁ 4 cyl 3 gears
7 vs                    0             1   1       NA   1      1      1      1      1    ▁▁▇▁▁ 4 cyl 3 gears
8 am                    0             1   0       NA   0      0      0      0      0    ▁▁▇▁▁ 4 cyl 3 gears
9 carb                  0             1   1       NA   1      1      1      1      1    ▁▁▇▁▁ 4 cyl 3 gears

$"4 cyl - 4 gears"
── Data Summary ────────────────────────
Values    
Name                       Piped data
Number of rows             32        
Number of columns          11        
_______________________              
Column type frequency:               
    numeric                  9         
________________________             
Group variables            None      

── Variable type: numeric ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
skim_variable n_missing complete_rate   mean     sd    p0   p25   p50    p75   p100 hist  cyl   gear   
1 mpg                   0             1  26.9   4.81  21.4  22.8  25.8   30.9   33.9  ▇▂▂▂▅ 4 cyl 4 gears
2 disp                  0             1 103.   30.7   71.1  78.0  93.5  126.   147.   ▇▁▂▂▃ 4 cyl 4 gears
3 hp                    0             1  76    20.1   52    64.2  66     93.5  109    ▅▇▁▅▂ 4 cyl 4 gears
4 drat                  0             1   4.11  0.372  3.69  3.90  4.08   4.14   4.93 ▇▇▂▁▂ 4 cyl 4 gears
5 wt                    0             1   2.38  0.601  1.62  1.91  2.26   2.87   3.19 ▇▇▃▃▇ 4 cyl 4 gears
6 qsec                  0             1  19.6   1.45  18.5  18.6  19.2   19.9   22.9  ▇▆▁▁▂ 4 cyl 4 gears
7 vs                    0             1   1     0      1     1     1      1      1    ▁▁▇▁▁ 4 cyl 4 gears
8 am                    0             1   0.75  0.463  0     0.75  1      1      1    ▂▁▁▁▇ 4 cyl 4 gears
9 carb                  0             1   1.5   0.535  1     1     1.5    2      2    ▇▁▁▁▇ 4 cyl 4 gears

$"4 cyl - 5 gears"
── Data Summary ────────────────────────
Values    
Name                       Piped data
Number of rows             32        
Number of columns          11        
_______________________              
Column type frequency:               
    numeric                  9         
________________________             
Group variables            None      

── Variable type: numeric ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
skim_variable n_missing complete_rate   mean     sd    p0    p25    p50    p75   p100 hist  cyl   gear   
1 mpg                   0             1  28.2   3.11  26     27.1   28.2   29.3   30.4  ▇▁▁▁▇ 4 cyl 5 gears
2 disp                  0             1 108.   17.8   95.1  101.   108.   114    120.   ▇▁▁▁▇ 4 cyl 5 gears
3 hp                    0             1 102    15.6   91     96.5  102    108.   113    ▇▁▁▁▇ 4 cyl 5 gears
4 drat                  0             1   4.1   0.467  3.77   3.94   4.1    4.26   4.43 ▇▁▁▁▇ 4 cyl 5 gears
5 wt                    0             1   1.83  0.443  1.51   1.67   1.83   1.98   2.14 ▇▁▁▁▇ 4 cyl 5 gears
6 qsec                  0             1  16.8   0.141 16.7   16.8   16.8   16.8   16.9  ▇▁▁▁▇ 4 cyl 5 gears
7 vs                    0             1   0.5   0.707  0      0.25   0.5    0.75   1    ▇▁▁▁▇ 4 cyl 5 gears
8 am                    0             1   1     0      1      1      1      1      1    ▁▁▇▁▁ 4 cyl 5 gears
9 carb                  0             1   2     0      2      2      2      2      2    ▁▁▇▁▁ 4 cyl 5 gears

$"6 cyl - 3 gears"
── Data Summary ────────────────────────
Values    
Name                       Piped data
Number of rows             32        
Number of columns          11        
_______________________              
Column type frequency:               
    numeric                  9         
________________________             
Group variables            None      

── Variable type: numeric ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
skim_variable n_missing complete_rate   mean     sd     p0    p25    p50    p75   p100 hist  cyl   gear   
1 mpg                   0             1  19.8   2.33   18.1   18.9   19.8   20.6   21.4  ▇▁▁▁▇ 6 cyl 3 gears
2 disp                  0             1 242.   23.3   225    233.   242.   250.   258    ▇▁▁▁▇ 6 cyl 3 gears
3 hp                    0             1 108.    3.54  105    106.   108.   109.   110    ▇▁▁▁▇ 6 cyl 3 gears
4 drat                  0             1   2.92  0.226   2.76   2.84   2.92   3      3.08 ▇▁▁▁▇ 6 cyl 3 gears
5 wt                    0             1   3.34  0.173   3.22   3.28   3.34   3.40   3.46 ▇▁▁▁▇ 6 cyl 3 gears
6 qsec                  0             1  19.8   0.552  19.4   19.6   19.8   20.0   20.2  ▇▁▁▁▇ 6 cyl 3 gears
7 vs                    0             1   1     0       1      1      1      1      1    ▁▁▇▁▁ 6 cyl 3 gears
8 am                    0             1   0     0       0      0      0      0      0    ▁▁▇▁▁ 6 cyl 3 gears
9 carb                  0             1   1     0       1      1      1      1      1    ▁▁▇▁▁ 6 cyl 3 gears

$"6 cyl - 4 gears"
── Data Summary ────────────────────────
Values    
Name                       Piped data
Number of rows             32        
Number of columns          11        
_______________________              
Column type frequency:               
    numeric                  9         
________________________             
Group variables            None      

── Variable type: numeric ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
skim_variable n_missing complete_rate   mean     sd     p0    p25    p50    p75   p100 hist  cyl   gear   
1 mpg                   0             1  19.8  1.55    17.8   18.8   20.1   21     21    ▃▁▃▁▇ 6 cyl 4 gears
2 disp                  0             1 164.   4.39   160    160    164.   168.   168.   ▇▁▁▁▇ 6 cyl 4 gears
3 hp                    0             1 116.   7.51   110    110    116.   123    123    ▇▁▁▁▇ 6 cyl 4 gears
4 drat                  0             1   3.91 0.0115   3.9    3.9    3.91   3.92   3.92 ▇▁▁▁▇ 6 cyl 4 gears
5 wt                    0             1   3.09 0.413    2.62   2.81   3.16   3.44   3.44 ▃▃▁▁▇ 6 cyl 4 gears
6 qsec                  0             1  17.7  1.12    16.5   16.9   17.7   18.5   18.9  ▇▇▁▇▇ 6 cyl 4 gears
7 vs                    0             1   0.5  0.577    0      0      0.5    1      1    ▇▁▁▁▇ 6 cyl 4 gears
8 am                    0             1   0.5  0.577    0      0      0.5    1      1    ▇▁▁▁▇ 6 cyl 4 gears
9 carb                  0             1   4    0        4      4      4      4      4    ▁▁▇▁▁ 6 cyl 4 gears

$"6 cyl - 5 gears"
── Data Summary ────────────────────────
Values    
Name                       Piped data
Number of rows             32        
Number of columns          11        
_______________________              
Column type frequency:               
    numeric                  9         
________________________             
Group variables            None      

── Variable type: numeric ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
skim_variable n_missing complete_rate   mean    sd     p0    p25    p50    p75   p100 hist  cyl   gear   
1 mpg                   0             1  19.7     NA  19.7   19.7   19.7   19.7   19.7  ▁▁▇▁▁ 6 cyl 5 gears
2 disp                  0             1 145       NA 145    145    145    145    145    ▁▁▇▁▁ 6 cyl 5 gears
3 hp                    0             1 175       NA 175    175    175    175    175    ▁▁▇▁▁ 6 cyl 5 gears
4 drat                  0             1   3.62    NA   3.62   3.62   3.62   3.62   3.62 ▁▁▇▁▁ 6 cyl 5 gears
5 wt                    0             1   2.77    NA   2.77   2.77   2.77   2.77   2.77 ▁▁▇▁▁ 6 cyl 5 gears
6 qsec                  0             1  15.5     NA  15.5   15.5   15.5   15.5   15.5  ▁▁▇▁▁ 6 cyl 5 gears
7 vs                    0             1   0       NA   0      0      0      0      0    ▁▁▇▁▁ 6 cyl 5 gears
8 am                    0             1   1       NA   1      1      1      1      1    ▁▁▇▁▁ 6 cyl 5 gears
9 carb                  0             1   6       NA   6      6      6      6      6    ▁▁▇▁▁ 6 cyl 5 gears

$"8 cyl - 3 gears"
── Data Summary ────────────────────────
Values    
Name                       Piped data
Number of rows             32        
Number of columns          11        
_______________________              
Column type frequency:               
    numeric                  9         
________________________             
Group variables            None      

── Variable type: numeric ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
skim_variable n_missing complete_rate   mean     sd     p0    p25    p50    p75   p100 hist  cyl   gear   
1 mpg                   0             1  15.0   2.77   10.4   14.0   15.2   16.6   19.2  ▃▂▇▃▃ 8 cyl 3 gears
2 disp                  0             1 358.   71.8   276.   297.   355    410    472    ▇▃▃▂▆ 8 cyl 3 gears
3 hp                    0             1 194.   33.4   150    175    180    219.   245    ▃▇▂▂▅ 8 cyl 3 gears
4 drat                  0             1   3.12  0.230   2.76   3.05   3.08   3.16   3.73 ▃▇▆▁▂ 8 cyl 3 gears
5 wt                    0             1   4.10  0.768   3.44   3.56   3.81   4.36   5.42 ▇▃▁▁▃ 8 cyl 3 gears
6 qsec                  0             1  17.1   0.802  15.4   17.0   17.4   17.7   18    ▃▁▂▇▆ 8 cyl 3 gears
7 vs                    0             1   0     0       0      0      0      0      0    ▁▁▇▁▁ 8 cyl 3 gears
8 am                    0             1   0     0       0      0      0      0      0    ▁▁▇▁▁ 8 cyl 3 gears
9 carb                  0             1   3.08  0.900   2      2      3      4      4    ▆▁▅▁▇ 8 cyl 3 gears

$"8 cyl - 5 gears"
── Data Summary ────────────────────────
Values    
Name                       Piped data
Number of rows             32        
Number of columns          11        
_______________________              
Column type frequency:               
    numeric                  9         
________________________             
Group variables            None      

── Variable type: numeric ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
skim_variable n_missing complete_rate   mean      sd     p0    p25    p50    p75   p100 hist  cyl   gear   
1 mpg                   0             1  15.4   0.566   15     15.2   15.4   15.6   15.8  ▇▁▁▁▇ 8 cyl 5 gears
2 disp                  0             1 326    35.4    301    314.   326    338.   351    ▇▁▁▁▇ 8 cyl 5 gears
3 hp                    0             1 300.   50.2    264    282.   300.   317.   335    ▇▁▁▁▇ 8 cyl 5 gears
4 drat                  0             1   3.88  0.481    3.54   3.71   3.88   4.05   4.22 ▇▁▁▁▇ 8 cyl 5 gears
5 wt                    0             1   3.37  0.283    3.17   3.27   3.37   3.47   3.57 ▇▁▁▁▇ 8 cyl 5 gears
6 qsec                  0             1  14.6   0.0707  14.5   14.5   14.6   14.6   14.6  ▇▁▁▁▇ 8 cyl 5 gears
7 vs                    0             1   0     0        0      0      0      0      0    ▁▁▇▁▁ 8 cyl 5 gears
8 am                    0             1   1     0        1      1      1      1      1    ▁▁▇▁▁ 8 cyl 5 gears
9 carb                  0             1   6     2.83     4      5      6      7      8    ▇▁▁▁▇ 8 cyl 5 gears

`

michaelquinn32 commented 4 years ago

I think this is a good idea. Some comments:

  1. This is very similar to partition(). We could extend that function with a by_group argument, with a default value of FALSE.
  2. It would need its own bind() variant to reverse that operation.
  3. See CONTRIBUTING.md for some guidelines.
     a. If you can't get pre-commit to work, just be sure to call styler::style_pkg() before pushing the commits.
  4. Please don't include pipes in your functions. I hate debugging them.
  5. The step # get group_names to name list after group_split is a pmap_chr operation.
  6. Switch from lapply to purrr. We have a function for assigning skim attributes that might be helpful here: https://github.com/ropensci/skimr/blob/115d714612992bc3fe86d9fc526a57f5f610ba8e/R/skim_obj.R#L160
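
As a rough illustration of point 5, the group labels can be built without pipes by mapping over the rows of the group keys. This is only a sketch: the `keys` data frame is a hand-made stand-in for the distinct group combinations, and `group_labels` is a name chosen here, not anything from skimr.

```r
library(purrr)

# Hand-made stand-in for the distinct group combinations (an assumption,
# not taken from a real skim_df)
keys <- data.frame(
  cyl = c("4 cyl", "4 cyl", "6 cyl"),
  gear = c("3 gears", "4 gears", "3 gears"),
  stringsAsFactors = FALSE
)

# pmap_chr() walks the rows of `keys`, passing one value per column to the
# function, which pastes them into a single label per group
group_labels <- pmap_chr(keys, function(...) paste(..., sep = " - "))

group_labels
```

The resulting character vector could then be passed to setNames() or rlang::set_names() in place of the piped group_keys() and unite() step.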

Otherwise, we can go into more details during a PR. @elinw How does this sound to you?

elinw commented 4 years ago

Hi, I think it is an interesting concept for sure, and it's not so far from what skim does that it would violate the "do one thing well" concept.

You might want to also look here

https://github.com/elinw/skimrExtra
https://github.com/elinw/skimrextra/blob/master/R/skim_with.R#L8

The idea is to be able to create a variable table like those in many articles.

I also wonder if looking at the codebook package, which uses skimr extensively, might be a way to go, keeping things more modular.

elinw commented 4 years ago

One interesting aspect of this is that it includes a separate summary for each group. I have been thinking a lot about the summary and why, to keep it out of the printed output, you currently have to call a print function explicitly. This is a broad conundrum in R, and piping doesn't really play well with it at the moment.

I'm thinking that this output could be achieved more simply using our current vocabulary. The point is that this approaches the skim-by-group problem from a different angle, one that was actually discussed initially at the #unconf and that @GShotwell had a lot of ideas about. At the time the whole model was different, but now that we have the one big data frame, I do think we should revisit the option of taking the grouped output and returning it in list format by group, and then maybe giving it a nice print method instead of just using the list print.

Please keep in mind that mtcars is a terrible sample data set for skimr because it has only one type of variable. Use something like CO2 instead to really understand what an output would look like.
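
A quick base R check makes the difference between the two data sets concrete. This is a minimal sketch using only the built-in datasets; the names `mtcars_types` and `co2_types` are chosen here for illustration.

```r
# First class of every column in each data set
mtcars_types <- vapply(mtcars, function(x) class(x)[1], character(1))
co2_types <- vapply(CO2, function(x) class(x)[1], character(1))

# mtcars has a single column type, so a skim of it only ever shows one
# "Variable type" block
table(mtcars_types)

# CO2 mixes factor and numeric columns, so a skim of it has several blocks
table(co2_types)
```

Something like skim(dplyr::group_by(CO2, Type)) would then exercise both the factor and the numeric skimmers within each group.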

elinw commented 4 years ago

Are you definitely committed to having the summary by group? We already support calculating the statistics on groups, and after playing with this a bit I think it is actually not hard to take the skim object for grouped data and pull out the subtables for the top n grouping levels (leaving any other grouping in place for those). Also remember that we already store the group names as an attribute.

elinw commented 4 years ago

This

#' @export
grouped_skims <- function(data, depth = 1) {
  assert_is_skim_df(data)
  groups <- group_names(data)

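  if (length(groups) < 1) {
    stop("Must have at least one group to use grouped_skims")
  }
  # The first `depth` grouping variables define the subtables
  group_vars <- as.character(groups[seq_len(depth)])
  # Any deeper grouping variables stay in place. ifelse() cannot return NULL,
  # and `depth+1: length(groups)` parses as depth + (1:length(groups)).
  groups <- if (depth < length(groups)) {
    groups[(depth + 1):length(groups)]
  } else {
    NULL
  }
  data$selected_groups <- do.call(paste, c(data[group_vars], sep = "_"))
  data <- dplyr::select(data, -dplyr::all_of(group_vars))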
  base <- base_skimmers(data)
  data_as_list <- split(data, data$selected_groups)

  skimmers <- reconcile_skimmers(data, groups, base)
  reduced <- purrr::imap(data_as_list, simplify_skimdf, skimmers, groups, base)
  data_as_list <- purrr::map( data_as_list, dplyr::select, -selected_groups)

  reassign_skim_attrs(
    reduced,
    data,
    class = "skim_list",
    skimmers_used = skimmers
  )
  data_as_list
}

Gets you what you are getting now ... and I think that's a start. We'd need @michaelquinn32 to come up with a good verb for this. The one issue is that the nrows for the summary is wrong, but do we want the summary here at all? I think we would really want to consider the print method: right now it is doing the skim print on the skim object in each element of the list, but I think it would be cleaner to return the tibble/data frame as e.g. results$virginica$skim_df and then results$virginica$summary. People will be confused by the print here. On the other hand, maybe we just make a nicer print function that doesn't look like a list.
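
To make the results$virginica$skim_df / results$virginica$summary layout concrete, here is a base R sketch of that list shape. It does not use skimr at all: `summarize_group` is a hypothetical helper, the "skim_df" element is a plain data frame rather than a real skim_df, and only mean and sd are computed.

```r
# Numeric columns of iris, split by the grouping variable
numeric_cols <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
by_species <- split(iris[numeric_cols], iris$Species)

# Hypothetical helper: one list element per group, holding the stats table
# and a small summary whose row count refers to that group only
summarize_group <- function(group_data) {
  stats <- data.frame(
    skim_variable = names(group_data),
    mean = vapply(group_data, mean, numeric(1)),
    sd = vapply(group_data, sd, numeric(1)),
    row.names = NULL
  )
  list(
    skim_df = stats,
    summary = list(n_rows = nrow(group_data), n_cols = ncol(group_data))
  )
}

results <- lapply(by_species, summarize_group)

results$virginica$skim_df
results$virginica$summary$n_rows
```

Keeping the summary next to each subtable means the row count is computed per group, which would sidestep the wrong nrows mentioned above.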

Updated with cleaner code.

elinw commented 4 years ago

@michaelquinn32 and I discussed this, and he proposed more of a tidyverse-style semantic: the skimmed object is piped to a function that identifies the groups to be used for creating the subtables, and then it is piped to a variation of partition() based on those groups. I think this could work. We'd have to add an attribute for these groups and probably modify the groups attribute and the main skim data frame object.
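
The two-step semantic could look roughly like the base R sketch below. Everything here is an assumption for illustration: `mark_split_groups()` and `partition_groups()` are hypothetical names, the "split_groups" attribute does not exist in skimr, and `skimmed` is a small hand-built data frame standing in for a grouped skim result.

```r
# Stand-in for a grouped skim result: one row per variable and group
skimmed <- data.frame(
  skim_variable = rep(c("mpg", "hp"), each = 3),
  cyl = rep(c("4", "6", "8"), times = 2),
  mean = unname(c(
    tapply(mtcars$mpg, mtcars$cyl, mean),
    tapply(mtcars$hp, mtcars$cyl, mean)
  )),
  stringsAsFactors = FALSE
)

# Hypothetical verb 1: record which grouping columns define the subtables
mark_split_groups <- function(skimmed, groups) {
  stopifnot(all(groups %in% names(skimmed)))
  attr(skimmed, "split_groups") <- groups
  skimmed
}

# Hypothetical verb 2: a partition() variant that splits on the recorded
# columns and drops them from each subtable
partition_groups <- function(skimmed) {
  groups <- attr(skimmed, "split_groups")
  stopifnot(!is.null(groups))
  keep <- setdiff(names(skimmed), groups)
  split(skimmed[keep], skimmed[groups], drop = TRUE)
}

# Equivalent to piping skimmed through the two verbs in turn
marked <- mark_split_groups(skimmed, "cyl")
parts <- partition_groups(marked)
names(parts)
```

A convenience function combining both steps would then just call the second verb on the result of the first.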

Then maybe we can make a convenience function that combines all the steps.

kylebutts commented 4 years ago

I like the idea of including it in partition() as well. The basic reason for my request was to make it easier to build a table that has group names as the columns and the stats as the rows, rather than the tidy format where everything is a row. How can I help with this code?

elinw commented 4 years ago

Wait group names as columns? Can you show what you mean? Just make an example table/rough sketch.

I think I'm pretty far along in thinking about how to do one model of this that is more like partition (still rows but broken out by groups).

kylebutts commented 4 years ago

Yes, sorry I was unclear! I was trying to end up with a table like this and figured the easiest way would be to split by group into a list and then lapply to make each column. So we are thinking of the same idea (still rows but broken out by groups), but I would then do an extra step to get a table like the one below.

image
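
The split-then-apply idea can be sketched in base R for a single variable. This is not skimr output: the statistics (mean, sd, n) and the name `wide` are chosen here purely to show the shape, with groups as columns and stats as rows.

```r
# Split one variable by group, giving a list with one element per group
mpg_by_cyl <- split(mtcars$mpg, mtcars$cyl)

# Each group becomes one column; each statistic becomes one row
wide <- sapply(mpg_by_cyl, function(x) {
  c(mean = mean(x), sd = sd(x), n = length(x))
})

# Columns are the cyl groups ("4", "6", "8"); rows are mean, sd and n
round(wide, 2)
```

Repeating this per variable and stacking the results with rbind() would give the full groups-as-columns table.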

elinw commented 1 year ago

Way back when we started skimr we played around with this a lot, but because skimr uses data frames and each column of a data frame must be a single data type, it is quite complex, which is why there are so many packages for making nice tables. I think what we could do is concentrate on producing data that would be processed by those.