tidyverse / dplyr

dplyr: A grammar of data manipulation
https://dplyr.tidyverse.org/

A better summary function #1514

Closed bjornerstedt closed 8 years ago

bjornerstedt commented 8 years ago

I think that dplyr would benefit from having a function summarizing the data frame variables. It is surprising that the R base package has nothing better than the summary function to provide an overview of a data frame. In dplyr one can look at the data with for example glimpse or head, but a concise display of key summary statistics would make data management easier. Summary statistics can provide more information than the raw data. For example one way to see that a join does not work is to look at the number of NA values.
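To make the join example concrete, here is a minimal base R sketch. The two toy tables (`orders`, `customers`) are made up for illustration and are not from this issue; the point is only that a per-column NA count exposes keys that found no match.

```r
# Hypothetical toy data: two of the four order ids have no matching customer.
orders    <- data.frame(id = c(1, 2, 3, 4), amount = c(10, 20, 30, 40))
customers <- data.frame(id = c(1, 2, 5), region = c("north", "south", "east"))

# Left join in base R: keep every row of `orders`, fill unmatched rows with NA.
joined <- merge(orders, customers, by = "id", all.x = TRUE)

# Missing values per column; a non-zero count in `region` flags failed matches.
na_counts <- colSums(is.na(joined))
print(na_counts)
```

A summary function that always reports such a count would surface this problem without an extra step.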

I have a small function describe() available here that shows what I mean. It takes the same arguments as dplyr::select(), and produces summary statistics as a data frame.

describe(iris)

##                  mean min max   n   factor
## Sepal.Length 5.843333 4.3 7.9 150         
## Sepal.Width  3.057333 2.0 4.4 150         
## Petal.Length 3.758000 1.0 6.9 150         
## Petal.Width  1.199333 0.1 2.5 150         
## Species      2.000000 1.0 3.0 150 3 levels

describe(iris, matches("Sep"))

##                  mean min max   n
## Sepal.Length 5.843333 4.3 7.9 150
## Sepal.Width  3.057333 2.0 4.4 150

The output is not only useful for looking at the data in R. It is also close to what is often shown as Table 1 in journal articles (at least in applied econometrics).
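For readers without access to the linked function, the following is a rough base R sketch of the idea. It is not the original describe(): the name `describe_numeric` is hypothetical, it handles numeric columns only, and it does not accept select()-style arguments.

```r
# Sketch only (not the describe() from this issue): one row per numeric
# column, with mean, min, max and the number of non-missing values.
describe_numeric <- function(df) {
  num <- df[vapply(df, is.numeric, logical(1))]  # keep numeric columns
  data.frame(
    mean = vapply(num, mean, numeric(1), na.rm = TRUE),
    min  = vapply(num, min,  numeric(1), na.rm = TRUE),
    max  = vapply(num, max,  numeric(1), na.rm = TRUE),
    n    = vapply(num, function(x) sum(!is.na(x)), numeric(1))
  )
}

res <- describe_numeric(iris)
print(res)
```

Because the result is an ordinary data frame with the variable names as row names, it can be subset or exported like any other table.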

huftis commented 8 years ago

I’m not sure dplyr is the right package for this. There are already nice functions in other packages doing what you want, e.g. describe() in the psych package, describe() in the Hmisc package, Desc() (and friends) in the DescTools package and various functions in the tableone package.

bjornerstedt commented 8 years ago

Of course this has been done before. I used a modified version of psych::describe() to reduce output before writing my own. I do think, however, that the functionality is closer to dplyr than psychometrics.

The function is only a couple of lines, giving a smaller output and eliminating the need to load another package. By the same logic the glimpse() function is unnecessary, as there are str() and View() functions. Quickly getting a small output of the most important summary statistics really belongs to the central functionality of R.

hadley commented 8 years ago

I think the main challenge is to think through what makes sense for logical, character, factor, and date/time variables, and how many different columns you end up needing.

@rpruim any thoughts? Does mosaic have something like this?

bjornerstedt commented 8 years ago

Here is my opinion on this. (My code is not perfect. :-))

So I would suggest the same fields as in the current function, perhaps with the addition of a NA count column if any exist.

rpruim commented 8 years ago

mosaic doesn't currently have anything for summarizing an entire data frame in this way, but we could certainly think about adding such a thing.

We do provide dfapply() that can apply a function to all variables in a data frame that match some criterion (by default all the numeric variables):

dfapply(KidsFeet, favstats)
## $birthmonth
##  min Q1 median Q3 max     mean      sd  n missing
##    1  3      6  9  12 6.102564 3.36229 39       0
## 
## $birthyear
##  min Q1 median Q3 max     mean        sd  n missing
##   87 88     88 88  88 87.82051 0.3887764 39       0
## 
## $length
##   min Q1 median   Q3  max     mean       sd  n missing
##  21.6 24   24.5 25.6 27.5 24.72308 1.317586 39       0
## 
## $width
##  min   Q1 median   Q3 max     mean        sd  n missing
##  7.9 8.65      9 9.35 9.8 8.992308 0.5095843 39       0

This makes it possible to do this:

do.call(rbind, dfapply(KidsFeet, favstats))
##             min    Q1 median    Q3  max      mean        sd  n missing
## birthmonth  1.0  3.00    6.0  9.00 12.0  6.102564 3.3622899 39       0
## birthyear  87.0 88.00   88.0 88.00 88.0 87.820513 0.3887764 39       0
## length     21.6 24.00   24.5 25.60 27.5 24.723077 1.3175858 39       0
## width       7.9  8.65    9.0  9.35  9.8  8.992308 0.5095843 39       0

(@hadley: using bind_rows() loses the variable name information because you don't grab row names from the list names -- should dplyr::bind_rows() do that?)

If favstats() were replaced by something more general (a generic function that had methods for each type of variable), then we could do

dfapply( data, inspect, select = TRUE)

to get a list of summaries, which could perhaps be wrapped up into something for a better display.

Tabular displays are tricky, however, since the various data types should have different summaries. (I do not like applying standard numerical summaries to factors. I'd rather know things like how many levels, proportions of most common levels, etc.) So other than n and missing, I'm not sure there are any columns that make sense over all types of variables.

One option would be to have separate tables for each variable type.
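A minimal base R sketch of the "separate tables per variable type" option follows. The function name `summarise_by_type` and the choice of columns are assumptions made for illustration; this is not mosaic code.

```r
# Sketch: one table for numeric columns, one for factor/character columns.
summarise_by_type <- function(df) {
  is_num <- vapply(df, is.numeric, logical(1))
  is_cat <- vapply(df, function(x) is.factor(x) || is.character(x), logical(1))

  quantitative <- data.frame(
    name    = names(df)[is_num],
    mean    = vapply(df[is_num], mean, numeric(1), na.rm = TRUE),
    sd      = vapply(df[is_num], sd,   numeric(1), na.rm = TRUE),
    missing = vapply(df[is_num], function(x) sum(is.na(x)), numeric(1)),
    row.names = NULL
  )

  categorical <- data.frame(
    name    = names(df)[is_cat],
    # number of distinct non-missing values instead of a mean
    levels  = vapply(df[is_cat], function(x) length(unique(x[!is.na(x)])), numeric(1)),
    missing = vapply(df[is_cat], function(x) sum(is.na(x)), numeric(1)),
    row.names = NULL
  )

  list(quantitative = quantitative, categorical = categorical)
}

tables <- summarise_by_type(iris)
print(tables)
```

Only `name` and `missing` appear in both tables, which matches the observation that few columns make sense across all variable types.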

rpruim commented 8 years ago

@hadley, is there a fundamental reason why something like

data %>%
  summarise( favstats(variable))

can't be made to work? I'm imagining a use case where the function (favstats here) returns a named vector or a 1-row data frame in such a way that everything can be wrapped together with a call to rbind() or bind_rows()? It would sure be useful.

hadley commented 8 years ago

@rpruim see #154 - I'm not fundamentally opposed to it, we just haven't worked out a nice interface for it (and whether it's different enough from summarise to be its own verb). The decision to drop row names in bind_rows() is deliberate - it's to force you to store useful information as variables, which makes it easier to (e.g.) apply favstats by group, or to plot the results.
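The "store it as a variable" point can be sketched in base R as below. The helper `five_stats` is a hypothetical stand-in for favstats(), used only so the example runs without mosaic.

```r
# Hypothetical stand-in for favstats(): returns a 1-row data frame.
five_stats <- function(x) {
  data.frame(min = min(x), median = median(x), max = max(x), n = length(x))
}

num_cols <- iris[vapply(iris, is.numeric, logical(1))]
pieces   <- lapply(num_cols, five_stats)

# rbind() keeps the list names, but only as row names.
stacked <- do.call(rbind, pieces)

# Promote the row names to an ordinary column so they survive later steps.
tidy <- data.frame(variable = rownames(stacked), stacked, row.names = NULL)
print(tidy)

# With the name stored as a variable, filtering is a plain column comparison.
print(tidy[tidy$variable == "Sepal.Length", ])
```

In current dplyr versions, bind_rows() also accepts an `.id` argument that writes the list names into a column, which gives the same result without going through row names.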

rpruim commented 8 years ago

Here's a proof of concept

inspect(Births78)
## 
## categorical variables:  
##   name   class levels missing   n                                  distribution
## 1 wday ordered      7       0 365 Sun (14.5%), Mon (14.2%), Tues (14.2%) ...   
## 
## quantitative variables:  
##        name   class  min   Q1 median   Q3   max mean  sd   n missing
## 1    births integer 7135 8554   9218 9705 10711 9132 818 365       0
## 2 dayofyear integer    1   92    183  274   365  183 106 365       0
## 
## time variables:  
##   name   class      first       last min_diff max_diff missing   n
## 1 date POSIXct 1978-01-01 1978-12-31        1        1       0 365

See https://github.com/ProjectMOSAIC/mosaic/issues/544 for further developments.

bjornerstedt commented 8 years ago

Here is a modified version of the describe() function:

> data_frame(
+   txt=c("Hello","world!"),
+   txt2=c("oj", NA),
+   num=as.logical(c(0,1)),
+   t = as.Date(c("2015-12-12","2011-12-12"))
+   )  %>% describe()
Source: local data frame [4 x 7]

  vars   type    mean           sd   min   max n
1  txt  (chr)     5.5    0.7071068     5     6 2
2 txt2  (chr)     2.0           NA     2     2 1
3  num  (lgl)     0.5    0.7071068     0     1 2
4    t (date) 16050.5 1033.0830073 15320 16781 2

jjchern commented 8 years ago

I would also suggest adding a column to display variable labels, or even another column to display whether a variable has value labels (from the haven package). I've made a small package meda to do just that.

# devtools::install_github("larmarange/labelled")
# devtools::install_github("jjchern/meda")

> library(meda)
> nlsw88 = haven::read_dta("http://www.stata-press.com/data/r13/nlsw88.dta")
> 
> cb(nlsw88) # `cb` is short for `codebook`, and it shows summary statistics of a data frame
Source: local data frame [17 x 8]

             var   obs unique    mean std.dev   min     max               var_label
           (chr) (int)  (int)   (dbl)   (dbl) (dbl)   (dbl)                   (chr)
1         idcode  2246   2246 2612.65 1480.86  1.00 5159.00                  NLS id
2            age  2246     13   39.15    3.06 34.00   46.00     age in current year
3           race  2246      3    1.28    0.48  1.00    3.00                    race
4        married  2246      2    0.64    0.48  0.00    1.00                 married
5  never_married  2246      2    0.10    0.31  0.00    1.00           never married
6          grade  2244     16   13.10    2.52  0.00   18.00 current grade completed
7       collgrad  2246      2    0.24    0.43  0.00    1.00        college graduate
8          south  2246      2    0.42    0.49  0.00    1.00          lives in south
9           smsa  2246      2    0.70    0.46  0.00    1.00           lives in SMSA
10        c_city  2246      2    0.29    0.45  0.00    1.00   lives in central city
11      industry  2232     12    8.19    3.01  1.00   12.00                industry
12    occupation  2237     13    4.64    3.41  1.00   13.00              occupation
13         union  1878      2    0.25    0.43  0.00    1.00            union worker
14          wage  2246    967    7.77    5.76  1.00   40.75             hourly wage
15         hours  2242     62   37.22   10.51  1.00   80.00      usual hours worked
16       ttl_exp  2246   1546   12.53    4.61  0.12   28.88   total work experience
17        tenure  2231    259    5.98    5.51  0.00   25.92      job tenure (years)
> 
> d(nlsw88) # `d` is short for `describe` and it shows variable labels, and whether value labels exist for certain variables
Source: local data frame [17 x 6]

             var  type class val_label                     label                      head
           (chr) (chr) (chr)     (lgl)                     (chr)                     (chr)
1         idcode   int   int     FALSE                    NLS id              1 2 3 4 6...
2            age   int   int     FALSE       age in current year         37 37 42 43 42...
3           race   int   lbl      TRUE                      race              2 2 2 1 1...
4        married   int   lbl      TRUE                   married              0 0 0 1 1...
5  never_married   int   int     FALSE             never married              0 0 1 0 0...
6          grade   int   int     FALSE current grade complete...         12 12 12 17 12...
7       collgrad   int   lbl      TRUE          college graduate              0 0 0 1 0...
8          south   int   int     FALSE            lives in south              0 0 0 0 0...
9           smsa   int   lbl      TRUE             lives in SMSA              1 1 1 1 1...
10        c_city   int   int     FALSE  lives in central city...              0 1 1 0 0...
11      industry   int   lbl      TRUE                  industry             5 4 4 11 4...
12    occupation   int   lbl      TRUE                occupation             6 5 3 13 6...
13         union   int   lbl      TRUE              union worker             1 1 NA 1 0...
14          wage   dbl   nmr     FALSE               hourly wage 11.73912525177 6.40096...
15         hours   int   int     FALSE        usual hours worked         48 40 40 42 48...
16       ttl_exp   dbl   nmr     FALSE  total work experience... 10.3333339691162 13.62...
17        tenure   dbl   nmr     FALSE        job tenure (years) 5.33333349227905 5.25 ...
> 
> # Note that there's a value label for the variable "race", thus we can checkout the values
> labelled::val_labels(nlsw88$race)
white black other 
    1     2     3 

(Sorry for the command prompts and not using reprex(); it's nice to see some color.)

rpruim commented 8 years ago

Not a fan of cryptic 1- and 2-letter function names. If everyone did that, we would have even more name collisions than we already have.

Regarding the display of labels, I'm not sure they belong here as they tend to be long and so steal space from the other things we want to see. But I'm still deciding just what output to include.

meda::cb() breaks for me on data frames with factors reporting:

Error in Summary.factor(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L,  : 
  ‘min’ not meaningful for factors
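
The error is easy to reproduce in base R, and one possible guard is to compute numeric summaries only for numeric columns. The helper name `safe_min` is hypothetical; this is a sketch of one workaround, not the fix that went into meda.

```r
f <- factor(c("a", "b", "a"))

# min() on an unordered factor raises an error; capture its message.
err <- tryCatch(min(f), error = function(e) conditionMessage(e))
print(err)

# Guard: return NA instead of failing when the column is not numeric.
safe_min <- function(x) {
  if (is.numeric(x)) min(x, na.rm = TRUE) else NA_real_
}

print(safe_min(f))
print(safe_min(c(3, 1, 2)))
```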

bjornerstedt commented 8 years ago

The default output should be as simple as possible, I think. Things like labels, quartiles and medians are good to have at times, but to me they are optional. Standard deviation is also something that I think should be optional. It is necessary in printed output (as are labels), but when working with data, what I look for are things like strange means or many NA values. I don't think std is something I look at. But as I said, this is only my opinion...

hadley commented 8 years ago

I think there's enough difference of opinion here to suggest that it's best such a function live in another package.

jjchern commented 8 years ago

I agree that short function names are really bad. Coming from Stata, I've gotten used to short commands and seeing quick results. I'll fix the error for factors.

I also agree that the default output should be as simple as possible. Ultimately, I would really like to see something in dplyr that is better than the summary function, even if it is just an improved layout.

rpruim commented 8 years ago

"As simple as possible, but no simpler." (Einstein)

Sacrificing reasonableness or usefulness for simplicity is probably not a good trade. Computing means of factors is mostly silly. Treating everything as numeric is not a good direction. So I think you are left with separate treatment for (groups of) types of variables, or a summary that includes only things that always make sense (type, class, n, missing, etc) much like meda::d().
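The second option, a summary restricted to things that always make sense, might look like the base R sketch below. The function name `overview` and its exact columns are assumptions for illustration; it is not the meda or mosaic implementation.

```r
# Sketch: one row per column, using only type-agnostic information.
overview <- function(df) {
  data.frame(
    name    = names(df),
    type    = vapply(df, typeof, character(1)),
    class   = vapply(df, function(x) class(x)[1], character(1)),
    n       = vapply(df, function(x) sum(!is.na(x)), numeric(1)),
    missing = vapply(df, function(x) sum(is.na(x)), numeric(1)),
    row.names = NULL
  )
}

ov <- overview(airquality)
print(ov)
```

Nothing here depends on the column being numeric, so factors, characters and dates can all share one table.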

As @hadley suggests, there may be multiple takes on what such a summary should include and how it should be formatted. Indeed there are already a number of these floating around -- including one more now that I've created one in mosaic.

One advantage of having it in dplyr would be that it might be designed to work with non-local objects. Otherwise, I agree with @hadley that this isn't necessarily a job for dplyr.

bjornerstedt commented 8 years ago

Thanks for discussing the issue. I hope that the proof of concept of @rpruim is included in the mosaic package. Even though his suggestion is not as good as mine :-), it is already much better than the alternatives.

adriaaula commented 6 years ago

If anyone arrives here in 2017, there is also the skimr package:

https://ropensci.org/blog/2017/07/11/skimr/