Closed bjornerstedt closed 8 years ago
I’m not sure dplyr is the right package for this. There are already nice functions in other packages doing what you want, e.g. describe() in the psych package, describe() in the Hmisc package, Desc() (and friends) in the DescTools package and various functions in the tableone package.
Of course this has been done before. I used a modified version of psych::describe() to reduce output before writing my own. I do think, however, that the functionality is closer to dplyr than psychometrics.
The function is only a couple of lines, giving a smaller output and eliminating the need to load another package. By the same logic the glimpse() function is unnecessary, as there are str() and view() functions. Quickly getting a small output of the most important summary statistics really belongs to the central functionality of R.
I think the main challenge is to think through what makes sense for logical, character, factor, and date/time variables, and how many different columns you end up needing.
@rpruim any thoughts? Does mosaic have something like this?
Here is my opinion on this. (My code is not perfect. :-))
So I would suggest the same fields as in the current function, perhaps with the addition of a NA count column if any exist.
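A minimal base R sketch of what such a summary could look like, assuming the fields are mean, sd, min, max and n as in the describe() output shown later in this thread. The name describe_sketch and the n_missing column are hypothetical, and unlike the function under discussion this version only handles numeric columns:

```r
# Hypothetical helper for illustration; not the describe() from this thread.
describe_sketch <- function(df) {
  # Keep only numeric (integer or double) columns.
  num <- df[vapply(df, is.numeric, logical(1))]
  out <- data.frame(
    vars      = names(num),
    mean      = vapply(num, mean, numeric(1), na.rm = TRUE),
    sd        = vapply(num, sd, numeric(1), na.rm = TRUE),
    min       = vapply(num, min, numeric(1), na.rm = TRUE),
    max       = vapply(num, max, numeric(1), na.rm = TRUE),
    n         = vapply(num, function(x) sum(!is.na(x)), integer(1)),
    n_missing = vapply(num, function(x) sum(is.na(x)), integer(1)),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
  # Only show the NA count column when at least one value is missing.
  if (all(out$n_missing == 0)) out$n_missing <- NULL
  out
}

demo <- data.frame(
  x = c(1, 2, NA, 4),
  y = c(10L, 20L, 30L, 40L),
  g = c("a", "b", "a", "b")
)
result <- describe_sketch(demo)
print(result)
```

Dropping the NA column when nothing is missing keeps the default output narrow, at the cost of a column set that varies between data frames.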
mosaic doesn't currently have anything for summarizing an entire data frame in this way, but we could certainly think about adding such a thing.
We do provide dfapply() that can apply a function to all variables in a data frame that match some criterion (by default all the numeric variables):
dfapply(KidsFeet, favstats)
## $birthmonth
## min Q1 median Q3 max mean sd n missing
## 1 3 6 9 12 6.102564 3.36229 39 0
##
## $birthyear
## min Q1 median Q3 max mean sd n missing
## 87 88 88 88 88 87.82051 0.3887764 39 0
##
## $length
## min Q1 median Q3 max mean sd n missing
## 21.6 24 24.5 25.6 27.5 24.72308 1.317586 39 0
##
## $width
## min Q1 median Q3 max mean sd n missing
## 7.9 8.65 9 9.35 9.8 8.992308 0.5095843 39 0
This makes it possible to do this:
do.call(rbind, dfapply(KidsFeet, favstats))
## min Q1 median Q3 max mean sd n missing
## birthmonth 1.0 3.00 6.0 9.00 12.0 6.102564 3.3622899 39 0
## birthyear 87.0 88.00 88.0 88.00 88.0 87.820513 0.3887764 39 0
## length 21.6 24.00 24.5 25.60 27.5 24.723077 1.3175858 39 0
## width 7.9 8.65 9.0 9.35 9.8 8.992308 0.5095843 39 0
(@hadley: using bind_rows() loses the variable name information because you don't grab row names from the list names -- should dplyr::bind_rows() do that?)
If favstats() were replaced by something more general (a generic function that had methods for each type of variable), then we could do
dfapply( data, inspect, select = TRUE)
to get a list of summaries, which could perhaps be wrapped up into something for a better display.
Tabular displays are tricky, however, since the various data types should have different summaries. (I do not like applying standard numerical summaries to factors. I'd rather know things like how many levels, proportions of most common levels, etc.) So other than n and missing, I'm not sure there are any columns that make sense over all types of variables.
One option would be to have separate tables for each variable type.
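A rough base R sketch of that idea, using an S3 generic with one method per variable type and one table per group of types. The names summarise_var and summarise_by_type are hypothetical, and this is not how mosaic implements anything:

```r
# Hypothetical generic: each method returns a one-row data frame.
summarise_var <- function(x, ...) UseMethod("summarise_var")

# Integer and double vectors both dispatch here via the implicit class "numeric".
summarise_var.numeric <- function(x, ...) {
  data.frame(
    min = min(x, na.rm = TRUE), median = median(x, na.rm = TRUE),
    max = max(x, na.rm = TRUE),
    n = sum(!is.na(x)), missing = sum(is.na(x))
  )
}

# Factors get level counts instead of numerical summaries.
summarise_var.factor <- function(x, ...) {
  tab <- sort(table(x), decreasing = TRUE)
  data.frame(
    levels = nlevels(x), most_common = names(tab)[1],
    n = sum(!is.na(x)), missing = sum(is.na(x)),
    stringsAsFactors = FALSE
  )
}

# Fallback: only report things that make sense for any type.
summarise_var.default <- function(x, ...) {
  data.frame(
    class = class(x)[1],
    n = sum(!is.na(x)), missing = sum(is.na(x)),
    stringsAsFactors = FALSE
  )
}

# Build one table per group of variable types.
summarise_by_type <- function(df) {
  types <- vapply(df, function(x) {
    if (is.numeric(x)) "quantitative" else if (is.factor(x)) "categorical" else "other"
  }, character(1))
  lapply(split(names(df), types), function(vars) {
    out <- do.call(rbind, lapply(df[vars], summarise_var))
    data.frame(name = vars, out, row.names = NULL, stringsAsFactors = FALSE)
  })
}

res <- summarise_by_type(iris)
print(res)
```

Because every method in a group returns the same columns, the rows can be stacked safely, and n and missing appear in every table.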
@hadley, is there a fundamental reason why something like
data %>%
summarise( favstats(variable))
can't be made to work? I'm imagining a use case where the function (favstats here) returns a named vector or a 1-row data frame in such a way that everything can be wrapped together with a call to rbind() or bind_rows(). It would sure be useful.
@rpruim see #154 - I'm not fundamentally opposed to it, we just haven't worked out a nice interface for it (and whether it's different enough from summarise to be its own verb). The decision to drop row names in bind_rows() is deliberate - it's to force you to store useful information as variables, which would make it easier to (e.g.) apply favstats by group, or to plot the results.
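To illustrate the difference between row names and a proper variable, here is a small base R sketch. It uses mean and sd on the built-in iris data as a stand-in for favstats, and the object names are made up for the example:

```r
# One-row summary per numeric column; the list names carry the variable names.
stats <- lapply(iris[1:4], function(x) data.frame(mean = mean(x), sd = sd(x)))

# rbind() keeps the variable names, but only as row names.
by_rowname <- do.call(rbind, stats)

# Storing the names in a column makes them usable for filtering, grouping and plotting.
by_column <- data.frame(
  variable = names(stats), by_rowname,
  row.names = NULL, stringsAsFactors = FALSE
)
print(by_column)

# The name column survives subsetting like any other variable.
print(by_column[by_column$sd > 1, ])

# With dplyr installed, bind_rows(stats, .id = "variable") yields the same shape.
```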
Here's a proof of concept
inspect(Births78)
##
## categorical variables:
## name class levels missing n distribution
## 1 wday ordered 7 0 365 Sun (14.5%), Mon (14.2%), Tues (14.2%) ...
##
## quantitative variables:
## name class min Q1 median Q3 max mean sd n missing
## 1 births integer 7135 8554 9218 9705 10711 9132 818 365 0
## 2 dayofyear integer 1 92 183 274 365 183 106 365 0
##
## time variables:
## name class first last min_diff max_diff missing n
## 1 date POSIXct 1978-01-01 1978-12-31 1 1 0 365
See https://github.com/ProjectMOSAIC/mosaic/issues/544 for further developments.
Here is a modified version of the describe() function:
> data_frame(
+ txt=c("Hello","world!"),
+ txt2=c("oj", NA),
+ num=as.logical(c(0,1)),
+ t = as.Date(c("2015-12-12","2011-12-12"))
+ ) %>% describe()
Source: local data frame [4 x 7]
vars type mean sd min max n
1 txt (chr) 5.5 0.7071068 5 6 2
2 txt2 (chr) 2.0 NA 2 2 1
3 num (lgl) 0.5 0.7071068 0 1 2
4 t (date) 16050.5 1033.0830073 15320 16781 2
I would also suggest adding a column to display variable labels, or even another column to display whether a variable has value labels (from the haven package). I've made a small package meda to do just that.
# devtools::install_github("larmarange/labelled")
# devtools::install_github("jjchern/meda")
> library(meda)
> nlsw88 = haven::read_dta("http://www.stata-press.com/data/r13/nlsw88.dta")
>
> cb(nlsw88) # `cb` is short for `codebook`, and it shows summary statistics of a data frame
Source: local data frame [17 x 8]
var obs unique mean std.dev min max var_label
(chr) (int) (int) (dbl) (dbl) (dbl) (dbl) (chr)
1 idcode 2246 2246 2612.65 1480.86 1.00 5159.00 NLS id
2 age 2246 13 39.15 3.06 34.00 46.00 age in current year
3 race 2246 3 1.28 0.48 1.00 3.00 race
4 married 2246 2 0.64 0.48 0.00 1.00 married
5 never_married 2246 2 0.10 0.31 0.00 1.00 never married
6 grade 2244 16 13.10 2.52 0.00 18.00 current grade completed
7 collgrad 2246 2 0.24 0.43 0.00 1.00 college graduate
8 south 2246 2 0.42 0.49 0.00 1.00 lives in south
9 smsa 2246 2 0.70 0.46 0.00 1.00 lives in SMSA
10 c_city 2246 2 0.29 0.45 0.00 1.00 lives in central city
11 industry 2232 12 8.19 3.01 1.00 12.00 industry
12 occupation 2237 13 4.64 3.41 1.00 13.00 occupation
13 union 1878 2 0.25 0.43 0.00 1.00 union worker
14 wage 2246 967 7.77 5.76 1.00 40.75 hourly wage
15 hours 2242 62 37.22 10.51 1.00 80.00 usual hours worked
16 ttl_exp 2246 1546 12.53 4.61 0.12 28.88 total work experience
17 tenure 2231 259 5.98 5.51 0.00 25.92 job tenure (years)
>
> d(nlsw88) # `d` is short for `describe` and it shows variable labels, and whether value labels exist for certain variables
Source: local data frame [17 x 6]
var type class val_label label head
(chr) (chr) (chr) (lgl) (chr) (chr)
1 idcode int int FALSE NLS id 1 2 3 4 6...
2 age int int FALSE age in current year 37 37 42 43 42...
3 race int lbl TRUE race 2 2 2 1 1...
4 married int lbl TRUE married 0 0 0 1 1...
5 never_married int int FALSE never married 0 0 1 0 0...
6 grade int int FALSE current grade complete... 12 12 12 17 12...
7 collgrad int lbl TRUE college graduate 0 0 0 1 0...
8 south int int FALSE lives in south 0 0 0 0 0...
9 smsa int lbl TRUE lives in SMSA 1 1 1 1 1...
10 c_city int int FALSE lives in central city... 0 1 1 0 0...
11 industry int lbl TRUE industry 5 4 4 11 4...
12 occupation int lbl TRUE occupation 6 5 3 13 6...
13 union int lbl TRUE union worker 1 1 NA 1 0...
14 wage dbl nmr FALSE hourly wage 11.73912525177 6.40096...
15 hours int int FALSE usual hours worked 48 40 40 42 48...
16 ttl_exp dbl nmr FALSE total work experience... 10.3333339691162 13.62...
17 tenure dbl nmr FALSE job tenure (years) 5.33333349227905 5.25 ...
>
> # Note that there's a value label for the variable "race", thus we can check out the values
> labelled::val_labels(nlsw88$race)
white black other
1 2 3
(Sorry for the command prompts and not using reprex(); it's nice to see some color.)
Not a fan of cryptic 1- and 2-letter function names. If everyone did that, we would have even more name collisions than we already have.
Regarding the display of labels, I'm not sure they belong here as they tend to be long and so steal space from the other things we want to see. But I'm still deciding just what output to include.
meda::cb() breaks for me on data frames with factors, reporting:
Error in Summary.factor(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, :
‘min’ not meaningful for factors
The default output should be as simple as possible, I think. Things like labels, quartiles and medians are good to have at times, but to me they are optional. Standard deviation is also something that I think should be optional. It is necessary in printed output (as are labels), but when working with data I look for things like strange means or many NAs. I don't think std is something I look at. But as I said, this is only my opinion...
I think there's enough difference of opinion here to suggest that it's best such a function live in another package.
I agree that short function names are really bad. Coming from Stata, I've gotten used to short commands and seeing quick results. I'll fix the error for factors.
I also agree that the default output should be as simple as possible. Ultimately, I would really like to see something in dplyr that is better than the summary function, even if it is just an improved layout.
"As simple as possible, but no simpler." (Einstein)
Sacrificing reasonableness or usefulness for simplicity is probably not a good trade. Computing means of factors is mostly silly. Treating everything as numeric is not a good direction. So I think you are left with separate treatment for (groups of) types of variables, or a summary that includes only things that always make sense (type, class, n, missing, etc) much like meda::d().
As @hadley suggests, there may be multiple takes on what such a summary should include and how it should be formatted. Indeed there are already a number of these floating around -- including one more now that I've created one in mosaic.
One advantage of having it in dplyr would be that it might be designed to work with non-local objects. Otherwise, I agree with @hadley that this isn't necessarily a job for dplyr.
Thanks for discussing the issue. I hope that the proof of concept of @rpruim is included in the mosaic package. Even though his suggestion is not as good as mine :-), it is already much better than the alternatives.
If anyone arrives here in 2017, there is also the skimr package:
I think that dplyr would benefit from having a function summarizing the data frame variables. It is surprising that the R base package has nothing better than the summary function to provide an overview of a data frame. In dplyr one can look at the data with, for example, glimpse or head, but a concise display of key summary statistics would make data management easier. Summary statistics can provide more information than the raw data. For example, one way to see that a join does not work is to look at the number of NA values.
I have a small function describe() available here that shows what I mean. It takes the same arguments as dplyr::select(), and produces summary statistics as a data frame.
The output is not only useful for looking at the data in R. It is also close to what is often shown as Table 1 in journal articles (at least in applied econometrics).
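The join check mentioned above can be sketched in base R with merge() standing in for a dplyr join. The two data frames are made up for the example:

```r
# Toy data: only two of the four order ids have a matching customer.
orders <- data.frame(id = c(1, 2, 3, 4), amount = c(10, 20, 30, 40))
customers <- data.frame(
  id = c(1, 2), region = c("north", "south"),
  stringsAsFactors = FALSE
)

# Left join: keep every order, fill unmatched customer columns with NA.
joined <- merge(orders, customers, by = "id", all.x = TRUE)

# Counting NAs per column shows at a glance that the join left gaps:
# region has 2 missing values (ids 3 and 4 found no match).
na_counts <- colSums(is.na(joined))
print(na_counts)
```

A summary function that reports the NA count per variable gives the same signal without the extra call.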