stan-dev / posterior

The posterior R package
https://mc-stan.org/posterior/
Other
166 stars 24 forks source link

summary with one line per _named_ parameter #43

Open jgabry opened 5 years ago

jgabry commented 5 years ago

The discussion in #32 reminded me that one thing @andrewgelman and others have been wanting for a while (and that I think would also be useful as an alternative to the standard summary output) is a summary where each vector/matrix/array parameter only occupies a single line just like scalars. That is, each line corresponds to a named parameter rather than a parameter element.

(This could be its own thing or just an option to a custom print method for our existing summary objects.)

The quantities to display are debatable (e.g. maybe min of all the individual elements’ ESS, max of the Rhats, etc), so it would be good to get input from Andrew and anyone else interested.

jgabry commented 5 years ago

One possibility is something like what I put below (I think @andrewgelman and @avehtari and I discussed something like this one time when Aki was visiting). Not sure which columns we include, but a concise table like this would be a really nice option for models with many parameters:

variable   min_mean max_mean max_rhat min_ess_bulk min_ess_tail
alpha[1]       4.5     4.5    1.0         881         300
tau[1]         3.8     3.8    1.1         386         311
omega[9]       1.5     7.4    1.0         552         272
Sigma[7,7]    -1.7     5.1    1.0         421         202

Here the numbers inside brackets indicate the dimensions (we could also omit if a scalar instead of putting [1]). At first people could confuse this with just showing the 9th element of omega, for example, but I think we could make it clear in the doc (and this also wouldn't be the default output so they would be intentionally selecting this). The columns min_mean and max_mean here are the min and max of the "mean" columns if the summary was computed for the individual elements (to give a sense of how much the means can differ) but I'm not sure if we want to include that or not.

@andrewgelman Did we end up coming up with a better idea than putting the dimensions in brackets?

Anyone else have any thoughts on this?

avehtari commented 5 years ago

Since brackets are used for indexing, this can cause confusion for R users. Would it be too much to have a column dim as the second column?

variable dim min_mean max_mean max_rhat min_ess_bulk min_ess_tail
alpha    1       4.5     4.5    1.0         881         300
tau      1       3.8     3.8    1.1         386         311
omega    9       1.5     7.4    1.0         552         272
Sigma    7,7    -1.7     5.1    1.0         421         202
paul-buerkner commented 5 years ago

dim separately may be a little ugly and not not too helpful for filtering I would say as the structure would vary across variables, that is, sometimes 1 number sometimes 2 or more separated with commas.

jgabry commented 5 years ago

What if we just replace the brackets to a different symbol to refer to dimension? There are lots of possibilities. Here are just a few:

variable   min_mean max_mean max_rhat min_ess_bulk min_ess_tail
alpha (1)       4.5     4.5    1.0         881         300
tau   (1)       3.8     3.8    1.1         386         311
omega (9)       1.5     7.4    1.0         552         272
Sigma (7,7)    -1.7     5.1    1.0         421         202
variable   min_mean max_mean max_rhat min_ess_bulk min_ess_tail
alpha {1}       4.5     4.5    1.0         881         300
tau   {1}       3.8     3.8    1.1         386         311
omega {9}       1.5     7.4    1.0         552         272
Sigma {7,7}    -1.7     5.1    1.0         421         202
variable   min_mean max_mean max_rhat min_ess_bulk min_ess_tail
alpha <1>       4.5     4.5    1.0         881         300
tau   <1>       3.8     3.8    1.1         386         311
omega <9>       1.5     7.4    1.0         552         272
Sigma <7,7>    -1.7     5.1    1.0         421         202
mjskay commented 5 years ago

@paul-buerkner I'm not sure I understand the filtering issue with dim in a separate column --- it seems to me it would be easier to filter if dims were not a part of the variable column? E.g. filter(., variable == "Sigma")?

paul-buerkner commented 5 years ago

@mjskay you have a good point there. Not sure how to best represent the dimension, but you are right that filtering after variable should be easily possible.

andrewgelman commented 5 years ago

If we're gonna do one of these, I'd prefer the parentheses, as the braces look weird to me and the angle brackets make me thing of upper and lower. But if we're going to do it this way, what about indicating this in the header, thus:

variable (dims) min_mean max_mean max_rhat min_ess_bulk min_ess_tail

A

On Nov 13, 2019, at 11:34 AM, Jonah Gabry notifications@github.com wrote:

What if we just replace the brackets to a different symbol to refer to dimension? There are lots of possibilities. Here are just a few:

variable min_mean max_mean max_rhat min_ess_bulk min_ess_tail alpha (1) 4.5 4.5 1.0 881 300 tau (1) 3.8 3.8 1.1 386 311 omega (9) 1.5 7.4 1.0 552 272 Sigma (7,7) -1.7 5.1 1.0 421 202 variable min_mean max_mean max_rhat min_ess_bulk min_ess_tail alpha {1} 4.5 4.5 1.0 881 300 tau {1} 3.8 3.8 1.1 386 311 omega {9} 1.5 7.4 1.0 552 272 Sigma {7,7} -1.7 5.1 1.0 421 202 variable min_mean max_mean max_rhat min_ess_bulk min_ess_tail alpha <1> 4.5 4.5 1.0 881 300 tau <1> 3.8 3.8 1.1 386 311 omega <9> 1.5 7.4 1.0 552 272 Sigma <7,7> -1.7 5.1 1.0 421 202 — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/jgabry/posterior/issues/43?email_source=notifications&email_token=AAZYCUKLYHJ4CTB45ZWQV4TQTQUBFA5CNFSM4JKQOFF2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOED6XVOA#issuecomment-553482936, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAZYCUPJIO2MOH7YPDFUPHLQTQUBFANCNFSM4JKQOFFQ.

jsocolar commented 3 years ago

Hey guys, I'm still playing with ways to summarize the huge draws array that I'm working with, and realized that this issue is pretty relevant as well. I've approached the problem as below. If the approach looks not-crazy then I could try to clean it up and turn it into a PR. This would involve better error handling, inferring the dimensions of the nonscalar parameters, and wrangling the output into whatever format you want.

The default (which is tailored to Stan) is to "roll up" the summaries for any parameter with a [ in its name. There's also an option to pass the parameter names for rolling up, again with the assumption that they'll show up in the draws array as ^parameter_name\\[.*. The logic for passing parameter names is that nonscalar parameters come in (at least) two different flavors. One is big random effects and covariance matrices that ought to be rolled up. The other is vectors of regression coefficients probably should not be rolled up.

Functions:

summary_rollup <- function (draws_summary, rollup_vars = NULL,
                            min_only = c("ess_bulk", "ess_tail"),
                            max_only = "rhat") {
  vars <- draws_summary$variable

  if (is.null(rollup_vars)) {
    vars_nonscalar <- grepl("\\[", vars)
  } else {
    vars_nonscalar <- as.logical(colSums(do.call(rbind, lapply(paste0("^", rollup_vars, "\\["), function(x){grepl(x, vars)}))))
  }
  ds_scalar <- draws_summary[!vars_nonscalar, ]
  ds_nonscalar <- draws_summary[vars_nonscalar, ]
  varnames_nonscalar <- gsub("\\[(.*)", "", ds_nonscalar$variable)
  summary_names <- names(draws_summary)[-1]
  names_minmax <- summary_names[!(summary_names %in% c(min_only, max_only))]
  split_nonscalar <- split(ds_nonscalar, varnames_nonscalar)[unique(varnames_nonscalar)] # [unique(varnames_nonscalar)] preserves the order of the names

  min_max <- do.call(rbind, lapply(split_nonscalar, rollup_helper_minmax, names = names_minmax))
  min_only <- do.call(rbind, lapply(split_nonscalar, rollup_helper_min, names = min_only))
  max_only <- do.call(rbind, lapply(split_nonscalar, rollup_helper_max, names = max_only))

  nonscalar_out <- cbind(min_max, max_only, min_only)

  out <- list(unrolled = ds_scalar, rolled = nonscalar_out)
  out
}

rollup_helper_minmax <- function(x, names){
  x <- x[, names]
  mm <- c(apply(x, 2, function(x) {c(min(x), max(x))}))
  names(mm) <- paste0(rep(names(x), each = 2), c("_min", "_max"))
  mm
}

rollup_helper_min <- function(x, names){
  x <- x[, names]
  min_only <- apply(x, 2, min)
  names(min_only) <- paste0(names(x), "_min")
  min_only
}

rollup_helper_max <- function(x, names){
  x <- x[, names]
  max_only <- apply(x, 2, max)
  names(max_only) <- paste0(names(x), "_max")
  max_only
}

Example:

ds <- summarise_draws(example_draws())
ds2 <- summarise_draws(2 * example_draws())
ds2$variable <- paste0("new_", ds2$variable)
draws_summary <- rbind(ds, ds2)

summary_rollup(draws_summary)
summary_rollup(draws_summary, rollup_vars = c("theta", "new_theta"))
summary_rollup(draws_summary, rollup_vars = "theta")
mjskay commented 3 years ago

This looks like it could be very useful to me! Two quick thoughts if we go down this road:

  1. To be consistent with other functions I'd consider a verb-ish name, maybe rollup_summary() instead of summary_rollup()
  2. For dimension parsing, we probably want to have a consistent scheme for doing this wherever we have to do it, suggesting we at the very least should have an internal helper function and maybe even expose this externally if we think it would be useful. This will allow us to have consistent rules for doing it, and maybe add options in the future for people that need to parse non-standard dimension naming schemes. The beginnings of such a function could be made by factoring out the dimension-parsing code from as_draws_rvars.draws_matrix(), which already handles a number of weird cases (like missing elements, dimensions that don't start at 1, and non-numeric dimensions). Discussion of this aspect specifically should probably happen on #61