lauken13 / mrpkit

Tools and tutorials for multi-level regression and post-stratification of survey data

summary method for SurveyFit class #63

Closed jgabry closed 1 year ago

jgabry commented 3 years ago

Summarizing the resulting estimates (not the fitted model, which has its own print and summary methods).

jgabry commented 3 years ago

Thinking more about this, any summary method will need access to the output from population_predict(), so that output will either need to be passed in to the summary method or regenerated internally. So perhaps a signature like this:

#' @param poststrat_estimates Optionally, the object returned by `population_predict` method.
#'   If not provided this is regenerated internally which will be slower for large models and data.
#' @param by Character vector of variable names. 
#' @param ... Arguments passed to `print` (e.g. `digits`).
fit$summary(poststrat_estimates, by = NULL, ...)

Internally summary would then aggregate as needed and summarize the aggregated estimates.

We also need to decide what kind of summaries to display for this. Do we want to do mean and sd for population and levels of any other variables specified via the by argument? Anything other than mean and sd?
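To make the mean/sd question concrete, here is a rough sketch of what the summarizing step could look like. It uses simulated draws, not real mrpkit output, and it assumes the aggregated estimates are data frames with one row per posterior draw and the estimates in a `value` column (plus one grouping column for group-level estimates). That layout is an assumption for illustration.

```r
# Simulated stand-ins for aggregated estimates (NOT real mrpkit output).
# Assumption: one row per posterior draw, estimates in a `value` column,
# plus one grouping column for the group-level estimates.
set.seed(1)
popn_draws <- data.frame(value = rnorm(1000, mean = 0.7, sd = 0.02))
age_draws <- data.frame(
  age = rep(c("18-35", "36-55"), each = 1000),
  value = c(rnorm(1000, 0.6, 0.06), rnorm(1000, 0.75, 0.04))
)

# Population summary: a one-row data frame with the mean and sd of the draws.
popn_summary <- data.frame(
  mean = mean(popn_draws$value),
  sd = sd(popn_draws$value)
)

# Group summary: mean and sd of `value` within each level of `age`.
age_summary <- do.call(rbind, lapply(split(age_draws, age_draws$age), function(d) {
  data.frame(age = d$age[1], mean = mean(d$value), sd = sd(d$value))
}))
rownames(age_summary) <- NULL

# Extra arguments such as `digits` would be forwarded to print().
print(popn_summary, digits = 2, row.names = FALSE)
print(age_summary, digits = 2, row.names = FALSE)
```

Other statistics (quantiles, for example) would slot into the same `data.frame(...)` calls.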

Thoughts on this proposal?

jgabry commented 3 years ago

I made a quick draft implementation of my proposal on the summary-method branch. Here's an example of the output:

> fit_1$summary(by = "age", digits = 2)

Population estimate:
 mean    sd
 0.71 0.022

Estimates by group:
   age mean    sd
 18-35 0.59 0.064
 36-55 0.75 0.037
 56-65 0.76 0.032
   66+ 0.68 0.040

Thoughts on this?

lauken13 commented 3 years ago

Does this mean it will do the aggregation/weighting step? That can be quite slow so I think we might want to avoid that.

jgabry commented 3 years ago

Good point. I'll switch it so that it's the aggregated estimates that are passed and are only recomputed if the user doesn't pass them in. That way for small models/data the user doesn't need to worry about passing it in and for big models/data the user can pass in the aggregated estimates to avoid the extra computation. Does that sound ok?
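Roughly, the "recompute only if not passed in" behavior could be sketched like this. The function names here (`expensive_aggregate`, `summarize_estimates`) are hypothetical stand-ins, not part of mrpkit, and the `value` column is an assumed layout.

```r
# Hypothetical stand-in for the expensive predict-then-aggregate step.
expensive_aggregate <- function() {
  message("recomputing aggregated estimates (slow path)")
  data.frame(value = c(0.68, 0.71, 0.74))
}

# Hypothetical summary helper: recompute only if nothing was passed in.
summarize_estimates <- function(estimates = NULL) {
  if (is.null(estimates)) {
    estimates <- expensive_aggregate()
  }
  data.frame(mean = mean(estimates$value), sd = sd(estimates$value))
}

# Small models/data: let the helper recompute internally.
summarize_estimates()

# Big models/data: pay the cost once, then reuse the result.
precomputed <- expensive_aggregate()
summarize_estimates(precomputed)
```

Both calls return the same summary. Only the first one triggers the slow path inside the helper.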


jgabry commented 3 years ago

Passing in aggregated estimates means passing in two separate objects, the population estimates and any group-level estimates, because aggregate computes them separately. So it would be a signature along these lines:

#' @param population_estimates Optionally, population estimates returned by the
#'   `aggregate` method. If not provided this is regenerated internally, which
#'   may be slow for large models and data.
#' @param group_estimates Optionally, group estimates returned by the
#'   `aggregate` method. If not provided this is regenerated internally (if `by`
#'   is not NULL), which may be slow for large models and data.
#' @param by Character vector of variable names. If `group_estimates` is not
#'   provided then `by` is used to specify which variables to summarize by.
SurveyFit$summary(population_estimates, group_estimates, by = NULL)

This results in two different ways of using the summary method, one easier but inefficient and the other requiring more coding but much more efficient:

  1. Simplest way to call it but least efficient

# this will do population_predict and aggregate internally
fit$summary(by = "age")

  2. More work required by user but much more efficient

poststrat_ests <- fit$population_predict()
popn_ests <- fit$aggregate(poststrat_ests)
age_ests <- fit$aggregate(poststrat_ests, by = "age")

# this only does the summarizing, no prediction and aggregation
fit$summary(popn_ests, age_ests)

jgabry commented 3 years ago

I've been playing around with this more (on the summary-method branch) and here's the latest version I've come up with (no worries at all if you don't like it, I'm just experimenting). Several use cases are supported, and in each one it is possible to either pass in the aggregated estimates (to avoid extra computation) or let mrpkit recompute the aggregated estimates (less efficient but cleaner code):

Summarize only the population estimate

# computation done internally
fit_1$summary(digits = 1)

# computation done ahead of time
poststrat_estimates <- fit_1$population_predict()
estimates_popn <- fit_1$aggregate(poststrat_estimates)
fit_1$summary(estimates_popn, digits = 1)

In both cases the output looks like this:

Population estimate:
 mean   sd
  0.7 0.02

Summarize population and a single grouping variable

# computation done internally
fit_1$summary(by = "age", digits = 1)

# computation done ahead of time
poststrat_estimates <- fit_1$population_predict()
estimates_popn <- fit_1$aggregate(poststrat_estimates)
estimates_age <- fit_1$aggregate(poststrat_estimates, by = "age")
fit_1$summary(estimates_popn, estimates_age, digits = 1)

In both cases the output looks like this:

Population estimate:
 mean   sd
  0.7 0.02

Estimates by age:
   age mean   sd
 18-35  0.7 0.06
 36-55  0.8 0.03
 56-65  0.8 0.02
   66+  0.7 0.04

Summarize population and multiple grouping variables

# computation done internally
fit_1$summary(by = c("age", "gender"), digits = 1)

# computation done ahead of time
poststrat_estimates <- fit_1$population_predict()
estimates_popn <- fit_1$aggregate(poststrat_estimates)
estimates_age <- fit_1$aggregate(poststrat_estimates, by = "age")
estimates_gender <- fit_1$aggregate(poststrat_estimates, by = "gender")
fit_1$summary(estimates_popn, list(estimates_age, estimates_gender), digits = 1)

In both cases the output looks like this:

Population estimate:
 mean   sd
  0.7 0.02

Estimates by age:
   age mean   sd
 18-35  0.7 0.06
 36-55  0.8 0.03
 56-65  0.8 0.02
   66+  0.7 0.04

Estimates by gender:
    gender mean   sd
      male  0.8 0.02
    female  0.6 0.04
 nonbinary  0.6 0.09

When summarizing by multiple grouping variables using the by argument, the aggregate method is called multiple times internally (since it currently accepts only one by variable at a time).

When summarizing by multiple grouping variables using precomputed aggregated estimates, they are passed in as a list.
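Accepting either a single data frame or a list of them could be handled with a small normalizing step like the sketch below. The helper name `as_estimates_list` and the column layout of the toy data are assumptions for illustration, not the code on the branch.

```r
# Hypothetical helper: accept one data frame or a list of data frames.
as_estimates_list <- function(group_estimates) {
  if (is.data.frame(group_estimates)) {
    # A data frame is itself a list, so check for it first before
    # treating the input as a list of data frames.
    group_estimates <- list(group_estimates)
  }
  stopifnot(
    is.list(group_estimates),
    all(vapply(group_estimates, is.data.frame, logical(1)))
  )
  group_estimates
}

# Toy stand-ins for aggregated estimates (column layout is an assumption).
estimates_age <- data.frame(age = c("18-35", "66+"), value = c(0.6, 0.7))
estimates_gender <- data.frame(gender = c("male", "female"), value = c(0.8, 0.6))

# Single grouping variable: the data frame is wrapped in a list of length 1.
length(as_estimates_list(estimates_age))

# Multiple grouping variables: the list is passed through unchanged.
length(as_estimates_list(list(estimates_age, estimates_gender)))
```

After this step the summarizing code can loop over the list without caring how many grouping variables were supplied.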

@lauken13 @mitzimorris @RohanAlexander @Dewi-Amaliah Thoughts on this proposal? I can keep working on it in advance of Monday if you have any feedback. Otherwise no worries, we can talk about it on Monday.

jgabry commented 3 years ago

After talking with @RohanAlexander and @Dewi-Amaliah in the last meeting, we decided to also add a stats argument for specifying a list of summary statistics to compute. Any names provided in the list are used as the names of the resulting columns in the output. If a name is not provided, it is inferred when the function is specified as a string; otherwise there's an error.

This example demonstrates all the valid ways of specifying stats:

stats = list("mean", banana = sd, p20 = function(x) quantile(x, 0.2))

In this case the output would have columns "mean" (the mean), "banana" (the standard deviation), and "p20" (the 20th percentile). So this supports giving the name of a function as a string (e.g. "mean"), providing the function object itself (e.g. sd without quotes), or providing a function definition (function(x) ...).
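For reference, the naming rules above could be implemented along these lines. This is only a sketch: `resolve_stats` is a hypothetical helper, not necessarily how the summary method does it, and the error messages are made up.

```r
# Hypothetical helper: turn a `stats` list into a named list of functions.
resolve_stats <- function(stats) {
  nms <- names(stats)
  if (is.null(nms)) nms <- rep("", length(stats))
  for (i in seq_along(stats)) {
    if (is.character(stats[[i]])) {
      # A string names a function; reuse it as the column name if none was given.
      if (nms[i] == "") nms[i] <- stats[[i]]
      stats[[i]] <- match.fun(stats[[i]])
    } else if (!is.function(stats[[i]])) {
      stop("Each element of 'stats' must be a function or a function name.")
    } else if (nms[i] == "") {
      # No way to infer a column name from a bare function object.
      stop("Functions that are not given as strings must be named.")
    }
  }
  names(stats) <- nms
  stats
}

stats <- resolve_stats(list("mean", banana = sd, p20 = function(x) quantile(x, 0.2)))

# Apply every statistic to a toy vector; the list names become the column names.
x <- c(1, 2, 3, 4, 5)
result <- vapply(stats, function(f) unname(f(x)), numeric(1))
names(result)
```

Passing an unnamed function object, e.g. `resolve_stats(list(sd))`, stops with an error, which matches the rule that names can only be inferred from strings.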

jgabry commented 1 year ago

closing because we have a summary method now