Closed jgabry closed 1 year ago
Thinking more about this, any summary method will need access to the output from population_predict()
so this will either need to be passed in to the summary method or regenerated internally. So perhaps a signature like this:
#' @param poststrat_estimates Optionally, the object returned by `population_predict` method.
#' If not provided this is regenerated internally which will be slower for large models and data.
#' @param by Character vector of variable names.
#' @param ... Arguments passed to `print` (e.g. `digits`).
fit$summary(poststrat_estimates, by = NULL, ...)
Internally summary would then aggregate as needed and summarize the aggregated estimates.
We also need to decide what kind of summaries to display for this. Do we want to do mean and sd for the population and for levels of any other variables specified via the `by` argument? Anything other than mean and sd?
Thoughts on this proposal?
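To make the "aggregate as needed and summarize" step concrete, here is a rough sketch on toy data. It assumes the poststratification estimates form a cells-by-draws matrix and that each cell has a population count; the `cells` data frame, the `N_j` column, and the `weighted_draws` helper are made up for illustration and are not mrpkit's internals.

```r
# Hypothetical toy data: 4 poststratification cells x 1000 posterior draws.
set.seed(1)
cells <- data.frame(
  age = c("18-35", "18-35", "36-55", "36-55"),
  N_j = c(100, 300, 200, 400)  # assumed population count per cell
)
poststrat_estimates <- matrix(rbeta(4 * 1000, 7, 3), nrow = 4)

# Population-weighted average of the cell estimates, one value per draw.
# `est * n` recycles `n` down the rows, so row i is multiplied by n[i].
weighted_draws <- function(est, n) colSums(est * n) / sum(n)

# Summary for the whole population.
popn_draws <- weighted_draws(poststrat_estimates, cells$N_j)
popn_summary <- data.frame(mean = mean(popn_draws), sd = sd(popn_draws))

# Same computation within each level of the grouping variable.
groups <- split(seq_len(nrow(cells)), cells$age)
by_summary <- do.call(rbind, lapply(groups, function(idx) {
  d <- weighted_draws(poststrat_estimates[idx, , drop = FALSE], cells$N_j[idx])
  data.frame(mean = mean(d), sd = sd(d))
}))
```

The expensive part in a real model is producing the matrix of draws; the weighting and the mean/sd themselves are cheap by comparison.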
I made a quick draft implementation of my proposal on the summary-method branch. Here's an example of the output:
> fit_1$summary(by = "age", digits = 2)
Population estimate:
mean  sd
0.71  0.022
Estimates by group:
age    mean  sd
18-35  0.59  0.064
36-55  0.75  0.037
56-65  0.76  0.032
66+    0.68  0.040
Thoughts on this?
Does this mean it will do the aggregation/weighting step? That can be quite slow so I think we might want to avoid that.
Good point. I'll switch it so that it's the aggregated estimates that are passed and are only recomputed if the user doesn't pass them in. That way for small models/data the user doesn't need to worry about passing it in and for big models/data the user can pass in the aggregated estimates to avoid the extra computation. Does that sound ok?
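The "only recompute if not passed in" behavior could look roughly like this. It is a minimal sketch with hypothetical stand-ins (`expensive_aggregate`, `summarize_estimates`, and the `value` column are invented here), not the code on the branch.

```r
# Hypothetical stand-in for the slow prediction + aggregation step.
expensive_aggregate <- function() {
  message("recomputing aggregated estimates")
  data.frame(value = c(0.69, 0.71, 0.73))
}

# Summarize precomputed estimates if given, otherwise regenerate them.
summarize_estimates <- function(population_estimates = NULL) {
  if (is.null(population_estimates)) {
    population_estimates <- expensive_aggregate()
  }
  data.frame(
    mean = mean(population_estimates$value),
    sd = sd(population_estimates$value)
  )
}

# Small models/data: let the function do the work.
res_internal <- summarize_estimates()

# Big models/data: aggregate once, then reuse the result.
precomputed <- expensive_aggregate()
res_precomputed <- summarize_estimates(precomputed)
```

Both calls return the same summary; the second one only pays for the aggregation once, however many times the summary is requested.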
On Thu, May 6, 2021 at 11:50 PM Lauren Kennedy @.***> wrote:
Does this mean it will do the aggregation/weighting step? That can be quite slow so I think we might want to avoid that.
To pass in aggregated estimates means having to pass in two separate objects, the population estimates and any group-level estimates because aggregate does them separately. So it would be a signature along these lines:
#' @param population_estimates Optionally, population estimates returned by the
#' `aggregate` method. If not provided this is regenerated internally, which
#' may be slow for large models and data.
#' @param group_estimates Optionally, group estimates returned by the
#' `aggregate` method. If not provided this is regenerated internally (if `by`
#' is not NULL), which may be slow for large models and data.
#' @param by Character vector of variable names. If `group_estimates` is not
#' provided then `by` is used to specify which variables to summarize by.
SurveyFit$summary(population_estimates, group_estimates, by = NULL)
This results in two different ways of using the summary method, one easier but inefficient and the other requiring more coding but much more efficient:
# this will do population_predict and aggregate internally
fit$summary(by = "age")
poststrat_ests <- fit$population_predict()
popn_ests <- fit$aggregate(poststrat_ests)
age_ests <- fit$aggregate(poststrat_ests, by = "age")
# this only does the summarizing, no prediction and aggregation
fit$summary(popn_ests, age_ests)
I've been playing around with this more (on the summary-method branch) and here's the latest version I've come up with (no worries at all if you don't like it, I'm just experimenting). Various different behaviors are supported with different use cases and it is always possible to either pass in the aggregated estimates (to avoid extra computation) or let mrpkit recompute the aggregated estimates (less efficient but cleaner code):
# computation done internally
fit_1$summary(digits = 1)
# computation done ahead of time
poststrat_estimates <- fit_1$population_predict()
estimates_popn <- fit_1$aggregate(poststrat_estimates)
fit_1$summary(estimates_popn, digits = 1)
In both cases the output looks like this:
Population estimate:
mean  sd
0.7   0.02
# computation done internally
fit_1$summary(by = "age", digits = 1)
# computation done ahead of time
poststrat_estimates <- fit_1$population_predict()
estimates_popn <- fit_1$aggregate(poststrat_estimates)
estimates_age <- fit_1$aggregate(poststrat_estimates, by = "age")
fit_1$summary(estimates_popn, estimates_age, digits = 1)
In both cases the output looks like this:
Population estimate:
mean  sd
0.7   0.02
Estimates by age:
age    mean  sd
18-35  0.7   0.06
36-55  0.8   0.03
56-65  0.8   0.02
66+    0.7   0.04
# computation done internally
fit_1$summary(by = c("age", "gender"), digits = 1)
# computation done ahead of time
poststrat_estimates <- fit_1$population_predict()
estimates_popn <- fit_1$aggregate(poststrat_estimates)
estimates_age <- fit_1$aggregate(poststrat_estimates, by = "age")
estimates_gender <- fit_1$aggregate(poststrat_estimates, by = "gender")
fit_1$summary(estimates_popn, list(estimates_age, estimates_gender), digits = 1)
In both cases the output looks like this:
Population estimate:
mean  sd
0.7   0.02
Estimates by age:
age    mean  sd
18-35  0.7   0.06
36-55  0.8   0.03
56-65  0.8   0.02
66+    0.7   0.04
Estimates by gender:
gender     mean  sd
male       0.8   0.02
female     0.6   0.04
nonbinary  0.6   0.09
When summarizing by multiple grouping variables using `by`, the `aggregate` method is called multiple times internally (since it currently accepts only one `by` variable at a time).
When summarizing by multiple grouping variables using precomputed aggregated estimates, they are passed in as a list.
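A rough sketch of how both paths might be handled, assuming an aggregation step that takes a single grouping variable. The helpers `aggregate_one`, `aggregate_many`, and `as_estimate_list` and the toy data are hypothetical, and the unweighted mean is only there to give the loop something to compute.

```r
# Toy stand-in for an aggregation step that accepts one `by` variable at a time.
aggregate_one <- function(data, by) {
  stats::aggregate(data["estimate"], by = data[by], FUN = mean)
}

# Call the single-variable step once per grouping variable.
aggregate_many <- function(data, by) {
  out <- lapply(by, function(v) aggregate_one(data, by = v))
  names(out) <- by
  out
}

toy <- data.frame(
  age = c("18-35", "18-35", "66+", "66+"),
  gender = c("male", "female", "male", "female"),
  estimate = c(0.6, 0.8, 0.7, 0.5)
)
group_estimates <- aggregate_many(toy, by = c("age", "gender"))

# Precomputed input may be one data frame or a list of them;
# wrapping a single data frame lets the rest of the code assume a list.
as_estimate_list <- function(x) if (is.data.frame(x)) list(x) else x
```

Normalizing to a list up front means the printing code can loop over the group tables the same way whether they were computed internally or passed in.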
@lauken13 @mitzimorris @RohanAlexander @Dewi-Amaliah Thoughts on this proposal? I can keep working on it in advance of Monday if you have any feedback. Otherwise no worries, we can talk about it on Monday.
After talking with @RohanAlexander and @Dewi-Amaliah in the last meeting we decided to also add a `stats` argument for specifying a list of summary statistics to compute. Any names provided in the list are used as the names of the resulting columns in the output. If a name is not provided, it is inferred when the function is specified as a string; otherwise there's an error.
This example demonstrates all the valid ways of specifying `stats`:
stats = list("mean", banana = sd, p20 = function(x) quantile(x, 0.2))
In this case the output would have columns "mean" (the mean), "banana" (the standard deviation), and "p20" (the 20th percentile). So this supports giving the name of a function as a string (e.g. "mean"), providing the function object itself (e.g. sd without quotes), or providing a function definition (function(x) ...).
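One way the naming rules could be implemented is sketched below. `resolve_stats` is a hypothetical helper written for illustration, not the function used in mrpkit.

```r
# Turn a `stats` list into a fully named list of functions.
resolve_stats <- function(stats) {
  nms <- names(stats)
  if (is.null(nms)) nms <- rep("", length(stats))
  for (i in seq_along(stats)) {
    s <- stats[[i]]
    if (is.character(s)) {
      # A string: infer the column name if none was given, then look it up.
      if (!nzchar(nms[i])) nms[i] <- s
      stats[[i]] <- match.fun(s)
    } else if (!is.function(s)) {
      stop("each element of `stats` must be a function or a function name")
    } else if (!nzchar(nms[i])) {
      # A function object without a name cannot be labelled in the output.
      stop("functions in `stats` must be named unless given as a string")
    }
  }
  names(stats) <- nms
  stats
}

stats <- resolve_stats(
  list("mean", banana = sd, p20 = function(x) quantile(x, 0.2))
)

# Apply every statistic to a toy vector of draws.
draws <- c(0.1, 0.2, 0.3, 0.4, 0.5)
res <- vapply(stats, function(f) unname(f(draws)), numeric(1))
```

Here `res` is a named numeric vector whose names ("mean", "banana", "p20") would become the column names; passing an unnamed function such as `list(sd)` raises an error.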
closing because we have a summary method now
summarizing the resulting estimates (not the fitted model, which has its own print and summary methods)