hubverse-org / hubEnsemblesManuscript

https://htmlpreview.github.io/?https://github.com/Infectious-Disease-Modeling-Hubs/hubEnsemblesManuscript/blob/master/analysis/paper/hubEnsembles_manuscript.html

comments on manuscript #23

Closed nickreich closed 5 months ago

nickreich commented 7 months ago
elray1 commented 7 months ago

brief comment r.e. stacking -- stacking is a more general idea than linear pools, which includes (weighted) linear pools as a special case, but also weighted quantile averaging, etc. The idea of stacking is that you take the outputs/predictions from a group of "level 0" models (our component models) as the inputs to a "level 1" model (our ensemble method), using out-of-sample predictions from the level 0 models to train the level 1 model. I don't object to providing a reference to Wolpert, but that paper doesn't feel as relevant to me as the other things we're already citing like Stone -- Wolpert doesn't directly address anything probabilistic, but Stone is all about linear opinion pools. And Stone is also an earlier reference.
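To make the level 0 / level 1 idea concrete, here is a minimal sketch of stacking for a two-model weighted linear pool. Everything in it is hypothetical: the pmf values, the observations, and the choice of log score with a grid search. The thread does not describe any weight-fitting code, so this should be read as an illustration of the concept only.

```python
import math

# Hypothetical out-of-sample predictions: for each past task, each level-0
# model gave a pmf over three bins; `observed` is the index of the bin that occurred.
model_a = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.3, 0.6]]
model_b = [[0.3, 0.4, 0.3], [0.2, 0.6, 0.2], [0.1, 0.8, 0.1], [0.3, 0.4, 0.3]]
observed = [0, 1, 1, 2]


def mean_log_score(w):
    """Mean log score of the weighted linear pool w * A + (1 - w) * B."""
    total = 0.0
    for pmf_a, pmf_b, y in zip(model_a, model_b, observed):
        total += math.log(w * pmf_a[y] + (1 - w) * pmf_b[y])
    return total / len(observed)


# Level-1 "training": choose the pool weight with the best out-of-sample
# score, searching over a simple grid of candidate weights.
grid = [i / 100 for i in range(101)]
best_w = max(grid, key=mean_log_score)
print(f"weight on model A: {best_w:.2f}")
```

Swapping the pooled pmf for a weighted quantile average would give a different level 1 model under the same stacking recipe, which is the sense in which stacking is more general than linear pools.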

lshandross commented 7 months ago

About your questions on the case study section:

nickreich commented 7 months ago

I'd have to look at other "software demo" papers to know what the standards are here. It does seem complicated to "show all code" when there are quite a few bespoke steps to downloading and scoring the data. My initial take is that it's ok to have some of that processing excluded as long as

  1. things are well-set-up for reproducibility, e.g., the code for the data extraction and scoring is available and any pre-processed data live in the repo.
  2. we add a separate subsection of the paper that has a statement about reproducibility/software/data/links to repo.
eahowerton commented 6 months ago

@nickreich - for your second suggestion, let me make sure I understand correctly. I think you are suggesting that we add a table that will connect the mathematical descriptions provided in section 2 with the implementation details provided in section 3 (and perhaps consolidate the corresponding discussion that is currently scattered throughout the text). I think this is a really good idea. I see two potential options for implementation:

Option 1: for a given function, list the corresponding mathematical operation

| output_type | simple_ensemble() | linear_pool() |
|---|---|---|
| mean | mean of individual model means | mean of individual model means |
| median | mean of individual model means | NA |
| quantile | quantile average | probability average |
| cdf | probability average | probability average |
| pmf | probability average | probability average |

Note, we could also add mathematical notation if it'd be useful.

Option 2: for a given mathematical operation, list the function that will perform it

| output_type | quantile average | probability average |
|---|---|---|
| quantile | simple_ensemble() | linear_pool() |
| cdf | NA | simple_ensemble() or linear_pool() |
| pmf | NA | simple_ensemble() or linear_pool() |

A couple thoughts from me:

  1. I tend to think it is best to identify a mathematical operation that one would like to perform, and then find the corresponding function (this is better suited to the setup of Option 2). However, there is a clearer mapping between the hubEnsembles functions and each output type, which would favor Option 1. For example, there's no clear analog of quantile average for the mean output type, which is why I haven't included the mean and median output types in Option 2.
  2. The structure of Option 2 also raises a bigger question. While I don't think it makes sense to perform a quantile average in cases where pmf or cdf output types represent discrete variables, I think there also could be cases where these output types are used to discretize distributions of continuous random variables (just as we do with quantile output type). In the latter cases, one can imagine performing a quantile average (perhaps forecasts of peak timing could be an example), but I don't think hubEnsembles currently supports this. Is that right? In a case like this, would the user have to interpolate the CDFs externally before using hubEnsembles? @elray1 and @lshandross curious your thoughts on this point as well.
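As a rough illustration of the "interpolate the CDFs externally" idea raised above, the sketch below inverts each model's cdf by linear interpolation and then averages the resulting quantiles. The grid, the cdf values, and the helper `quantile_from_cdf` are all hypothetical, and linear interpolation is only one possible assumption about how the distribution behaves between grid points.

```python
def quantile_from_cdf(xs, cdf_vals, level):
    """Invert a CDF given on a grid by linear interpolation between grid points."""
    for i in range(len(xs) - 1):
        p0, p1 = cdf_vals[i], cdf_vals[i + 1]
        if p0 < p1 and p0 <= level <= p1:
            return xs[i] + (xs[i + 1] - xs[i]) * (level - p0) / (p1 - p0)
    raise ValueError("level is not covered by the supplied CDF values")


# Hypothetical cdf-type forecasts of a continuous target (e.g. peak timing in weeks)
xs = [0, 1, 2, 3, 4]
cdfs = {
    "model_1": [0.0, 0.2, 0.6, 0.9, 1.0],
    "model_2": [0.0, 0.1, 0.3, 0.7, 1.0],
}
levels = [0.25, 0.5, 0.75]

# Quantile average: mean of the models' interpolated quantiles at each level
quantile_average = [
    sum(quantile_from_cdf(xs, cdf, level) for cdf in cdfs.values()) / len(cdfs)
    for level in levels
]
print(quantile_average)
```

The result is on the quantile scale, so a user who needs cdf output would still have to map it back onto the original grid of target values.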
nickreich commented 6 months ago

Thanks for this carefully laid-out response, with specific options. What about a slight modification to your "Option 1" to include a column for each of the three implemented options for aggregation functions, like this:

Option 1A: for a given function and arguments, list the corresponding mathematical operation

| output_type | simple_ensemble(..., agg_fun = "mean") | simple_ensemble(..., agg_fun = "median") | linear_pool() |
|---|---|---|---|
| mean | mean of individual model means | median of individual model means | mean of individual model means |
| median | mean of individual model medians | median of individual model medians | NA |
| quantile | mean at each quantile level | median at each quantile level | average probability at each x |
| cdf | mean cdf value at specified x's | median cdf value at specified x's | average probability at each x |
| pmf | mean pmf value for each bin | median pmf value for each bin | mean pmf value for each bin |

I changed the language in the table a bit in hopes of making it a bit more readable without notation, but I'm not sure it's an improvement. Specifically, I was finding it hard to read "quantile average" and "probability average" in the tables and get a picture immediately of what those operations were. I'm not sure that my proposed text is better or more accurate.
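A toy computation may help picture what "mean pmf value for each bin" and "median pmf value for each bin" mean in practice. The three pmfs below are made up, and the code is a plain-Python sketch of the bin-wise arithmetic, not a call into hubEnsembles.

```python
from statistics import mean, median

# Hypothetical pmf forecasts from three component models over the same three bins
pmfs = {
    "model_1": [0.2, 0.5, 0.3],
    "model_2": [0.6, 0.3, 0.1],
    "model_3": [0.1, 0.1, 0.8],
}

# Regroup so that each tuple holds the three models' probabilities for one bin
per_bin = list(zip(*pmfs.values()))

mean_pmf = [mean(probs) for probs in per_bin]      # bin-wise mean
median_pmf = [median(probs) for probs in per_bin]  # bin-wise median

print("mean:  ", mean_pmf, "sum =", sum(mean_pmf))
print("median:", median_pmf, "sum =", sum(median_pmf))
```

In this particular example the bin-wise means sum to 1 (up to floating point), while the bin-wise medians sum to about 0.8, so the median version is not automatically a normalized pmf. How the package treats that case is not discussed in this thread.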

lshandross commented 6 months ago

I definitely like one of the two takes on Option 1 over Option 2. I feel like listing the output_type as the first column makes the table more understandable and easy to follow. I also like Nick's addition to show the difference between two different aggregation functions for simple_ensemble and some of the language changes. However, I think we should be more explicit that simple_ensemble(..., agg_fun = "mean") yields the same results as linear_pool for the cdf output types.

elray1 commented 6 months ago

I'm ok with either orientation, agree with Emily's statement of the pros and cons.

some thoughts about language in 1a since it seems like that's the preferred option so far: can we aim for some formulaic language like one of the below, where

options for formulas using the above terms could be like:

eahowerton commented 6 months ago

I think that's a helpful suggestion @elray1. I also think @lshandross has a good point, that with more verbiage we risk losing the bigger picture a bit. It seems that there are two important conceptual ideas to convey with this table: (1) multiple functions give the same result for cdf and pmf output types; (2) the linear_pool() function outputs the same (theoretical) result regardless of output_type. Perhaps mixing in a bit of mathematical notation would help this jump out more?

Here's another version that incorporates the wording suggestion from @elray1 and tries to mix in some simple math:

Option 1B:

| output_type | simple_ensemble(..., agg_fun = "mean") | simple_ensemble(..., agg_fun = "median") | linear_pool() |
|---|---|---|---|
| mean | mean of individual model means | median of individual model means | mean of individual model means |
| median | mean of individual model medians | median of individual model medians | NA |
| quantile | mean of individual model target variable values at each quantile level, $F^{-1}_Q(\theta)$ | median of individual target variable values at each quantile level | mean of individual model target variable values at each quantile level, $F_{LOP}(x)$ |
| cdf | mean of individual model quantile levels at each target variable value, $F_{LOP}(x)$ | median of individual model quantile levels at each target variable value | mean of individual model quantile levels at each target variable value, $F_{LOP}(x)$ |
| pmf | mean of individual model quantile levels at each target variable value, $F_{LOP}(x)$ | median of individual model quantile levels at each target variable value | mean of individual model quantile levels at each target variable value, $F_{LOP}(x)$ |
elray1 commented 6 months ago

I like this latest iteration on the table, including the addition of the notation. Although it did feel funny that there was not notation in the first 2 rows or the 2nd column. But I understand that this is because we don't have convenient/brief notation for these settings...

elray1 commented 6 months ago

update -- for the cdf and pmf rows, to me it feels a bit clearer to write "mean of individual model probabilities at each ..."

lshandross commented 6 months ago

I also like this latest iteration of the table and agree with @elray1's suggestion to use "mean of individual model probabilities at each..." for the cdf and pmf rows.

The cell describing a linear pool for the quantile output type seems a bit confusing to me since the words are the same as that for the simple_ensemble one with a mean aggregation function. I think it should read something more like "mean of individual model quantile levels at each target variable value" (and then it fits nicely with the cdf and pmf cells beneath it)

elray1 commented 6 months ago
eahowerton commented 6 months ago

Good edits, thanks for catching my careless errors! Here's a new version:

| output_type | simple_ensemble(..., agg_fun = "mean") | simple_ensemble(..., agg_fun = "median") | linear_pool() |
|---|---|---|---|
| mean | mean of individual model means | median of individual model means | mean of individual model means |
| median | mean of individual model medians | median of individual model medians | NA |
| quantile | mean of individual model target variable values at each quantile level, $F^{-1}_Q(\theta)$ | median of individual target variable values at each quantile level | mean of individual model target variable values at each quantile level, $F^{-1}_{LOP}(x)$ |
| cdf | mean of individual model probabilities at each target variable value, $F_{LOP}(x)$ | median of individual model probabilities at each target variable value | mean of individual model probabilities at each target variable value, $F_{LOP}(x)$ |
| pmf | mean of individual model probabilities at each target variable value, $f_{LOP}(x)$ | median of individual model probabilities at each target variable value | mean of individual model probabilities at each target variable value, $f_{LOP}(x)$ |

I agree it feels a bit strange that we only use notation in some cells. But I also agree it would probably be more effort/notation than it's worth to formalize something for every cell. A partial solution would be to remove the median column (but keep agg_fun = "mean" in the header of the column that remains). The median column feels a bit redundant to me, but I also see its purpose so I'm fine either way.

nickreich commented 6 months ago

This has been a productive set of iterations! I think it's looking good! A few additional, very small, comments:

elray1 commented 6 months ago

I like it. for quantile/linear_pool, the text description still doesn't feel quite right. It says, "mean of individual model target variable values at each quantile level". but that sounds more like a description of a quantile averaging/Vincent approach

eahowerton commented 6 months ago

You're right @elray1, good catch. Here's the version (I think) we're settling on.

| output_type | simple_ensemble(..., agg_fun = "mean") | linear_pool() |
|---|---|---|
| mean | mean of individual model means | mean of individual model means |
| median | mean of individual model medians | NA |
| quantile | mean of individual model target variable values at each quantile level, $F^{-1}_Q(\theta)$ | mean of individual model quantile levels at each target variable value, $F^{-1}_{LOP}(x)$ |
| cdf | mean of individual model cumulative probabilities at each target variable value, $F_{LOP}(x)$ | mean of individual model cumulative probabilities at each target variable value, $F_{LOP}(x)$ |
| pmf | mean of individual model bin probabilities at each target variable value, $f_{LOP}(x)$ | mean of individual model bin probabilities at each target variable value, $f_{LOP}(x)$ |

One more thought related to @nickreich's suggestion - is it confusing that we're using "cumulative probabilities" in the cdf row and "quantile levels" in quantile row, but we mean the same thing?

nickreich commented 6 months ago

@eahowerton I actually think that the text is correct as is. I always have to re-look at this page to make sure I get it right, but I think the format is that:

If the above is correct, then I think the table is good as is.

elray1 commented 6 months ago

clarifying Emily's comment a little to make sure we're on the same page -- we have these two equations:

  1. $F(x) = \theta$
  2. $F^{-1}(\theta) = x$

The variables $\theta$ and $x$ represent the same thing in these equations, but in the first we call $x$ a "target variable value" and $\theta$ a "cumulative probability", while in the second we call $x$ a "target variable value" in this table but often refer to it as a "quantile", and $\theta$ a "quantile level".

I think that no matter what we do here, it'll be confusing to someone. Maybe the best thing to do is to add something explaining this in the paper. For example, in the methods section, we have this sentence: "To define these two classes of methods, let $F(x)$ be a cumulative density function (CDF) defined over values $x$ of the target variable for the prediction, and $F^{-1}(\theta)$ be the corresponding quantile function defined over quantile levels $\theta \in [0, 1]$." Right after that, we could say something like, "Throughout this article, we may refer to $x$ as either 'a value of the target variable' or 'a quantile' depending on the context, and similarly we may refer to $\theta$ as either 'a quantile level' or 'a (cumulative) probability'."
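For reference, here is a sketch of how the two ensemble distributions discussed in this thread can be written in that notation, assuming $N$ component models with CDFs $F_i$ and equal weights. The symbols $N$ and $F_i$ are introduced here for illustration; a weighted version would replace $1/N$ with model-specific weights.

$$
F_{LOP}(x) = \frac{1}{N} \sum_{i=1}^{N} F_i(x),
\qquad
F^{-1}_{Q}(\theta) = \frac{1}{N} \sum_{i=1}^{N} F_i^{-1}(\theta)
$$

The linear pool averages cumulative probabilities at a fixed target value $x$, while the quantile average averages target values at a fixed quantile level $\theta$, which is why the two descriptions are easy to mix up.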

elray1 commented 6 months ago

Double checking the quantile/linear_pool text again -- I would read "mean of individual model quantile levels at each target variable value" as a description of the computation $\frac{1}{N} \sum_i F_i(x)$, which is how we compute the LOP's cdf $F_{LOP}(x)$. But when the output type is "quantile", we invert that cdf to return some quantiles. This is why in an earlier comment I suggested the notation $F_{LOP}^{-1}(\theta)$, indicating that the output is going to be on the scale of the target, i.e., "an $x$". And revising my earlier attempt at a text description, maybe we want something like "Quantile of the distribution obtained by computing the mean of estimated individual model cumulative probabilities at each target variable value". This is a mouthful and I'm not sure how helpful it really is, but it's an attempt to sum up in one sentence the 3-step process of (1) interpolating/extrapolating from quantiles to a full cdf; (2) forming the LOP; (3) getting quantiles of that LOP distribution.
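The three-step process can be sketched in plain Python as follows. This is an illustration under simplifying assumptions, not the hubEnsembles implementation: the two forecasts are made up, each model is assumed to report its 0 and 1 quantile levels so that no tail extrapolation is needed, and the cdf between reported quantiles is assumed to be piecewise linear. The helper `interp` and the output levels are hypothetical choices.

```python
def interp(x, xs, ys):
    """Piecewise-linear interpolation through (xs, ys), clamped at both ends.

    Assumes xs is sorted in non-decreasing order.
    """
    if x <= xs[0]:
        return ys[0]
    for i in range(len(xs) - 1):
        if xs[i] < xs[i + 1] and x <= xs[i + 1]:
            t = (x - xs[i]) / (xs[i + 1] - xs[i])
            return ys[i] + t * (ys[i + 1] - ys[i])
    return ys[-1]


levels = [0.0, 0.25, 0.5, 0.75, 1.0]
# Hypothetical quantile forecasts (target values at the levels above)
forecasts = {
    "model_1": [0.0, 2.0, 4.0, 6.0, 8.0],
    "model_2": [4.0, 6.0, 8.0, 10.0, 12.0],
}
out_levels = [0.1, 0.5, 0.9]

# Step 1: read each model's (quantile, level) pairs as points on its cdf F_i(x).
# Step 2: average those cdfs at each x on a grid to get the linear pool cdf.
grid = [i / 100 for i in range(0, 1201)]
lop_cdf = [
    sum(interp(x, q, levels) for q in forecasts.values()) / len(forecasts)
    for x in grid
]

# Step 3: invert the pooled cdf to get quantiles of the LOP distribution.
lop_quantiles = [interp(level, lop_cdf, grid) for level in out_levels]

# For contrast, the quantile average takes the mean of the models' quantiles.
quantile_average = [
    sum(interp(level, levels, q) for q in forecasts.values()) / len(forecasts)
    for level in out_levels
]

print("linear pool:     ", [round(v, 3) for v in lop_quantiles])
print("quantile average:", [round(v, 3) for v in quantile_average])
```

With these inputs the two methods agree at the median but differ in the tails, where the linear pool is wider than the quantile average.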

eahowerton commented 6 months ago

Thanks for the clarification, @elray1, this is what I had meant. Adding a sentence like you suggest seems like a good solution to me.

RE your second comment, I see your point. I also agree that trying to convey all of this in the table could be difficult. What do you think about putting some of those details in the table caption, with an asterisk or footnote of some kind in the table itself? I think how we decide to handle this depends on what we want the purpose of this table to be: (1) explain exactly what operations are happening when a function is implemented for a particular output type, or (2) give higher-level similarities and differences between the function operations for different output types. My vote would be for (2), but I am open to alternative opinions.

If we opt for something like (2), I wonder if it would be helpful in the caption (or somewhere in the text) to guide the reader through the relationships between rows and columns in this table. I'm thinking something like: "For probabilistic output types (quantile, cdf, pmf), the output type (rows) determines how the resulting ensemble distribution is summarized (as a quantile $F^{-1}(\theta)$, cumulative distribution function $F(x)$, or probability mass function $f(x)$). The function (columns) determines what kind of ensemble distribution is generated (quantile average, $F_Q(x)$, or linear pool, $F_{LOP}(x)$)."

I'm not sure this is beautifully written, but hopefully you get the idea.

lshandross commented 6 months ago

I'm also inclined to agree with @eahowerton about option (2) of giving a higher level comparison in the table. We already discuss the need for extra steps in calculating a linear pool for quantile forecasts later in the paper, so perhaps a quick note in the table and reference to the correct subsection would be sufficient.

I also like the suggestion of guiding the reader through relationships between rows and columns in the table either in the caption or somewhere in the text (I don't have a strong preference of where it lives).

elray1 commented 6 months ago

I like option (2) for the table too, and the caption suggestion.

I do think we should continue to think about what goes in the text for that particular table cell. I'm on board with not trying to capture all the detail in a brief statement, but I think we should also be careful to ensure that any description we put there is an accurate description of the methods that are used there (or somehow defers and points the reader to a methods description elsewhere). Right now, the text reads to me like a description of the cdf/LOP methods rather than the quantile/LOP methods.

eahowerton commented 6 months ago

Yes, I think you're right @elray1, it's important to distinguish that cell from the cdf/LOP methods. After trying to come up with some other options, I think the text you suggest may be as concise as we can get. So I'm happy to use it in the quantile/LOP cell.

Let me try to summarize what we've decided on in this discussion:

  1. Add the following table:

| output_type | simple_ensemble(..., agg_fun = "mean") | linear_pool() |
|---|---|---|
| mean | mean of individual model means | mean of individual model means |
| median | mean of individual model medians | NA |
| quantile | mean of individual model target variable values at each quantile level, $F^{-1}_Q(\theta)$ | quantile of the distribution obtained by computing the mean of estimated individual model cumulative probabilities at each target variable value, $F^{-1}_{LOP}(x)$ |
| cdf | mean of individual model cumulative probabilities at each target variable value, $F_{LOP}(x)$ | mean of individual model cumulative probabilities at each target variable value, $F_{LOP}(x)$ |
| pmf | mean of individual model bin probabilities at each target variable value, $f_{LOP}(x)$ | mean of individual model bin probabilities at each target variable value, $f_{LOP}(x)$ |

  2. In the caption of this table, include:

    • a brief orientation of the reader to the relationship between the rows/columns, something like: "For probabilistic output types (quantile, cdf, pmf), the output type (rows) determines how the resulting ensemble distribution is summarized (as a quantile $F^{-1}(\theta)$, cumulative distribution function $F(x)$, or probability mass function $f(x)$). The function (columns) determines what kind of operation is performed, and in turn what ensemble distribution is generated (quantile average $F^{-1}_{Q}(\theta)$, or linear pool $F_{LOP}(x)$)."
    • a mention of interpolation for the quantile/linear_pool() cell, with a pointer to the relevant section where the details are discussed.
    • a note that using agg_fun = "median" would replace the mean with the median in each description for simple_ensemble()
  3. Clarify terminology in the methods section. Add the second sentence suggested here (first sentence already in methods): "To define these two classes of methods, let $F(x)$ be a cumulative density function (CDF) defined over values $x$ of the target variable for the prediction, and $F^{-1}(\theta)$ be the corresponding quantile function defined over quantile levels $\theta \in [0, 1]$. Throughout this article, we may refer to $x$ as either 'a value of the target variable' or 'a quantile' depending on the context, and similarly we may refer to $\theta$ as either 'a quantile level' or 'a (cumulative) probability'."

Let me know if I've missed anything!

eahowerton commented 6 months ago

@lshandross I believe the first five comments in this list have been addressed. It seems you've been addressing the later comments along the way too, but didn't want to close the issue before checking with you.