hubverse-org / hubDocs

https://hubverse.io

`sample-output-type.md`: Clarifying questions and editing suggestions #169

Open zkamvar opened 2 months ago

zkamvar commented 2 months ago

I am having a hard time understanding the sample output type chapter. One of the disconnects is that concepts are used before they are ever introduced (e.g. compound tasks and the implicit compound_idx column), and some sections that are only pertinent to compound modelling tasks are not subsections of that section.

Introduction

The introduction for the sample output type needs reworking. From what I've found in the historical documents, it seems that the text in the introduction was written before the sample schema was fleshed out:

https://github.com/hubverse-org/hubDocs/blob/a2738159c43b5e4051e9e0cebf87b293895447ac/docs/source/user-guide/sample-output-type.md?plain=1#L3-L23

When I read it, I wonder, "Why are we talking about the mean output type? This is the sample output type."

Individual modeling tasks

Why is the compound_idx column here? It appears to be reiterating the grouping of the target column. Is this a column I should be worried about? The text says that it is implicit, but why does it have a name that indicates that it is a column that actually exists?

Why is compound_idx not defined in the schema?

Compound modeling tasks

This description could be more specific to the example data presented, highlighting the columns in the text.

we will look at a hub reporting on variant proportions observed at a given location and at a given time. In the table below, a single modeling task is a unique combination of values from the task-id variables origin_date, horizon, variant, and location. In the table below, one set of four rows with the same values in the origin_date, horizon, and location columns but different variant values below represent four predicted variant proportions.

maybe:

we will look at a hub reporting on variant proportions observed in Massachusetts (location) on 2024-03-15 (origin_date) for 7 and 14 day forecasts (horizon). This is represented in the table below showing four variants (AA, BB, CC, and DD) represented over two horizons, giving us eight unique modelling tasks.
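To make the count in the suggested wording concrete, here is a minimal Python sketch that enumerates the modeling tasks from the values named above. The location code "MA" and the variable names are illustrative assumptions, not taken from the hub's configuration.

```python
from itertools import product

# Task-id values taken from the suggested description above.
# "MA" is an assumed abbreviation for Massachusetts.
task_ids = {
    "origin_date": ["2024-03-15"],
    "horizon": [7, 14],
    "variant": ["AA", "BB", "CC", "DD"],
    "location": ["MA"],
}

# One modeling task per unique combination of task-id values.
modeling_tasks = [
    dict(zip(task_ids, combo)) for combo in product(*task_ids.values())
]

print(len(modeling_tasks))  # 8
for task in modeling_tasks:
    print(task)
```

One origin date, two horizons, four variants, and one location give 1 x 2 x 4 x 1 = 8 tasks, which matches the "eight unique modelling tasks" in the suggested text.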

What does "Base data: mean output_type" mean?

https://github.com/hubverse-org/hubDocs/blob/a2738159c43b5e4051e9e0cebf87b293895447ac/docs/source/user-guide/sample-output-type.md?plain=1#L70

Four submissions

NOTE: For each submission, use level 4 headers, not bold text.

Pain points:

  1. the schema says minimum of 90 samples, but we present two samples each.
  2. the sample numbers continue to increase across (as opposed to strictly within) tasks. Why?

Submission A

I am confused as to why the sample numbers keep increasing across the stratification and why there are only two samples per stratum. Should each stratum contain at least 90 samples (according to the schema)?

Submission B

I think I understand now that this is showing two samples per stratum, but I'm still confused as to why the sample numbers continue to increase after a change in strata.

Are the values for each sample identical?

Submission C

In this example, there is a single compound modeling task which we can describe as “Massachusetts with the origin_date of 2024-03-15”.

"single compound modeling task" is confusing because there are two columns selected here. They do not vary, so it makes some sense in retrospect, but the initial read of this gives some roadblocks.

Submission D

The phrase "plain language" is a bit demotivating for a sentence with six prepositions.

Configuration of output_type_id

This description is clear (but it could use some trimming to reduce complexity) and the table illustrates the validity question better, but I feel that two things need to happen:

  1. The section should be renamed to "Configuration of compound_taskid_set"
  2. The section should be a subsection of "Compound modelling tasks"

I feel like this sentence is in conflict with the meaning of the schema (emphasis mine):

A hub can specify a "compound_taskid_set" field in the metadata for the sample output_type to specify the task-id columns that must be used to define separate sample index values (as present in the output_type_id column). The following table shows how different specifications of this field would impact the validity of each of the example submissions A, B, C, and D.

Given that submission C is valid for all of the schema configurations, we should use "may" and not "must".

"columns that may be used to define"
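The "may" versus "must" point can be sketched in Python. This is a minimal illustration under stated assumptions, not the hub validation logic: the helper names are hypothetical, the rows are toy rows without a value column, and the compatibility rule (a sample may span more task-id columns than the configuration lists, but not fewer) is inferred only from the observation that submission C is valid under every configuration.

```python
from collections import defaultdict
from itertools import product

TASK_ID_COLS = ["origin_date", "horizon", "variant", "location"]
HORIZONS = [7, 14]
VARIANTS = ["AA", "BB", "CC", "DD"]


def make_row(horizon, variant, sample_idx):
    """Build a toy row; the value column is left out on purpose."""
    return {
        "origin_date": "2024-03-15",
        "horizon": horizon,
        "variant": variant,
        "location": "MA",  # assumed abbreviation
        "output_type_id": sample_idx,
    }


def implied_compound_taskid_set(rows):
    """Task-id columns that stay constant within every sample index."""
    seen = defaultdict(lambda: defaultdict(set))
    for row in rows:
        for col in TASK_ID_COLS:
            seen[row["output_type_id"]][col].add(row[col])
    return {
        col
        for col in TASK_ID_COLS
        if all(len(values[col]) == 1 for values in seen.values())
    }


def is_compatible(rows, configured_set):
    """ASSUMED rule: the implied set must be a subset of the configured set."""
    return implied_compound_taskid_set(rows) <= set(configured_set)


# Each sample index spans every horizon and variant (in the spirit of submission C).
rows_joint = [
    make_row(h, v, s) for s in range(2) for h, v in product(HORIZONS, VARIANTS)
]

# Each horizon gets its own sample indices, so indices keep increasing across horizons.
rows_by_horizon = [
    make_row(h, v, 2 * i + s)
    for i, h in enumerate(HORIZONS)
    for s in range(2)
    for v in VARIANTS
]

print(sorted(implied_compound_taskid_set(rows_joint)))       # ['location', 'origin_date']
print(sorted(implied_compound_taskid_set(rows_by_horizon)))  # ['horizon', 'location', 'origin_date']
print(is_compatible(rows_joint, ["origin_date", "horizon", "location"]))  # True
print(is_compatible(rows_by_horizon, ["origin_date", "location"]))        # False
```

Under this assumed rule, the configured columns are the ones a submission is allowed to use to separate sample indices, which is why "may" reads more accurately than "must".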

Number of samples

As I indicated above, it doesn't make sense why each task gets two samples. Also, I believe this belongs in the "Compound Modelling Tasks" section.

Relationship to output_types

I think this needs to be a subsection of "Compound Modelling Tasks".

zkamvar commented 2 months ago

I spoke with @elray1 yesterday and he was able to clarify a couple of things:

  1. When I asked about the number of samples versus the schema, the response was: "Was there anything to indicate that these were valid submissions?" I had suspected that was the reason. If nothing else changes, this needs to change. When building understanding in a tutorial, everything must be explicit. The easiest way to do this is to change the schema to allow for a minimum of 2 samples.

  2. The variables in compound_taskid_set are not "strata" (as I had incorrectly interpreted), and the concept behind them is not exactly straightforward. In reality, the compound_taskid_set is important for what it does not include: the variables that are used for the joint distribution (I understand the concept of drawing samples from multivariate distributions, but I still need to wrap my head around what exactly this means). One major problem: joint distributions are mentioned in only one section of the documentation, and that section is not in this chapter (it's in the Formats of Model Output section in the Model Output chapter). It needs to be clearly stated in this chapter.
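A toy Python sketch may help pin down what "joint" means here. It is an illustration under assumptions, not the documentation's example: the Dirichlet(1, 1, 1, 1) distribution, the function names, and the row layout are all hypothetical. Rows that share a sample index come from one joint draw of all four variant proportions, while independent draws give every row its own index.

```python
import random

VARIANTS = ["AA", "BB", "CC", "DD"]


def draw_joint(rng, n_samples):
    """One sample index = one joint draw of all four proportions."""
    rows = []
    for sample_idx in range(n_samples):
        # Normalized Gamma(1, 1) draws form a Dirichlet(1, 1, 1, 1) draw.
        gammas = [rng.gammavariate(1.0, 1.0) for _ in VARIANTS]
        total = sum(gammas)
        for variant, g in zip(VARIANTS, gammas):
            rows.append(
                {"variant": variant, "output_type_id": sample_idx, "value": g / total}
            )
    return rows


def draw_independent(rng, n_samples):
    """Each variant is drawn alone, so every row gets its own sample index."""
    rows = []
    sample_idx = 0
    for variant in VARIANTS:
        for _ in range(n_samples):
            # Beta(1, 3) is the marginal of one Dirichlet(1, 1, 1, 1) component.
            value = rng.betavariate(1.0, 3.0)
            rows.append(
                {"variant": variant, "output_type_id": sample_idx, "value": value}
            )
            sample_idx += 1
    return rows


def totals_by_sample(rows):
    """Sum the values of all rows that share a sample index."""
    totals = {}
    for row in rows:
        totals[row["output_type_id"]] = totals.get(row["output_type_id"], 0.0) + row["value"]
    return totals


rng = random.Random(0)
joint_rows = draw_joint(rng, 3)
independent_rows = draw_independent(rng, 3)

# Joint draws: the four proportions within a sample index sum to 1.
print(all(abs(t - 1.0) < 1e-9 for t in totals_by_sample(joint_rows).values()))

# Independent draws: 12 rows, 12 distinct sample indices, nothing ties rows together.
print(len({row["output_type_id"] for row in independent_rows}))
```

In the joint case the shared index carries information (the proportions are coherent with each other); in the independent case the index keeps increasing across variants because no two rows belong to the same draw.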