Open zkamvar opened 2 months ago
I spoke with @elray1 yesterday and he was able to clarify a couple of things:
When I asked about the number of samples vs. the schema: "Was there anything to indicate that these were valid submissions?" I had suspected that was the reason. If nothing else changes, this needs to change. When building understanding in a tutorial, everything must be explicit. The easiest way to do this is to change the schema to allow a minimum of 2 samples.
The `compound_taskid_set` entries are not "strata" (as I had incorrectly interpreted), and the concept behind them is not exactly straightforward. In reality, the `compound_taskid_set` is important for what it does not include: the variables that are used for the joint distribution (I understand the concept of drawing samples from multivariate distributions, but I still need to wrap my head around what exactly this means). One major problem: joint distributions are only ever mentioned in one section of the documentation, and that section is not in this chapter (it's in the Formats of Model Output section in the Model Output chapter). This needs to be clearly stated in this chapter.
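To make the "what it does not include" point concrete, here is a minimal sketch of how the chapter could illustrate it. It reflects one reading of the concept, not the documented behavior: the rows, the task ID names (`location`, `horizon`), and the helper `joint_draws` are all hypothetical. The assumption is that rows sharing the same `compound_taskid_set` values and the same `output_type_id` form a single draw from a joint distribution over the remaining task IDs.

```python
from collections import defaultdict

# Hypothetical model output rows; location and horizon are the task IDs,
# output_type_id is the sample index.
rows = [
    {"location": "MA", "horizon": 1, "output_type_id": "s1", "value": 10},
    {"location": "MA", "horizon": 2, "output_type_id": "s1", "value": 12},
    {"location": "MA", "horizon": 1, "output_type_id": "s2", "value": 9},
    {"location": "MA", "horizon": 2, "output_type_id": "s2", "value": 8},
    {"location": "CA", "horizon": 1, "output_type_id": "s3", "value": 30},
    {"location": "CA", "horizon": 2, "output_type_id": "s3", "value": 33},
    {"location": "CA", "horizon": 1, "output_type_id": "s4", "value": 28},
    {"location": "CA", "horizon": 2, "output_type_id": "s4", "value": 27},
]

task_ids = ["location", "horizon"]
compound_taskid_set = ["location"]  # assumed configuration for this sketch

# The task IDs left out of compound_taskid_set are the ones a single
# sample is assumed to be joint across.
joint_over = [c for c in task_ids if c not in compound_taskid_set]


def joint_draws(rows, compound_taskid_set):
    """Group rows into draws: same compound task values and same sample index."""
    draws = defaultdict(list)
    for row in rows:
        compound_key = tuple(row[c] for c in compound_taskid_set)
        draws[(compound_key, row["output_type_id"])].append(row)
    return dict(draws)


draws = joint_draws(rows, compound_taskid_set)
for (compound_key, sample_id), members in draws.items():
    horizons = [m["horizon"] for m in members]
    print(compound_key, sample_id, "is one draw across", joint_over, horizons)
```

Under these assumptions, each sample index within a location describes a whole trajectory across horizons rather than two unrelated values.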
I am having a hard time understanding the sample output type chapter. One of the disconnects is that concepts are presented before they are ever introduced (e.g. compound tasks and the implicit `compound_idx` column), and there are sections that are only pertinent to compound modelling tasks that are not subsections of that section.
Introduction
The introduction for the sample output type needs reworking. From what I've found in the historical documents, it seems that the text in the introduction was written before the sample schema was fleshed out:
https://github.com/hubverse-org/hubDocs/blob/a2738159c43b5e4051e9e0cebf87b293895447ac/docs/source/user-guide/sample-output-type.md?plain=1#L3-L23
When I read it, I wonder, "Why are we talking about the mean output type? This is the sample output type."
Individual modeling tasks
Why is the `compound_idx` column here? It appears to be reiterating the grouping of the target column. Is this a column I should be worried about? The text says that it is implicit, but why does it have a name that indicates that it is a column that actually exists?
Why is `compound_idx` not defined in the schema?
Compound modeling tasks
This description could be more specific to the example data presented, highlighting the columns in the text.
maybe:
What does "Base data: mean `output_type`" mean?
https://github.com/hubverse-org/hubDocs/blob/a2738159c43b5e4051e9e0cebf87b293895447ac/docs/source/user-guide/sample-output-type.md?plain=1#L70
Four submissions
NOTE: For each submission, use level 4 headers, not bold text.
Pain points:
Submission A
I am confused as to why the sample numbers keep increasing across the stratification and why there are only two samples per stratum. Should each stratum contain at least 90 samples, according to the schema?
Submission B
I think I understand now that this is showing two samples per stratum, but I'm still confused as to why the sample numbers continue to increase after a change in strata.
Are the values for each sample identical?
Submission C
"single compound modeling task" is confusing because there are two columns selected here. They do not vary, so it makes some sense in retrospect, but the initial read of this gives some roadblocks.
Submission D
The phrase "plain language" is a bit demotivating for a sentence with six prepositions.
Configuration of `output_type_id`
This description is clear (but it could use some trimming to reduce complexity) and the table illustrates the validity question better, but I feel that two things need to happen:
compound_taskid_set
I feel like this sentence is in conflict with the meaning of the schema (emphasis mine)
Given that submission C is valid for all of the schema configurations, we should use "may" and not "must".
"columns that may be used to define"
Number of samples
As I indicated above, it is not clear why each task gets two samples. Also, I believe this belongs in the "Compound Modelling Tasks" section.
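A short sketch of the kind of check that would make the mismatch explicit in the tutorial. Everything here is an assumption for illustration: the rows are made up, the helper names are hypothetical, and `min_samples_per_task` is used as the name of the schema minimum on the assumption that this is what the schema calls it.

```python
from collections import defaultdict

# Hypothetical submission with two samples per compound modelling task.
rows = [
    {"location": "MA", "horizon": 1, "output_type_id": "s1"},
    {"location": "MA", "horizon": 2, "output_type_id": "s1"},
    {"location": "MA", "horizon": 1, "output_type_id": "s2"},
    {"location": "MA", "horizon": 2, "output_type_id": "s2"},
    {"location": "CA", "horizon": 1, "output_type_id": "s3"},
    {"location": "CA", "horizon": 2, "output_type_id": "s3"},
    {"location": "CA", "horizon": 1, "output_type_id": "s4"},
    {"location": "CA", "horizon": 2, "output_type_id": "s4"},
]


def samples_per_compound_task(rows, compound_taskid_set):
    """Count distinct sample indices within each compound modelling task."""
    ids = defaultdict(set)
    for row in rows:
        key = tuple(row[c] for c in compound_taskid_set)
        ids[key].add(row["output_type_id"])
    return {key: len(sample_ids) for key, sample_ids in ids.items()}


def tasks_below_minimum(rows, compound_taskid_set, min_samples_per_task):
    """Return the compound tasks that have fewer samples than the minimum."""
    counts = samples_per_compound_task(rows, compound_taskid_set)
    return {k: n for k, n in counts.items() if n < min_samples_per_task}


counts = samples_per_compound_task(rows, ["location"])
too_few_at_90 = tasks_below_minimum(rows, ["location"], 90)
too_few_at_2 = tasks_below_minimum(rows, ["location"], 2)
print(counts)
print(too_few_at_90)
print(too_few_at_2)
```

With a minimum of 90, both tasks in this toy submission fall short; with a minimum of 2, none do, which is the schema change suggested above.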
Relationship to output_types
I think this needs to be a subsection of "Compound Modelling Tasks".