Closed chacalle closed 3 years ago
`present_agg_severity` for non-square data should allow keeping of present aggregates when detailed inputs aren't present. Example would be data where source 1 has `male` and `female` rows and source 2 has `all` rows only. When aggregating to `all`, you'd want to keep the present agg for source 2.
Is it possible to speed up both square and non-square aggregation by making this distinction?
> `present_agg_severity` for non-square data should allow keeping of present aggregates when detailed inputs aren't present. Example would be data where source 1 has `male` and `female` rows and source 2 has `all` rows only. When aggregating to `all`, you'd want to keep the present agg for source 2.
Yeah, and that should work if `present_agg_severity = "skip"`.
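The male/female/all case can be sketched in a few lines of plain Python. This is only an illustration of the intended behaviour, not the package's implementation: the function name `aggregate_sex`, the row layout, and the rule that any value other than `"skip"` raises an error are all assumptions made for the example.

```python
from collections import defaultdict

def aggregate_sex(rows, present_agg_severity="stop"):
    """Aggregate 'male' + 'female' rows to 'all' within each source.

    rows: list of dicts with keys 'source', 'sex', 'value' (hypothetical layout).
    """
    detailed = defaultdict(dict)  # source -> {sex: value}
    present = {}                  # source -> value of an existing 'all' row
    for r in rows:
        if r["sex"] == "all":
            present[r["source"]] = r["value"]
        else:
            detailed[r["source"]][r["sex"]] = r["value"]

    # Assumption: anything other than "skip" treats present aggregates as an error.
    if present and present_agg_severity != "skip":
        raise ValueError(f"aggregates already present for sources: {sorted(present)}")

    result = dict(present)  # with "skip", existing aggregates are kept as-is
    for source, by_sex in detailed.items():
        if source in present:
            continue  # keep the present aggregate rather than recomputing it
        if {"male", "female"} <= by_sex.keys():
            result[source] = by_sex["male"] + by_sex["female"]
    return result

rows = [
    {"source": 1, "sex": "male", "value": 10},
    {"source": 1, "sex": "female", "value": 12},
    {"source": 2, "sex": "all", "value": 30},
]
totals = aggregate_sex(rows, present_agg_severity="skip")
```

Here source 1 is aggregated from its detailed rows, while source 2 keeps its existing `all` row.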
> Is it possible to speed up both square and non-square aggregation by making this distinction?
I think the non-square aggregation speed depends on how detailed we want the checks and error messages to be. For example, if aggregating to all-ages, do we want to check that every single combination of `id_cols` has age groups that cover the range from 0 to Inf? Relatedly, do we want the error message to identify the problematic subset of the data, or just say that in general there is something wrong?
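To make the trade-off concrete, here is a rough sketch of the detailed version of that check: for every combination of `id_cols`, confirm the age intervals tile the range from 0 to Inf, and report which combinations fail and why. The function name `find_incomplete_groups` and the `age_start` / `age_end` column names are assumptions for illustration only.

```python
import math
from collections import defaultdict

def find_incomplete_groups(rows, id_cols, start=0, end=math.inf):
    """Return {id_cols combination: message} for groups that do not tile [start, end)."""
    groups = defaultdict(list)
    for r in rows:
        key = tuple(r[c] for c in id_cols)
        groups[key].append((r["age_start"], r["age_end"]))

    problems = {}
    for key, intervals in groups.items():
        expected = start  # where the next interval should begin
        for lo, hi in sorted(intervals):
            if lo > expected:
                problems[key] = f"gap between ages {expected} and {lo}"
                break
            if lo < expected:
                problems[key] = f"overlap at age {lo}"
                break
            expected = hi
        else:
            # No gap or overlap found; check the intervals reach the end.
            if expected != end:
                problems[key] = f"coverage stops at age {expected}"
    return problems

rows = [
    {"location": "A", "age_start": 0, "age_end": 5},
    {"location": "A", "age_start": 5, "age_end": math.inf},
    {"location": "B", "age_start": 0, "age_end": 5},
    {"location": "B", "age_start": 10, "age_end": math.inf},
]
problems = find_incomplete_groups(rows, id_cols=["location"])
```

The per-group sort is what makes this more expensive than a single global check, but it is also what lets the message point at location `B` specifically.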
@chacalle Do you want to create a separate discussion for scaling or should I post here?
Basic Description
The aggregation function should be used to aggregate values to different levels of a hierarchy. Input data `dt` is defined by a set of `id_cols` along with a column that is to be aggregated, `col_stem`.
The column that is to be aggregated can be one of two different types of variable (`col_type`):
- `col_type = categorical`
- `col_type = interval`
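As a rough illustration of the categorical case, the sketch below sums values up a location hierarchy described by a child-to-parent mapping. The names `aggregate_to_parents` and `parent_of` are hypothetical, and the sketch assumes values are supplied only for the most detailed nodes, so no aggregate is counted twice.

```python
from collections import defaultdict

def aggregate_to_parents(values, parent_of):
    """Sum leaf values up a hierarchy.

    values: {node: value} for the most detailed nodes only (assumption).
    parent_of: {child: parent} mapping describing the hierarchy.
    Returns {node: value} for every input node and all of its ancestors.
    """
    totals = defaultdict(float)
    for node, value in values.items():
        totals[node] += value
        ancestor = parent_of.get(node)
        while ancestor is not None:  # walk up until the root is passed
            totals[ancestor] += value
            ancestor = parent_of.get(ancestor)
    return dict(totals)

parent_of = {
    "county_1": "state_A",
    "county_2": "state_A",
    "state_A": "national",
    "state_B": "national",
}
values = {"county_1": 1.0, "county_2": 2.0, "state_B": 4.0}
totals = aggregate_to_parents(values, parent_of)
```

Each leaf contributes once to every level above it, so `state_A` is the sum of its counties and `national` is the sum of everything.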
Two basic types of use cases
Implementation Details
Assertions
Square datasets only
`id_cols` exist.
Non-square datasets only
What is the expected behavior when...
- aggregates already exist in the input data? `present_agg_severity`
- it is not possible to make an aggregate given the available input data? `missing_dt_severity`
  For example, when aggregating to a national location, one subnational may be missing.
- interval variables do not exactly match up in the input data? `collapse_interval_cols`
  For example, when aggregating to a national location, one subnational may have five-year age groups and another has single-year age groups.
- intervals overlap when aggregating an interval variable, or when collapsing interval `id_col` variables? `overlapping_dt_severity`
  For example, when aggregating to all-ages, a certain combination of `id_cols` has both single-year and five-year age groups.
- `value_cols` have `NA` values, like #49? `na_value_severity`
  `NA` values and include in aggregation.
Implementation steps
- `square` argument to determine amount of flexibility in inputs.
- `na_value_severity` argument
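For the `collapse_interval_cols` case (one subnational with single-year age groups, another with five-year groups), one possible approach is to keep only the interval boundaries that every group shares and then sum detailed values into those common intervals. The sketch below is an assumption about how this could work, not the package's algorithm; `common_intervals` and `collapse` are hypothetical names, and it assumes each group's intervals are non-overlapping and cover the same overall range.

```python
def common_intervals(groups):
    """Find the most detailed intervals shared by every group.

    groups: {group_id: list of (start, end) intervals tiling the same range}.
    """
    # A boundary survives only if every group has a break at that point.
    boundary_sets = []
    for intervals in groups.values():
        points = set()
        for lo, hi in intervals:
            points.update((lo, hi))
        boundary_sets.append(points)
    shared = sorted(set.intersection(*boundary_sets))
    return list(zip(shared[:-1], shared[1:]))

def collapse(values_by_interval, target):
    """Sum values of detailed intervals into the target intervals containing them."""
    out = {t: 0 for t in target}
    for (lo, hi), value in values_by_interval.items():
        for t_lo, t_hi in target:
            if t_lo <= lo and hi <= t_hi:
                out[(t_lo, t_hi)] += value
                break
    return out

single_year = {(0, 1): 1, (1, 2): 1, (2, 3): 1, (3, 4): 1, (4, 5): 1, (5, 10): 5}
five_year = {(0, 5): 7, (5, 10): 3}

target = common_intervals({"single": list(single_year), "five": list(five_year)})
collapsed = collapse(single_year, target)
```

After collapsing, both subnationals are on the same age groups and can be summed to the national level interval by interval.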