ihmeuw-demographics / hierarchyUtils

Demographics Related Utility Functions
https://ihmeuw-demographics.github.io/hierarchyUtils/
BSD 3-Clause "New" or "Revised" License

Discussion: aggregation function design #51

Closed. chacalle closed this issue 3 years ago.

chacalle commented 3 years ago

Basic Description

The aggregation function should be used to aggregate values to different levels of a hierarchy. Input data dt is defined by a set of id_cols, along with a column to be aggregated that is identified by col_stem.

The column to be aggregated can be one of two different types of variable, specified by col_type:

  1. Categorical variable, like location or sex. col_type = categorical
  2. Numeric interval variable, like age or year, defined by the start and end of each interval. col_type = interval
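As a toy illustration of these conventions (the package itself is R; this Python/pandas sketch only illustrates the intended semantics, and the data are made up):

```python
import pandas as pd

# Toy input dt: 'location' and 'sex' are categorical id_cols,
# 'age_start'/'age_end' define an interval id_col, 'value' is aggregated.
dt = pd.DataFrame({
    "location": ["A", "A", "B", "B"],
    "sex": ["female", "male", "female", "male"],
    "age_start": [0, 0, 0, 0],
    "age_end": [5, 5, 5, 5],
    "value": [10, 12, 7, 9],
})

# Aggregating the categorical 'sex' column to 'all' sums over its levels
# within every other id_col combination.
agg = dt.groupby(["location", "age_start", "age_end"], as_index=False)["value"].sum()
agg["sex"] = "all"
print(agg[["location", "sex", "value"]])
```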

Two basic types of use cases

  1. Input data is expected to be "square" and to exactly match the pre-defined hierarchy. Only basic assertions need to be done, and the function should be optimized for speed.
  2. Input data is not "square" and may not match the pre-defined hierarchy exactly. More detailed assertions and standardization need to be done.
    • Example 1: aggregating across locations, some years may have different sets of locations available.
    • Example 2: aggregating across locations, some locations may have different age groups available, so they need to be collapsed to the most detailed common age groups prior to each level of aggregation.
    • Example 3: aggregating across age groups, some locations may have different age groups available, so each detailed age group needs to map correctly to the aggregate, and each location needs to cover the entire expected age range.
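One way to make the square/non-square distinction concrete (a hypothetical check, not the package's actual implementation) is to ask whether every combination of the observed id_col levels appears exactly once:

```python
import itertools
import pandas as pd

# Hypothetical squareness check: data are "square" when every combination
# of the observed id_col levels appears exactly once.
def is_square(dt, id_cols):
    levels = [dt[c].unique() for c in id_cols]
    expected = set(itertools.product(*levels))
    observed = set(map(tuple, dt[id_cols].itertuples(index=False)))
    return len(dt) == len(expected) and observed == expected

dt = pd.DataFrame({
    "location": ["A", "A", "B"],
    "year": [2000, 2001, 2000],  # location B is missing year 2001
    "value": [1.0, 2.0, 3.0],
})
print(is_square(dt, ["location", "year"]))  # False
```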

Implementation Details

Assertions

Square datasets only

Non-square datasets only

What is the expected behavior when...

aggregates already exist in the input data? present_agg_severity

  1. Default is to throw an error.
  2. Warn or ignore, then drop the existing aggregates and continue.
  3. Skip the check and keep the existing aggregate rows alongside the newly created aggregate values.
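The three behaviors above could be sketched as follows (a Python illustration with hypothetical function and severity names, not the package's R implementation):

```python
import warnings
import pandas as pd

# Hypothetical sketch of the present_agg_severity options for a
# categorical column whose aggregate level already exists in the data.
def handle_present_agg(dt, col, agg_level, severity="stop"):
    present = dt[col] == agg_level
    if severity == "skip" or not present.any():
        return dt  # option 3: keep existing aggregate rows as-is
    if severity == "stop":
        raise ValueError(f"aggregate '{agg_level}' already present in '{col}'")
    if severity == "warning":
        warnings.warn(f"dropping existing '{agg_level}' rows in '{col}'")
    return dt[~present]  # option 2: drop the aggregates and continue

dt = pd.DataFrame({"sex": ["female", "male", "all"], "value": [1, 2, 3]})
print(len(handle_present_agg(dt, "sex", "all", severity="none")))  # 2
```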

it is not possible to make an aggregate given the available input data? missing_dt_severity

For example, when aggregating to a national location, one subnational may be missing.

  1. Default is to throw an error.
  2. Warn or ignore, then skip impossible aggregations and continue with the others.
  3. Skip the check and make the aggregate anyway.
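A minimal sketch of these options for the subnational-to-national example (hypothetical names, Python for illustration only):

```python
import pandas as pd

# Hypothetical sketch of missing_dt_severity when rolling subnationals up
# to a national total.
def aggregate_children(dt, expected_children, severity="stop"):
    missing = set(expected_children) - set(dt["location"])
    if missing and severity == "stop":
        raise ValueError(f"missing children: {sorted(missing)}")
    if missing and severity != "skip":
        return None  # option 2: skip this impossible aggregate, keep going
    return dt["value"].sum()  # option 3 (or nothing missing): aggregate anyway

dt = pd.DataFrame({"location": ["sub1", "sub2"], "value": [4, 6]})
# sub3 is absent, so the national total is not really possible:
print(aggregate_children(dt, ["sub1", "sub2", "sub3"], severity="skip"))  # 10
```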

when interval variables do not exactly match up in the input data? collapse_interval_cols

For example, when aggregating to a national location, one subnational may have five-year age groups while another has single-year age groups.

  1. Default is to throw an error.
  2. Option to automatically collapse to the most detailed common intervals.
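The automatic collapse could work like this (a hypothetical Python sketch of the idea, not the package's code): take the intersection of every location's age breakpoints as the most detailed common intervals, then sum finer groups into them.

```python
import pandas as pd

# Hypothetical sketch of collapsing interval columns to the most detailed
# common intervals across locations.
def collapse_common(dt):
    pts = None
    for _, g in dt.groupby("location"):
        p = set(g["age_start"]) | set(g["age_end"])
        pts = p if pts is None else pts & p
    edges = sorted(pts)
    out = dt.copy()
    # Snap each interval onto the common interval that contains its start.
    out["age_start"] = [max(e for e in edges if e <= s) for s in out["age_start"]]
    out["age_end"] = [min(e for e in edges if e > s) for s in out["age_start"]]
    return out.groupby(["location", "age_start", "age_end"], as_index=False)["value"].sum()

a = pd.DataFrame({"location": "A", "age_start": [0, 5], "age_end": [5, 10], "value": [3, 4]})
b = pd.DataFrame({"location": "B", "age_start": range(10), "age_end": range(1, 11), "value": [1] * 10})
print(collapse_common(pd.concat([a, b], ignore_index=True)))
```

Here location B's single-year groups get collapsed into A's five-year groups, since {0, 5, 10} is the common set of breakpoints.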

when aggregating an interval variable, intervals overlap? Or when collapsing interval id_col variables, intervals overlap? overlapping_dt_severity

For example, when aggregating to all-ages, a certain combination of id_cols may have both single-year and five-year age groups.

  1. Default is to throw an error.
  2. Warn or ignore, then drop overlapping intervals and continue.
  3. Skip the check and continue with the aggregation.
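The overlap check itself is simple (hypothetical sketch in Python; the condition is that, after sorting, no interval starts before the previous one ends):

```python
import pandas as pd

# Hypothetical overlap check for one id_cols combination's intervals.
def has_overlap(g):
    g = g.sort_values("age_start")
    return bool((g["age_start"].values[1:] < g["age_end"].values[:-1]).any())

# A 0-5 group mixed with single-year 0-1 ... 4-5 groups overlaps:
mixed = pd.DataFrame({
    "age_start": [0, 0, 1, 2, 3, 4],
    "age_end": [5, 1, 2, 3, 4, 5],
})
print(has_overlap(mixed))  # True
```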

when value_cols have NA values (like #49)? na_value_severity

  1. Default is to throw an error.
  2. Warn or ignore, then drop missing values and continue with the aggregation.
  3. Skip the check for NA values and include them in the aggregation.
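Sketched for a simple sum (hypothetical Python illustration; note that under option 3, including NAs typically makes the aggregate itself NA):

```python
import pandas as pd

# Hypothetical sketch of na_value_severity when summing a value column.
def sum_values(values, severity="stop"):
    s = pd.Series(values, dtype="float64")
    if s.isna().any():
        if severity == "stop":
            raise ValueError("NA values found in value_cols")
        if severity != "skip":
            s = s.dropna()  # option 2: drop missing values and continue
    # option 3 ("skip") keeps the NAs, so the aggregate itself becomes NA
    return s.sum(skipna=False)

print(sum_values([1.0, 2.0, float("nan")], severity="none"))  # 3.0
```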

Implementation steps

  1. Clean up the testing script for aggregation (right now it is potentially too long and hard to follow).
  2. Add a square argument to determine the amount of flexibility in the inputs.
  3. Add a na_value_severity argument.

krpaulson commented 3 years ago

present_agg_severity for non-square data should allow keeping present aggregates when detailed inputs aren't present. An example would be data where source 1 has male and female rows and source 2 has only "all" rows. When aggregating to "all" you'd want to keep the present aggregate for source 2.

krpaulson commented 3 years ago

Is it possible to speed up both square and non-square aggregation by making this distinction?

chacalle commented 3 years ago

present_agg_severity for non-square data should allow keeping present aggregates when detailed inputs aren't present. An example would be data where source 1 has male and female rows and source 2 has only "all" rows. When aggregating to "all" you'd want to keep the present aggregate for source 2.

Yeah, and that should work if present_agg_severity = "skip".

chacalle commented 3 years ago

Is it possible to speed up both square and non-square aggregation by making this distinction?

I think the non-square aggregation speed depends on how detailed we want the checks and error messages to be. For example, if aggregating to all-ages, do we want to check that every single combination of id_cols has age groups that cover the range from 0 to Inf? Relatedly, do we want the error message to identify the problematic subset of the data, or just say that in general something is wrong?
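The detailed version of that check could look something like this (hypothetical Python sketch: one id_cols combination's intervals must start at 0, end at Inf, and meet with no gaps):

```python
import math
import pandas as pd

# Hypothetical coverage check: do the intervals tile [start, end) exactly?
def covers_range(g, start=0, end=math.inf):
    g = g.sort_values("age_start")
    if g["age_start"].iloc[0] != start or g["age_end"].iloc[-1] != end:
        return False
    # Adjacent intervals must meet exactly: no gaps, no overlaps.
    return bool((g["age_start"].values[1:] == g["age_end"].values[:-1]).all())

full = pd.DataFrame({"age_start": [0, 5, 10], "age_end": [5, 10, math.inf]})
gap = pd.DataFrame({"age_start": [0, 10], "age_end": [5, math.inf]})
print(covers_range(full), covers_range(gap))  # True False
```

Running this per combination of id_cols is what makes the detailed error messages possible, and also what makes them expensive.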

krpaulson commented 3 years ago

@chacalle Do you want to create a separate discussion for scaling or should I post here?