ihmeuw-demographics / hierarchyUtils

Demographics Related Utility Functions
https://ihmeuw-demographics.github.io/hierarchyUtils/
BSD 3-Clause "New" or "Revised" License

Speed up aggregation of interval variables #73

Closed chacalle closed 3 years ago

chacalle commented 3 years ago

Describe changes

Speeds up aggregation of interval variables when the interval sets in the input data vary widely across combinations of id_cols. I will highlight the main changes in the code; many of the other changes just move code around.
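To make "interval sets that vary across id_cols combinations" concrete, here is a minimal sketch on toy data using only data.table. The column values, the `interval_set` signature column, and the idea of counting distinct sets are illustrative assumptions, not code from this PR and not necessarily how agg() handles this internally.

```r
library(data.table)

# Hypothetical toy input: three locations reporting age-specific values.
# Locations 1 and 3 share the same age intervals; location 2 uses wider ones.
dt <- data.table(
  location_id = c(1, 1, 1, 2, 2, 3, 3, 3),
  age_start   = c(0, 5, 10, 0, 10, 0, 5, 10),
  age_end     = c(5, 10, Inf, 10, Inf, 5, 10, Inf),
  mean        = c(10, 20, 30, 25, 35, 11, 21, 31)
)

# Build one text signature per id combination describing its interval set.
# Intervals are written as left-closed, right-open: [start, end).
sets <- dt[
  order(age_start, age_end),
  .(interval_set = paste0("[", age_start, ", ", age_end, ")", collapse = " ")),
  keyby = location_id
]

# Count how many id combinations share each distinct interval set.
set_counts <- sets[, .(n_groups = .N), by = interval_set]

print(sets)
print(set_counts)
```

In real input data the number of distinct interval sets can be large, and that is the situation this change targets.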

I made sure the timing comparisons in the performance article either stayed the same or got faster: https://ihmeuw-demographics.github.io/hierarchyUtils/articles/agg_scale_performance.html#timing-comparison-1

I also aggregated all population input data in under 1 minute:

library(data.table)     # data.table()
library(hierarchyUtils) # agg()

data <- demInternal::get_dem_outputs(
  process_name = "census raw data",
  gbd_year = 2020, name_cols = TRUE
)

# aggregate interval variable
test <- data[, list(
  location_id, year_id, sex_id, age_start, age_end, nid, underlying_nid,
  source_name, record_type_id, method, outlier_type, pes_adjustment_type_id,
  mean
)]
test <- test[age_start != 999 & sex_id != 3]
output_dt <- agg(
  dt = test, 
  id_cols = setdiff(names(test), "mean"),
  value_cols = "mean",
  col_stem = "age", col_type = "interval", 
  mapping = data.table(age_start = c(0, 0, 80), age_end = c(Inf, 5, Inf)),
  present_agg_severity = "none",
  missing_dt_severity = "message"
)

# aggregate categorical variable
test <- data[, list(
  location_id, year_id, sex_id, age_start, age_end, nid, underlying_nid,
  source_name, record_type_id, method, outlier_type, pes_adjustment_type_id,
  mean
)]
output_dt <- agg(
  dt = test, 
  id_cols = setdiff(names(test), "mean"),
  value_cols = "mean",
  col_stem = "sex_id", col_type = "categorical", 
  mapping = data.table(parent = 3, child = c(1, 2, 4)),
  present_agg_severity = "message",
  missing_dt_severity = "message"
)

What issues are related

Fixes #60
Fixes #64

Related to #51

Checklist
