ihmeuw-demographics / hierarchyUtils

Demographics Related Utility Functions
https://ihmeuw-demographics.github.io/hierarchyUtils/
BSD 3-Clause "New" or "Revised" License

Additional speed improvements for 'agg' from profiling #56

Closed chacalle closed 3 years ago

chacalle commented 3 years ago

Describe changes

I used R profiling tools to speed up the aggregation function on large datasets. I also included examples as comments to show how I profiled.

Here are the timings from the performance vignette (#55) with these changes:

> agg_timings
    col_stem    col_type n_draws         method n_input_rows user.self sys.self elapsed
 1:      age    interval       1     data.table       13,632     0.081    0.005   0.088
 2:      age    interval       1 hierarchyUtils       13,632     4.001    0.213   4.269
 3:      age    interval      10     data.table      136,320     0.155    0.025   0.183
 4:      age    interval      10 hierarchyUtils      136,320     3.310    0.192   3.526
 5:      age    interval     100     data.table    1,363,200     0.739    0.235   0.989
 6:      age    interval     100 hierarchyUtils    1,363,200     9.141    1.084  10.406
 7:      age    interval    1000     data.table   13,632,000     6.993    2.573   9.762
 8:      age    interval    1000 hierarchyUtils   13,632,000    55.354    8.871  65.327
 9:      sex categorical       1     data.table       13,632     0.009    0.001   0.009
10:      sex categorical       1 hierarchyUtils       13,632     0.086    0.007   0.092
11:      sex categorical      10     data.table      136,320     0.063    0.009   0.073
12:      sex categorical      10 hierarchyUtils      136,320     0.759    0.048   0.817
13:      sex categorical     100     data.table    1,363,200     0.605    0.082   0.700
14:      sex categorical     100 hierarchyUtils    1,363,200     6.453    0.558   7.169
15:      sex categorical    1000     data.table   13,632,000     5.692    1.021   6.987
16:      sex categorical    1000 hierarchyUtils   13,632,000    57.967    5.711  65.056
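
The user.self, sys.self and elapsed column names match what base R's system.time returns. As a rough sketch of how one such row could be collected (not the vignette's actual code), the snippet below times a plain data.table aggregation over both sexes. The input layout, the single `value` column and the `timing_row` object are assumptions made for illustration; the dimensions are chosen to reproduce the 136,320-row case (71 years x 2 sexes x 96 ages x 10 draws).

```r
library(data.table)

# illustrative input: 71 years x 2 sexes x 96 single-year ages x 10 draws
input_dt <- CJ(
  year = 1950:2020,
  sex = c("male", "female"),
  age = 0:95,
  draw = 1:10
)
input_dt[, value := 1]

# time a plain data.table aggregation that sums over sex
timing <- system.time(
  agg_dt <- input_dt[, list(value = sum(value)), by = c("year", "age", "draw")]
)

# keep the three timing columns reported in the table above
timing_row <- data.table(
  method = "data.table",
  n_input_rows = nrow(input_dt),
  user.self = timing[["user.self"]],
  sys.self = timing[["sys.self"]],
  elapsed = timing[["elapsed"]]
)
print(timing_row)
```

Absolute timings will differ by machine, so only the relative gap between methods is meaningful when comparing runs.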

What issues are related

Related to #47

chacalle commented 3 years ago

I used the profvis R package to identify which parts of hierarchyUtils::agg were taking up the most time.

Under the hood, profvis "uses data collected by Rprof, which is part of the base R distribution. At each time interval (profvis uses a default interval of 10ms), the profiler stops the R interpreter, looks at the current function call stack, and records it to a file. Because it works by sampling, the result isn’t deterministic. Each time you profile your code, the result will be slightly different."
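
To make the sampling mechanism concrete, here is a minimal sketch that calls Rprof directly, without profvis. `slow_sum` is a hypothetical stand-in for the function being profiled, and the 5 ms interval is an arbitrary choice for the example.

```r
# hypothetical slow function, used only to give the profiler something to sample
slow_sum <- function(n) {
  total <- 0
  for (i in seq_len(n)) {
    total <- total + sqrt(i)
  }
  total
}

prof_file <- tempfile(fileext = ".out")

# start recording the call stack to prof_file every 5 ms
Rprof(prof_file, interval = 0.005)
result <- slow_sum(2e7)
Rprof(NULL)  # stop profiling

# summarise the recorded samples by function
prof_summary <- summaryRprof(prof_file)
head(prof_summary$by.total)
```

Because the profiler samples rather than instruments, the reported times vary a little between runs, and very short calls may not be sampled at all.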

profvis returns an interactive graphical interface that shows roughly how much time was spent on each line and links to the exact lines of code in the package. For example, below I can see that over the entire profiling period about 18 seconds were spent within check_agg_scale_subtree_dt. This lets me pick exact spots to speed up.

[Screenshot of the profvis output, 2020-12-09]

library(hierarchyUtils)
library(data.table)
library(profvis)

n_draws <- 1000

# default variables for aggregation timings
age_mapping <- data.table(age_start = c(0, seq(0, 90, 5)), age_end = c(Inf, seq(5, 95, 5)))
sex_mapping <- data.table(parent = "all", child = c("male", "female"))
agg_id_vars <- list(
  location = 1,
  year_start = seq(1950, 2020, 1),
  sex = c("male", "female"),
  age_start = seq(0, 95, 1),
  value1 = 1, value2 = 1
)

# create input dataset
agg_id_vars <- copy(agg_id_vars)
agg_id_vars[["draw"]] <- 1:n_draws
input_dt <- do.call(CJ, agg_id_vars)

# add interval end columns
input_dt[, year_end := year_start + 1]
input_dt[, age_end := age_start + 1]
input_dt[age_start == 95, age_end := Inf]

# identify value and id cols
value_cols <- grep("value", names(input_dt), value = TRUE)
id_cols <- names(input_dt)[!names(input_dt) %in% value_cols]

profvis::profvis(
  expr = {
    hierarchyUtils_output_dt <- agg(
      dt = input_dt,
      id_cols = id_cols, value_cols = value_cols,
      col_stem = "age", col_type = "interval",
      mapping = age_mapping
    )
  },
  interval = 0.005 # sample the call stack every 5 ms (profvis default is 10 ms)
)

krpaulson commented 3 years ago

In the PR description, maybe include the speed vignette output before the changes as well as after the changes? Not crucial here, but could be useful if you create similar PRs in the future.

chacalle commented 3 years ago

I have now added a section on this to the packageTemplate wiki: https://github.com/ihmeuw-demographics/packageTemplate/wiki/Profiling-R-Functions