Closed: chacalle closed this 3 years ago
I used the profvis R package to identify which parts of hierarchyUtils::agg were taking the most time.
Under the hood, profvis "uses data collected by Rprof, which is part of the base R distribution. At each time interval (profvis uses a default interval of 10ms), the profiler stops the R interpreter, looks at the current function call stack, and records it to a file. Because it works by sampling, the result isn’t deterministic. Each time you profile your code, the result will be slightly different."
profvis returns an interactive graphical interface that shows roughly how much time was spent on each line and links to the exact lines of code in the package. For example, below I can see that over the entire profiling period about 18 seconds were spent within check_agg_scale_subtree_dt. This lets me pick exact spots to speed up.
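To make the sampling mechanism concrete, here is a minimal sketch that calls base R's Rprof directly, without profvis. The functions `slow_step` and `busy_work` are hypothetical stand-ins for an expensive call such as the aggregation; they are not part of hierarchyUtils. Because the profiler samples, the reported times will vary a little between runs.

```r
# Hypothetical workload standing in for an expensive function call;
# these helpers are illustrative only and not part of hierarchyUtils.
slow_step <- function(n) {
  x <- runif(n)
  sort(x)
}

busy_work <- function(seconds = 0.5) {
  start <- proc.time()[["elapsed"]]
  # keep the interpreter busy long enough to collect several samples
  while (proc.time()[["elapsed"]] - start < seconds) {
    slow_step(1e5)
  }
  invisible(NULL)
}

prof_file <- tempfile(fileext = ".out")

Rprof(prof_file, interval = 0.01)  # sample the call stack every 10 ms
busy_work()
Rprof(NULL)                        # stop profiling

# summarise the recorded call stacks by function
prof_summary <- summaryRprof(prof_file)
print(head(prof_summary$by.total))

unlink(prof_file)
```

The `by.total` table lists time spent in each function including its callees, which is roughly the same information profvis presents graphically, only without the per-line source view.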
library(hierarchyUtils)
library(data.table)
library(profvis)

n_draws <- 1000

# default variables for aggregation timings
age_mapping <- data.table(age_start = c(0, seq(0, 90, 5)), age_end = c(Inf, seq(5, 95, 5)))
sex_mapping <- data.table(parent = "all", child = c("male", "female"))
agg_id_vars <- list(
  location = 1,
  year_start = seq(1950, 2020, 1),
  sex = c("male", "female"),
  age_start = seq(0, 95, 1),
  value1 = 1, value2 = 1
)

# create input dataset
agg_id_vars <- copy(agg_id_vars)
agg_id_vars[["draw"]] <- 1:n_draws
input_dt <- do.call(CJ, agg_id_vars)

# add interval end columns
input_dt[, year_end := year_start + 1]
input_dt[, age_end := age_start + 1]
input_dt[age_start == 95, age_end := Inf]

# identify value and id cols
value_cols <- grep("value", names(input_dt), value = TRUE)
id_cols <- names(input_dt)[!names(input_dt) %in% value_cols]

profvis::profvis(
  expr = {
    hierarchyUtils_output_dt <- agg(
      dt = input_dt,
      id_cols = id_cols, value_cols = value_cols,
      col_stem = "age", col_type = "interval",
      mapping = age_mapping
    )
  },
  interval = 0.005
)
In the PR description, maybe include the speed vignette output before the changes as well as after the changes? Not crucial here, but could be useful if you create similar PRs in the future.
I have now added a section on this to the packageTemplate wiki: https://github.com/ihmeuw-demographics/packageTemplate/wiki/Profiling-R-Functions
Describe changes
I used R profiling tools to speed up the aggregation function when working with a large dataset. Included examples as comments to show how I profiled.
Here are the timings from the performance vignette with these changes #55
What issues are related
Related to #47
Checklist
Packages Repositories
ihmeuw-demographics R packages?
devtools::check() locally?
devtools::document()?
ihmeuw-demographics code style?
docker-base or docker-internal? If so, follow directions in those repositories to rebuild and redeploy the images.