ihmeuw-demographics / hierarchyUtils

Demographics Related Utility Functions
https://ihmeuw-demographics.github.io/hierarchyUtils/
BSD 3-Clause "New" or "Revised" License

Aggregation performance vignette #55

Closed chacalle closed 3 years ago

chacalle commented 3 years ago

Describe changes

This adds a small vignette comparing the speed of hierarchyUtils::agg to basic data.table code. As described in the vignette, for basic use cases there should be only slightly more overhead, due to the assertions and added flexibility included in hierarchyUtils.

The timings below show a much larger gap than that, which indicates there is still something slowing hierarchyUtils down that needs to be diagnosed.
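
For reference, the "basic data.table code" side of such a comparison can be sketched as below. This is a minimal illustration, not the vignette's actual code: the input table, its column names (location, age_start, sex, draw, value) and its size are all hypothetical, and only data.table and base R calls are used.

```r
library(data.table)

# Hypothetical input: one row per location, age group, sex and draw.
n_draws <- 10
dt <- CJ(
  location = 1:50,
  age_start = seq(0, 95, 5),
  sex = c("female", "male"),
  draw = seq_len(n_draws)
)
dt[, value := 1]

# Baseline aggregation: collapse the sex column into a single "all" category
# by summing values within every other identifier, and time the whole step.
timing <- system.time({
  agg_dt <- dt[, .(value = sum(value)), by = .(location, age_start, draw)]
  agg_dt[, sex := "all"]
})

print(timing)
```

Wrapping the equivalent hierarchyUtils::agg call in the same system.time() pattern would give timings that can be compared row by row.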

What issues are related

Related to #47

Checklist

Packages Repositories

Details of PR

Example table of timings

> agg_timings
   col_stem    col_type n_draws         method n_input_rows user.self sys.self elapsed
1:      age    interval       1     data.table       13,632     0.057    0.006   0.065
2:      age    interval       1 hierarchyUtils       13,632     9.307    0.236   9.665
3:      age    interval      10     data.table      136,320     0.131    0.024   0.157
4:      age    interval      10 hierarchyUtils      136,320    46.943    0.630  47.835
5:      sex categorical       1     data.table       13,632     0.008    0.000   0.008
6:      sex categorical       1 hierarchyUtils       13,632     0.568    0.009   0.581
7:      sex categorical      10     data.table      136,320     0.055    0.008   0.063
8:      sex categorical      10 hierarchyUtils      136,320     4.786    0.090   4.908
chacalle commented 3 years ago

I actually messed up the initial timings: I was using an old installed version of hierarchyUtils, from before we noticed the slowdown, which is why the initial set of timings I posted is so slow.

Here is a PDF version of the vignette with updated timings: Aggregation_Scaling performance.pdf. And here is the table version from running the code interactively:

> agg_timings
    col_stem    col_type n_draws         method n_input_rows user.self sys.self elapsed
 1:      age    interval       1     data.table       13,632     0.067    0.005   0.071
 2:      age    interval       1 hierarchyUtils       13,632     3.828    0.211   4.145
 3:      age    interval      10     data.table      136,320     0.142    0.020   0.164
 4:      age    interval      10 hierarchyUtils      136,320     3.729    0.257   4.001
 5:      age    interval     100     data.table    1,363,200     0.834    0.230   1.076
 6:      age    interval     100 hierarchyUtils    1,363,200    13.747    1.800  15.708
 7:      age    interval    1000     data.table   13,632,000     7.102    2.639   9.843
 8:      age    interval    1000 hierarchyUtils   13,632,000   104.364   20.984 127.034
 9:      sex categorical       1     data.table       13,632     0.008    0.001   0.009
10:      sex categorical       1 hierarchyUtils       13,632     0.098    0.006   0.104
11:      sex categorical      10     data.table      136,320     0.060    0.009   0.070
12:      sex categorical      10 hierarchyUtils      136,320     0.818    0.052   0.874
13:      sex categorical     100     data.table    1,363,200     0.583    0.065   0.655
14:      sex categorical     100 hierarchyUtils    1,363,200     7.335    0.463   7.995
15:      sex categorical    1000     data.table   13,632,000     5.359    0.809   6.240
16:      sex categorical    1000 hierarchyUtils   13,632,000    68.220    5.740  74.971
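
To make the gap easier to read, the elapsed times in the updated table can be turned into slowdown ratios. The snippet below is only a sketch: the numbers are copied from the table above, while the column names elapsed_dt, elapsed_hu and slowdown are made up for this illustration.

```r
library(data.table)

# Elapsed seconds copied from the updated timings table above.
timings <- data.table(
  col_stem = rep(c("age", "sex"), each = 4),
  n_draws = rep(c(1, 10, 100, 1000), times = 2),
  elapsed_dt = c(0.071, 0.164, 1.076, 9.843, 0.009, 0.070, 0.655, 6.240),
  elapsed_hu = c(4.145, 4.001, 15.708, 127.034, 0.104, 0.874, 7.995, 74.971)
)

# How many times longer hierarchyUtils takes than plain data.table.
timings[, slowdown := round(elapsed_hu / elapsed_dt, 1)]
print(timings)
```

With these numbers the sex aggregation is roughly 12 times slower at every size. The age aggregation takes about 4 seconds for both 1 and 10 draws, which suggests a fixed cost dominates the small inputs, and it drops to about 13 times slower at 1,000 draws.
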
chacalle commented 3 years ago

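I referenced this SO question in trying to understand the system.time output. What we care about, I think, is the elapsed time: that is the wall-clock time we actually wait around for. The user time is the CPU time spent running the R process's own code, and the system time is the CPU time the operating system spends on the process's behalf. When the elapsed time is less than the user time, the command used multiple cores to speed up.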

https://stackoverflow.com/questions/5688949/what-are-user-and-system-times-measuring-in-r-system-timeexp-output
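
As a small illustration of how to read these fields, the sketch below times a toy computation and pulls the three values out by name. The workload and the variable names are arbitrary choices for this example, and the final comparison is only a rough heuristic since timer resolution can blur it for very short runs.

```r
# Time a toy computation; system.time() returns a named "proc_time" vector.
x <- runif(1e6)
timing <- system.time(total <- sum(sqrt(x)))

user_cpu <- timing[["user.self"]]    # CPU time spent in the R process itself
system_cpu <- timing[["sys.self"]]   # CPU time spent by the OS on its behalf
wall_clock <- timing[["elapsed"]]    # real time we actually waited

cat("user:", user_cpu, "system:", system_cpu, "elapsed:", wall_clock, "\n")

# Rough heuristic: CPU time clearly above wall-clock time suggests that more
# than one core was busy (for example, multi-threaded data.table operations).
used_multiple_cores <- (user_cpu + system_cpu) > wall_clock
```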