tdhock / atime

Asymptotic timing
2 stars 3 forks source link

atime: Asymptotic Timing

| [[file:tests/testthat][tests]] | [[https://github.com/tdhock/atime/actions][https://github.com/tdhock/atime/workflows/R-CMD-check/badge.svg]] | | [[https://github.com/jimhester/covr][coverage]] | [[https://app.codecov.io/gh/tdhock/atime?branch=main][https://codecov.io/gh/tdhock/atime/branch/main/graph/badge.svg]] |

** Installation

+BEGIN_SRC R

Install last released version from CRAN:

install.packages("atime")

Install latest version from GitHub:

if(!require("remotes"))install.packages("remotes") remotes::install_github("tdhock/atime")

+END_SRC

** Usage

The main function is =atime= for which you can specify these arguments:

+BEGIN_SRC R

When studying asymptotic complexity, always provide sizes on a log

scale (10^sequence) as below:

(subject.size.vec <- unique(as.integer(10^seq(0,3.5,l=100))))

Compute asymptotic time and memory measurement:

atime.list <- atime::atime( N=subject.size.vec,#vector of sizes. setup={#Run for each size, before timings: subject <- paste(rep("a", N), collapse="") pattern <- paste(rep(c("a?", "a"), each=N), collapse="") }, times=10,#number of timings to compute for each expression. seconds.limit=0.1,#max seconds per expression.

Different expressions which will be evaluated for each size N:

PCRE.match=regexpr(pattern, subject, perl=TRUE),
TRE.match=regexpr(pattern, subject, perl=FALSE),
constant.replacement=gsub("a","constant size replacement",subject),
linear.replacement=gsub("a",subject,subject))

atime.list plot(atime.list)

Compute and plot asymptotic reference lines:

(best.list <- atime::references_best(atime.list)) plot(best.list)

Compute and plot data size N for given time/memory.

pred.list <- predict(best.list, seconds=1e-2, kilobytes=10) plot(pred.list)

+END_SRC

*** Time/memory comparison overview

On my machine I got the following results:

+begin_src R

(subject.size.vec <- unique(as.integer(10^seq(0,3.5,l=100)))) [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 [16] 17 18 20 22 23 25 28 30 33 35 38 42 45 49 53 [31] 58 63 68 74 81 87 95 103 112 121 132 143 155 168 183 [46] 198 215 233 253 275 298 323 351 380 413 448 486 527 572 620 [61] 673 730 792 859 932 1011 1097 1190 1291 1401 1519 1648 1788 1940 2104 [76] 2283 2477 2687 2915 3162

+end_src

The vector above is the sequence of sizes N, used with each expression, to measure time and memory. When studying asymptotic complexity, always provide sizes on a log scale as above.

+begin_src R

atime.list atime list with 228 measurements for PCRE.match(N=1 to 20) TRE.match(N=1 to 275) constant.replacement(N=1 to 3162) linear.replacement(N=1 to 3162)

+end_src

The output above shows the min and max N values that were run for each of the expressions. In this case =constant.replacement= and =linear.replacement= were run all the way up to the max size (3162), but =TRE.match= only went up to 20, and =PCRE.match= only went up to 275, because no larger N values are considered after the median time for a given N has has exceeded =seconds.limit= which is 0.1 above. This behavior ensures that total time taken by =atime= will be about seconds.limit times number of expressions (times is the number of times each expression is evaluated at each data size). The output of the plot method for this =atime= result list is shown below,

+begin_src R

plot(atime.list)

+end_src

[[file:README-figure-compare.png]]

The plot above facilitates comparing the time and memory of the different expressions, and makes it easy to see the ranking of different algorithms, but it does not show the asymptotic complexity class.

*** Asymptotic complexity class estimation

To estimate the asymptotic complexity class, use the code below:

+begin_src R

(best.list <- atime::references_best(atime.list)) references_best list with 456 measurements, best fit complexity: constant.replacement (N kilobytes, N seconds) linear.replacement (N^2 kilobytes, N^2 seconds) PCRE.match (2^N seconds) TRE.match (N^3 seconds)

+end_src

The output above shows the best fit asymptotic time complexity for each expression. To visualize the results you can do:

+BEGIN_SRC R

plot(best.list)

+END_SRC

[[file:README-figure.png]]

The plot above shows the timings of each expression as a function of data size N (black), as well as the two closest asymptotic reference lines (violet, one smaller, one larger). If you have chosen N and seconds.limit appropriately for your problem (as we have in this case) then you should be able to observe the following:

*** Highlight N for given time/memory

When comparing algorithms in terms of computational resources, we can either

We can do both using the code below,

+begin_src R

atime.list[["measurements"]][N==323, .(expr.name, seconds=median, kilobytes)] expr.name seconds kilobytes

1: TRE.match 0.0678032 0.0000 2: constant.replacement 0.0000667 7.9375 3: linear.replacement 0.0002435 101.9375 pred.list <- predict(best.list, seconds=1e-2, kilobytes=10) pred.list[["prediction"]] unit expr.name unit.value N 1: seconds PCRE.match 0.01 17.82348 2: seconds TRE.match 0.01 168.46338 3: seconds linear.replacement 0.01 2069.38604 4: kilobytes constant.replacement 10.00 407.55220 5: kilobytes linear.replacement 10.00 100.92007 plot(pred.list) #+end_src

[[file:README-predict.png]]

** GitHub action for continuous performance testing

[[https://github.com/marketplace/actions/autocomment-atime-results][autocomment-atime-results]] is a GitHub action which will run atime for every pull request, and plot results in a PR comment, so you can see if the PR affects performance (examples: [[https://github.com/Anirban166/binsegRcpp/pull/2#issuecomment-1986146565][binsegRcpp]], [[https://github.com/Rdatatable/data.table/pull/5427#issuecomment-2075471806][data.table]]). First, you should define a =.ci/atime/tests.R= code file that creates an R object called =test.list= which should be a list of performance tests, each one is a list of arguments that will be passed to =atime_versions=, see [[https://github.com/tdhock/binsegRcpp/blob/atime-test-funs/.ci/atime/tests.R][atime-test-funs branch of binsegRcpp repo]] for an example, and see [[https://github.com/tdhock/atime/blob/main/man/atime_pkg.Rd][?atime_pkg]] for documentation.

** Related work

[[https://cloud.r-project.org/web/packages/bench/][bench]]::press does something similar, and is more flexible because it can do multi-dimensional grid search (not only over a single size N argument as atime does). However it can not store results if check=FALSE, results must be equal if check=TRUE, and there is no way to easily specify a time limit which stops for larger sizes (like seconds.limit argument in atime).

[[https://github.com/Anirban166/testComplexity][testComplexity]]::asymptoticTimings does something similar, but only for one expression (not several), and there is no special setup argument like atime (which means that the timing must include data setup code which may be irrelevant).