ocbe-uio / contingencytables

Statistical Analysis of Contingency Tables
https://ocbe-uio.github.io/contingencytables/
GNU General Public License v3.0
3 stars 2 forks source link

Optimize convertFunName2Method() #49

Closed wleoncio closed 1 year ago

wleoncio commented 1 year ago

Looks like the callstack can get really long, which makes dispatching using this function very slow. One idea would be to replace the if-else statements with switch(). Perhaps simplifying the callstack object (from a list to a vector, for example) also helps.

Example:

Exact_cond_midP_linear_rank_tests_2xc(fontanella_2008)
wleoncio commented 1 year ago

Using Exact_cond_midP_unspecific_ordering_rx2(t(fontanella_2008), "decreasing") as an example, this is the gold standard profile, achieved when calc_prob.ExactCond_unspecific() is called directly:

$by.self
                                          self.time self.pct total.time total.pct
"test_statistic"                              33.92    65.36      41.96     80.85
"calc_prob.ExactCond_unspecific"               3.34     6.44       3.46      6.67
"calc_Pvalue_4x2.ExactCond_unspecific"         3.10     5.97      51.88     99.96
"rbind"                                        2.62     5.05       3.22      6.20
"rowSums"                                      1.96     3.78       2.82      5.43
"matrix"                                       1.64     3.16       1.70      3.28
"c"                                            0.74     1.43       0.74      1.43
"=="                                           0.62     1.19       0.62      1.19
"%in%"                                         0.54     1.04       0.72      1.39
":"                                            0.52     1.00       0.52      1.00
"is.data.frame"                                0.44     0.85       0.44      0.85
"^"                                            0.42     0.81       0.42      0.81
"/"                                            0.32     0.62       0.32      0.62
"+"                                            0.32     0.62       0.32      0.62
"prod"                                         0.30     0.58       0.30      0.58
"("                                            0.26     0.50       0.26      0.50
"-"                                            0.22     0.42       0.22      0.42
">"                                            0.22     0.42       0.22      0.42
"*"                                            0.16     0.31       0.16      0.31
"is.array"                                     0.06     0.12       0.06      0.12
"is.atomic"                                    0.06     0.12       0.06      0.12
"<"                                            0.04     0.08       0.04      0.08
"length"                                       0.04     0.08       0.04      0.08
"Exact_cond_midP_unspecific_ordering_rx2"      0.02     0.04      51.90    100.00
"dimnames"                                     0.02     0.04       0.02      0.04

$by.total
                                          total.time total.pct self.time self.pct
"Exact_cond_midP_unspecific_ordering_rx2"      51.90    100.00      0.02     0.04
"calc_Pvalue_4x2.ExactCond_unspecific"         51.88     99.96      3.10     5.97
"test_statistic"                               41.96     80.85     33.92    65.36
"calc_prob.ExactCond_unspecific"                3.46      6.67      3.34     6.44
"rbind"                                         3.22      6.20      2.62     5.05
"rowSums"                                       2.82      5.43      1.96     3.78
"matrix"                                        1.70      3.28      1.64     3.16
"c"                                             0.74      1.43      0.74     1.43
"%in%"                                          0.72      1.39      0.54     1.04
"=="                                            0.62      1.19      0.62     1.19
":"                                             0.52      1.00      0.52     1.00
"is.data.frame"                                 0.44      0.85      0.44     0.85
"^"                                             0.42      0.81      0.42     0.81
"/"                                             0.32      0.62      0.32     0.62
"+"                                             0.32      0.62      0.32     0.62
"prod"                                          0.30      0.58      0.30     0.58
"("                                             0.26      0.50      0.26     0.50
"-"                                             0.22      0.42      0.22     0.42
">"                                             0.22      0.42      0.22     0.42
"*"                                             0.16      0.31      0.16     0.31
"is.array"                                      0.06      0.12      0.06     0.12
"is.atomic"                                     0.06      0.12      0.06     0.12
"<"                                             0.04      0.08      0.04     0.08
"length"                                        0.04      0.08      0.04     0.08
"dimnames"                                      0.02      0.04      0.02     0.04

$sample.interval
[1] 0.02

$sampling.time
[1] 51.9

As a matter of fact, we can measure the overhead of convertFunName2Method() by adding a breakpoint on calc_Pvalue_4x2.ExactCond_unspecific() and running the following:

Browse[1]> microbenchmark::microbenchmark(
             "method" = calc_prob.ExactCond_unspecific(x[, 1], 4, N_choose_np1, nip_choose_xi1),
             "generic" = calc_prob(x[, 1], 4, N_choose_np1, nip_choose_xi1)
           )
Unit: microseconds
    expr    min      lq     mean  median     uq     max neval cld
  method  4.017  4.3655  4.84964  4.5275  4.776  16.109   100  a 
 generic 76.810 78.1670 81.29589 80.0300 81.728 165.326   100   b

So the dispatching adds a horrifying 20x time penalty to calc_prob() in this example.

wleoncio commented 1 year ago

Functions involved in the convertFunName2Method() dispatching taking most total time:

$by.total
                                          total.time total.pct self.time self.pct
"convertFunName2Method"                        45.20     44.38      4.02     3.95
"gsub"                                         29.14     28.61      9.58     9.41
"is.factor"                                    19.54     19.19      0.52     0.51
"as.character"                                 17.68     17.36     17.68    17.36
"findInCallstack"                              11.76     11.55      6.08     5.97
"match"                                         4.94      4.85      4.94     4.85

So these should be the first targets for further optimization.