voltrondata-labs / arrowbench

R package for benchmarking

Disable altrep when setting up the df to table benchmark #45

Closed jonkeane closed 3 years ago

jonkeane commented 3 years ago

When we enabled altrep, the df to table benchmarks started taking longer. This is because the setup code here created data frames whose columns were altrep vectors backed by arrow arrays, rather than fully materialized R vectors. So when the benchmark then converted R->arrow, the timed step had to materialize those columns first, and we were actually measuring both arrow->R and R->arrow.
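For illustration, here is a minimal sketch of the idea, not the actual arrowbench setup code. It assumes the arrow R package is installed and that `arrow.use_altrep` is the option controlling altrep-backed conversion; the toy table and the variable names are hypothetical.

```r
library(arrow)

# Hypothetical stand-in for the benchmark source data.
tab <- Table$create(data.frame(x = c(1.5, 2.5, 3.5)))

# Setup step: turn altrep off (assumed option name: arrow.use_altrep) so the
# data frame columns are plain R vectors, not wrappers around arrow arrays.
old <- options(arrow.use_altrep = FALSE)
df <- as.data.frame(tab)
options(old)  # restore the previous setting once setup is done

# Timed step: with plain R vectors in `df`, this covers only R->arrow.
timing <- system.time(result <- Table$create(df))
```

Restoring the option after setup keeps the change scoped to the setup code, so whatever altrep setting the benchmark run itself uses is left alone.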

jonkeane commented 3 years ago

I confirmed that this works locally by adding an option and running the benchmarks with altrep enabled and disabled in the setup (and before/after the commit that enabled altreps here).

# A tibble: 16 × 5
# Groups:   lib_path, source [8]
   lib_path                    source        use_altrep_setup process_median real_median
   <chr>                       <chr>         <lgl>                     <dbl>       <dbl>
 1 remote-apache/arrow@425b1cb type_dict     FALSE                    0.111      0.0290 
 2 remote-apache/arrow@425b1cb type_dict     TRUE                     0.111      0.0298 
 3 remote-apache/arrow@HEAD    type_dict     FALSE                    0.104      0.0275 
 4 remote-apache/arrow@HEAD    type_dict     TRUE                     0.104      0.0271 
 5 remote-apache/arrow@425b1cb type_floats   FALSE                    0.0179     0.00496
 6 remote-apache/arrow@425b1cb type_floats   TRUE                     0.0267     0.0129 
 7 remote-apache/arrow@HEAD    type_floats   FALSE                    0.0170     0.00465
 8 remote-apache/arrow@HEAD    type_floats   TRUE                     0.105      0.105  
 9 remote-apache/arrow@425b1cb type_integers FALSE                    0.0107     0.00363
10 remote-apache/arrow@425b1cb type_integers TRUE                     0.0198     0.0108 
11 remote-apache/arrow@HEAD    type_integers FALSE                    0.0104     0.00344
12 remote-apache/arrow@HEAD    type_integers TRUE                     0.0921     0.0924 
13 remote-apache/arrow@425b1cb type_strings  FALSE                    0.638      0.640  
14 remote-apache/arrow@425b1cb type_strings  TRUE                     0.630      0.633  
15 remote-apache/arrow@HEAD    type_strings  FALSE                    0.603      0.604  
16 remote-apache/arrow@HEAD    type_strings  TRUE                     0.615      0.616 

The critical change is that @HEAD is slower when use_altrep_setup = TRUE and faster when use_altrep_setup = FALSE (and the faster timing is about the same as at @425b1cb, before altrep was around). The effect shows up for floats and integers; dict and strings are essentially unchanged. For example, for floats:

  lib_path                    source      use_altrep_setup process_median real_median
  <chr>                       <chr>       <lgl>                     <dbl>       <dbl>
1 remote-apache/arrow@425b1cb type_floats FALSE                    0.0179     0.00496
2 remote-apache/arrow@425b1cb type_floats TRUE                     0.0267     0.0129 
3 remote-apache/arrow@HEAD    type_floats FALSE                    0.0170     0.00465
4 remote-apache/arrow@HEAD    type_floats TRUE                     0.105      0.105
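To put a number on the difference, the ratio of the two setups can be computed per commit. This is just arithmetic on the real_median values copied from the floats rows above; the data frame and variable names are made up for the example.

```r
# real_median values for type_floats, copied from the table above.
floats <- data.frame(
  lib_path = c("425b1cb", "425b1cb", "HEAD", "HEAD"),
  use_altrep_setup = c(FALSE, TRUE, FALSE, TRUE),
  real_median = c(0.00496, 0.0129, 0.00465, 0.105)
)

# For each commit, divide the altrep-setup timing by the plain-setup timing.
slowdown <- sapply(split(floats, floats$lib_path), function(d) {
  d$real_median[d$use_altrep_setup] / d$real_median[!d$use_altrep_setup]
})

print(round(slowdown, 1))
```

This gives a ratio of about 2.6 at 425b1cb and about 22.6 at HEAD, while the use_altrep_setup = FALSE timings stay close to each other across the two commits (0.00496 vs 0.00465).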