voltrondata-labs / arrowbench

R package for benchmarking

[ENG-3634] Add benchmarks that materialize altrep-backed vectors #89

alistaire47 closed 2 years ago

alistaire47 commented 2 years ago

Closes #38. Adds a benchmark that times how long it takes to materialize either a subset of a vector or the entire vector, with and without altrep enabled. The approach is based on this comment, but lets the data sources handle NAs (the type_{type} sources already have columns with varying proportions of NAs). Subsets data sources down to the columns that can be stored as altrep.

alistaire47 commented 2 years ago

This looks fantastic. Would you mind running them locally and posting the results with and without altrep? I'm curious what kinds of results we get.

@jonkeane They're all very fast; this is really more of a microbenchmark. But materializing altreps is generally 2-4x slower than subsetting non-altrep vectors, which by my intuition seems good, given that it's relative to subsetting, which should be extremely fast.

type_integers:

   process     real start_mem_bytes end_mem_bytes max_mem_bytes source        exclude_nulls altrep subset_indices cpu_count
     <dbl>    <dbl>           <int>         <int>         <int> <chr>         <lgl>         <lgl>  <chr>              <int>
1 0.000393 0.000387       161890304     165478400     198524928 type_integers TRUE          FALSE  1:10                   1
2 0.000546 0.000539       161234944     164986880     197885952 type_integers TRUE          FALSE  1:10                  10
3 0.000962 0.000946       161480704     168214528     193970176 type_integers TRUE          TRUE   1:10                   1
4 0.00102  0.00100        167870464     179486720     201752576 type_integers TRUE          TRUE   1:10                  10
5 0.00266  0.00262        171540480     197246976     208142336 type_integers FALSE         FALSE  1:10                  10
6 0.00287  0.00282        168181760     193396736     204734464 type_integers FALSE         FALSE  1:10                   1
7 0.0130   0.0128         173309952     209682432     209682432 type_integers FALSE         TRUE   1:10                   1
8 0.0135   0.0141         171556864     213467136     213467136 type_integers FALSE         TRUE   1:10                  10

fanniemae_2016Q4:

  process   real gc_level0 gc_level1 gc_level2 source           exclude_nulls altrep subset_indices cpu_count
    <dbl>  <dbl>     <int>     <int>     <int> <chr>            <lgl>         <lgl>  <chr>              <int>
1   0.406  0.400         0         0         2 fanniemae_2016Q4 TRUE          FALSE  1:10                   1
2   0.408  0.401         0         0         2 fanniemae_2016Q4 TRUE          FALSE  1:10                  10
3   2.13   2.10          0         0         6 fanniemae_2016Q4 TRUE          TRUE   1:10                  10
4   2.47   2.74          0         0         6 fanniemae_2016Q4 TRUE          TRUE   1:10                   1
5   3.85   5.74          0         0         3 fanniemae_2016Q4 FALSE         FALSE  1:10                  10
6   4.32   6.76          0         0         3 fanniemae_2016Q4 FALSE         FALSE  1:10                   1
7  10.6   14.3           0         0        14 fanniemae_2016Q4 FALSE         TRUE   1:10                  10
8  10.9   14.9           0         0        14 fanniemae_2016Q4 FALSE         TRUE   1:10                   1
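
The slowdown implied by the two tables above can be checked with a few lines of base R. This is only a sketch: the `real` values are copied by hand from the tables, and the data frame layout and the names `timings`, `ratios`, and `slowdown` are made up for illustration and are not part of arrowbench.

```r
# Wall-clock `real` timings copied from the two tables above.
# Row order within each source: exclude_nulls TRUE then FALSE,
# altrep FALSE then TRUE, cpu_count 1 then 10.
timings <- data.frame(
  source        = rep(c("type_integers", "fanniemae_2016Q4"), each = 8),
  exclude_nulls = rep(rep(c(TRUE, FALSE), each = 4), times = 2),
  altrep        = rep(rep(c(FALSE, TRUE), each = 2), times = 4),
  cpu_count     = rep(c(1L, 10L), times = 8),
  real = c(
    0.000387, 0.000539, 0.000946, 0.00100,  # type_integers, exclude_nulls = TRUE
    0.00282,  0.00262,  0.0128,   0.0141,   # type_integers, exclude_nulls = FALSE
    0.400,    0.401,    2.74,     2.10,     # fanniemae_2016Q4, exclude_nulls = TRUE
    6.76,     5.74,     14.9,     14.3      # fanniemae_2016Q4, exclude_nulls = FALSE
  )
)

# Pair each altrep run with the matching non-altrep run.
keys <- c("source", "exclude_nulls", "cpu_count")
with_altrep    <- timings[timings$altrep,  c(keys, "real")]
without_altrep <- timings[!timings$altrep, c(keys, "real")]
ratios <- merge(with_altrep, without_altrep, by = keys,
                suffixes = c("_altrep", "_plain"))

# How many times slower the altrep run is than its non-altrep counterpart.
ratios$slowdown <- ratios$real_altrep / ratios$real_plain
print(ratios[order(ratios$slowdown), ], digits = 3)
```

For the runs posted here, the per-pair ratios fall between roughly 1.9x and 6.9x; the largest ratios are for fanniemae_2016Q4 with exclude_nulls = TRUE.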

jonkeane commented 2 years ago

That's great, thanks!