Closed alistaire47 closed 2 years ago
This looks fantastic. Would you mind running them locally and posting the results with and without altrep? I'm curious what kinds of results we get.
@jonkeane They're all very fast; this is really more of a microbenchmark. But materializing altreps is usually 2-4x slower than subsetting non-altrep, which by my intuition seems good, given that it's relative to subsetting, which should be extremely fast.
type_integers:
   process     real start_mem_bytes end_mem_bytes max_mem_bytes source        exclude_nulls altrep subset_indices cpu_count
     <dbl>    <dbl>           <int>         <int>         <int> <chr>         <lgl>         <lgl>  <chr>              <int>
1 0.000393 0.000387       161890304     165478400     198524928 type_integers TRUE          FALSE  1:10                   1
2 0.000546 0.000539       161234944     164986880     197885952 type_integers TRUE          FALSE  1:10                  10
3 0.000962 0.000946       161480704     168214528     193970176 type_integers TRUE          TRUE   1:10                   1
4  0.00102  0.00100       167870464     179486720     201752576 type_integers TRUE          TRUE   1:10                  10
5  0.00266  0.00262       171540480     197246976     208142336 type_integers FALSE         FALSE  1:10                  10
6  0.00287  0.00282       168181760     193396736     204734464 type_integers FALSE         FALSE  1:10                   1
7   0.0130   0.0128       173309952     209682432     209682432 type_integers FALSE         TRUE   1:10                   1
8   0.0135   0.0141       171556864     213467136     213467136 type_integers FALSE         TRUE   1:10                  10
fanniemae_2016Q4:
  process  real gc_level0 gc_level1 gc_level2 source           exclude_nulls altrep subset_indices cpu_count
    <dbl> <dbl>     <int>     <int>     <int> <chr>            <lgl>         <lgl>  <chr>              <int>
1   0.406 0.400         0         0         2 fanniemae_2016Q4 TRUE          FALSE  1:10                   1
2   0.408 0.401         0         0         2 fanniemae_2016Q4 TRUE          FALSE  1:10                  10
3    2.13  2.10         0         0         6 fanniemae_2016Q4 TRUE          TRUE   1:10                  10
4    2.47  2.74         0         0         6 fanniemae_2016Q4 TRUE          TRUE   1:10                   1
5    3.85  5.74         0         0         3 fanniemae_2016Q4 FALSE         FALSE  1:10                  10
6    4.32  6.76         0         0         3 fanniemae_2016Q4 FALSE         FALSE  1:10                   1
7    10.6  14.3         0         0        14 fanniemae_2016Q4 FALSE         TRUE   1:10                  10
8    10.9  14.9         0         0        14 fanniemae_2016Q4 FALSE         TRUE   1:10                   1
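For anyone who wants to check the slowdown per condition, here is a minimal sketch in Python (not part of the benchmark itself) that divides each altrep `real` time by the matching non-altrep `real` time. The timings are copied from the two tables above; the `slowdown` helper and the dictionary layout are hypothetical names chosen for illustration.

```python
# `real` timings in seconds, copied from the two tables above and keyed by
# (source, exclude_nulls, cpu_count, altrep). The layout is an assumption
# made for this sketch, not something the benchmark produces.
real = {
    ("type_integers", True, 1, False): 0.000387,
    ("type_integers", True, 10, False): 0.000539,
    ("type_integers", True, 1, True): 0.000946,
    ("type_integers", True, 10, True): 0.00100,
    ("type_integers", False, 10, False): 0.00262,
    ("type_integers", False, 1, False): 0.00282,
    ("type_integers", False, 1, True): 0.0128,
    ("type_integers", False, 10, True): 0.0141,
    ("fanniemae_2016Q4", True, 1, False): 0.400,
    ("fanniemae_2016Q4", True, 10, False): 0.401,
    ("fanniemae_2016Q4", True, 10, True): 2.10,
    ("fanniemae_2016Q4", True, 1, True): 2.74,
    ("fanniemae_2016Q4", False, 10, False): 5.74,
    ("fanniemae_2016Q4", False, 1, False): 6.76,
    ("fanniemae_2016Q4", False, 10, True): 14.3,
    ("fanniemae_2016Q4", False, 1, True): 14.9,
}


def slowdown(timings):
    """Return the altrep / non-altrep time ratio for each matching condition."""
    ratios = {}
    for (source, exclude_nulls, cpu_count, altrep), seconds in timings.items():
        if not altrep:
            continue  # only altrep rows are numerators
        baseline = timings[(source, exclude_nulls, cpu_count, False)]
        ratios[(source, exclude_nulls, cpu_count)] = seconds / baseline
    return ratios


ratios = slowdown(real)
for key, ratio in sorted(ratios.items()):
    print(key, f"{ratio:.1f}x")
```

Using `real` rather than `process` is an arbitrary choice here; swapping in the `process` column works the same way.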
That's great, thanks!
Closes #38. Adds a benchmark that times how long it takes to materialize a subset or entire vector, with/without altrep enabled. Approach based on this comment, but lets NAs be handled by data sources (the type_{type} sources already have columns with varying proportions of NAs). Subsets data sources down to columns that can be stored as altrep.