r-lib / rlang

Low-level API for programming with R
https://rlang.r-lib.org

Add ALTREP view support #1725

Open DavisVaughan opened 6 days ago

DavisVaughan commented 6 days ago


Adds support for ALTREP contiguous views, i.e. vec_view(x, start, size), for the major R types.

A few notes on the internal structure:

/*
Structure of a `view`:
- `data1` is:
  - the original vector, before materialization
  - the materialized vector, after materialization
- `data2` is a RAWSXP holding a `r_view_metadata`
*/

struct r_view_metadata {
  // Whether or not the ALTREP view has been materialized.
  bool materialized;

  // The offset into the original data to start at.
  r_ssize start;

  // The size of the view.
  r_ssize size;

  // A read-only pointer into the data, to save indirection costs. We typically
  // set this upon view creation, unless `x` is ALTREP, in which case we delay
  // setting it until the first `DATAPTR_RO()` request, to be friendly to
  // ALTREP types that we wrap. After materialization, it is always set.
  const void* v_data_read;

  // A write-only pointer into the data, to save indirection costs.
  // Always `NULL` before materialization, and set at materialization time.
  // Never set for lists or character vectors, as write pointers are unsafe.
  void* v_data_write;
};
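
For illustration, here is a minimal sketch of how an ALTREP Elt method could consult this metadata for an integer view. The function name view_integer_elt is hypothetical and this is not the exact code in the PR; it assumes v_data_read already points at the first element of the view.

#include <Rinternals.h>
#include <R_ext/Altrep.h>

// Hypothetical sketch, not the exact implementation in this PR.
static int view_integer_elt(SEXP x, R_xlen_t i) {
  struct r_view_metadata* metadata =
    (struct r_view_metadata*) RAW(R_altrep_data2(x));

  if (metadata->v_data_read != NULL) {
    // Fast path: the cached read-only pointer, set either at creation time
    // or at materialization time, already accounts for `start`.
    const int* v_data = (const int*) metadata->v_data_read;
    return v_data[i];
  }

  // Slow path: `x` wraps an ALTREP vector and we haven't requested a dataptr
  // yet, so defer to the wrapped vector's own Elt method.
  return INTEGER_ELT(R_altrep_data1(x), metadata->start + i);
}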

Benchmarking - The most notable result is that extracting a large slice can be up to 2x slower than with a native vector, because ExtractSubset() in base R goes through the *_Elt() ALTREP method for each element. It could be much more efficient if it first used Dataptr_or_null() to check whether it can get a cheap read-only dataptr to the ALTREP data (it can here, since it's a view) and then indexed through that directly. I plan to submit that patch, or talk to Luke about it, and then this difference should go away entirely.
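
As a rough sketch of the kind of base R change I have in mind (a hypothetical helper, not an actual patch; integer vector, 1-based indices, NA handling omitted):

#include <Rinternals.h>

// Hypothetical helper, not an actual base R patch: prefer a cheap read-only
// dataptr when the ALTREP object can provide one, and only fall back to
// per-element Elt calls when it cannot.
static void extract_int_subset(SEXP x, const int* v_idx, R_xlen_t n, int* v_out) {
  // Dispatches to the Dataptr_or_null ALTREP method; a view can answer this
  // without materializing.
  const int* v_x = (const int*) DATAPTR_OR_NULL(x);

  if (v_x != NULL) {
    // Cheap path: index through the read-only pointer directly.
    for (R_xlen_t i = 0; i < n; ++i) {
      v_out[i] = v_x[v_idx[i] - 1];
    }
  } else {
    // Current behavior: one ALTREP Elt call per element.
    for (R_xlen_t i = 0; i < n; ++i) {
      v_out[i] = INTEGER_ELT(x, v_idx[i] - 1);
    }
  }
}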

Also note that the difference really only shows up when a contiguous slice is extracted with ExtractSubset(), where a native vector likely has zero cache misses while we go through an ALTREP method call for each element. When a random slice of the same size is extracted, the timings tend to converge because cache misses dominate instead.

Otherwise I think it's pretty darn good!

# R-devel at the time
getRversion()
#> [1] '4.5.0'

# load rlang
devtools::load_all(path = "~/files/r/packages/rlang/")

# Identical test
# Currently `identical()` forces materialization due to usage of `INTEGER()`
# rather than `INTEGER_RO()`, booooo.
x <- c(1, 2, 3, 4)
y <- vec_view(x, start = 1, size = 4)
identical(x, y)
#> [1] TRUE
view_is_materialized(y)
#> [1] TRUE

# Object size test
# I think lobstr is more accurate here; there is a small fixed amount of overhead
# from the metadata. Nicely, neither call materializes the object.
x <- c(1, 2, 3, 4)
y <- vec_view(x, start = 1, size = 4)
object.size(x)
#> 80 bytes
object.size(y)
#> 80 bytes
lobstr::obj_size(x)
#> 80 B
lobstr::obj_size(y)
#> 776 B
view_is_materialized(y)
#> [1] FALSE

# Small view, 100 elements
base <- sample(1e6)
x <- base[1:100]
y <- vec_view(base, start = 1L, size = 100L)

is_altrep(y)
#> [1] TRUE
view_is_materialized(y)
#> [1] FALSE

# One element
bench::mark(
  x[[25L]],
  y[[25L]],
  iterations = 1000000
)
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 x[[25L]]          0     41ns 23280892.        0B     69.8
#> 2 y[[25L]]          0     41ns 18932329.        0B     56.8

# A slice
bench::mark(
  x[10:60],
  y[10:60],
  iterations = 1000000
)
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 x[10:60]      123ns    246ns  3228295.      512B     25.8
#> 2 y[10:60]      287ns    369ns  2216498.      512B     15.5

# Still not materialized!
view_is_materialized(y)
#> [1] FALSE

# Duplication currently forces materialization; I think this is something
# we can fix in base R. It uses `INTEGER()` where I think it should use
# `INTEGER_RO()`. The returned object is not ALTREP.
z <- duplicate(y)
view_is_materialized(y)
#> [1] TRUE

is_altrep(z)
#> [1] FALSE

# A slice after materialization - still takes about the same time as before
# materialization (in both cases we use a cached read-only pointer to the data)
bench::mark(
  x[10:60],
  y[10:60],
  iterations = 1000000
)
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 x[10:60]      123ns    246ns  3180695.      512B     22.3
#> 2 y[10:60]      287ns    369ns  2222617.      512B     13.3

# Large view, 1,000,000 elements
base <- sample(1e7)
x <- base[seq(from = 1000, length.out = 1000000)]
y <- vec_view(base, start = 1000, size = 1000000)

# Large slice (still roughly 2x slower, same as with small slice)
bench::mark(
  x[1:500000],
  y[1:500000],
  iterations = 1000
)
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 x[1:5e+05] 909.54µs   1.17ms      822.    3.81MB     24.5
#> 2 y[1:5e+05]   2.19ms   2.42ms      413.    3.81MB     10.2

# Large slice, random index
# The difference goes away as cache misses tend to dominate instead
idx <- sample(1e7, 1e6)
bench::mark(
  x[idx],
  y[idx],
  iterations = 1000
)
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 x[idx]       2.19ms    2.4ms      408.    3.81MB    10.5 
#> 2 y[idx]       2.47ms   2.69ms      357.    3.81MB     8.78

# Still not materialized!
view_is_materialized(y)
#> [1] FALSE