[ ] Hear back about DATAPTR() usage in character implementation
[x] Extract_subset implementation? Or is the fallback good enough? (New approach is much faster even without Extract_subset due to faster Elt access with cached pointers, we really want base R to handle all the intricacies of this.)
[x] Use #defines to section off code that doesn't work in older R versions, possibly bump minimal rlang R version
[ ] Decide on r_init_library_with_dll() helper with Lionel
[ ] Support R >=3.6?
Then
[ ] Follow up PR to rename r_vec_resize() to r_vec_grow() and remove support for shrinking
[ ] Use ALTREP views in r_dyn_unwrap() in this PR
Adds support for ALTREP contiguous views, i.e. vec_view(x, start, size), for the major R types
A few notes:
x gets marked as not mutable on the way in
No attributes are allowed on x right now
The ALTREP classes are registered at load time but need the dll, so I had to introduce a new r_init_library_with_dll() helper that goes in your R_init_<pkg>() function (good to know if you are utilize the rlang C API). The ALTREP classes are registered with the package name, so if vctrs uses the rlang C API then it will register its own ALTREP view classes under the "vctrs" package name. I think this is a good thing.
Serialization will force materialization, which I think is nice because it avoids us ever having to worry about "old" ALTREP view objects if we ever change the internal structure.
Here are important notes on the internal structure:
/*
Structure of a `view`:
- `data1` is:
- the original vector, before materialization
- the materialized vector, after materialization
- `data2` is a RAWSXP holding a `r_view_metadata`
*/
struct r_view_metadata {
// Whether or not the ALTREP view has been materialized.
bool materialized;
// The offset into the original data to start at.
r_ssize start;
// The size of the view.
r_ssize size;
// A read only pointer into the data, to save indirection costs. We typically
// set this upon view creation, unless `x` is ALTREP, in which case we delay
// setting it until the first `DATAPTR_RO()` request, to be friendly to
// ALTREP types that we wrap. After materialization, it is always set.
const void* v_data_read;
// A write only pointer into the data, to save indirection costs.
// Always `NULL` before materialization, and it set at materialization time.
// Never set for lists or character vectors, as write pointers are unsafe.
void* v_data_write;
};
Benchmarking - The most notable thing in the benchmarks is that slicing with a large slice can be up to 2x slower than native code due to ExtractSubset() in base R using the *_Elt() ALTREP method to extract out each element. I think it could be much more efficient by using Dataptr_or_null() to first see if it could get a cheap readonly dataptr to the altrep data (it can here, it's a view) and then using that directly. I plan to submit that patch or talk to Luke about it, then this difference would poof go away entirely.
Also note that the difference is mostly apparent only when a contiguous slice is made with ExtractSubset(), where native objects probably have 0 cache misses, but we have to go through an ALTREP method each time. When you extract a random slice of the same size, the timings actually tend to converge and cache misses become more important.
Otherwise I think it's pretty darn good!
# R-devel at the time
getRversion()
#> [1] '4.5.0'
# load rlang
devtools::load_all(path = "~/files/r/packages/rlang/")
# Identical test
# Currently `identical()` forces materialization due to usage of `INTEGER()`
# rather than `INTEGER_RO()`, booooo.
x <- c(1, 2, 3, 4)
y <- vec_view(x, start = 1, size = 4)
identical(x, y)
#> [1] TRUE
view_is_materialized(y)
#> [1] TRUE
# Object size test
# I think lobstr is more accurate here, there is a small fixed amount of overhead
# from the metadata. Nicely neither materialize the object.
x <- c(1, 2, 3, 4)
y <- vec_view(x, start = 1, size = 4)
object.size(x)
#> 80 bytes
object.size(y)
#> 80 bytes
lobstr::obj_size(x)
#> 80 B
lobstr::obj_size(y)
#> 776 B
view_is_materialized(y)
#> [1] FALSE
# Small view, 100 elements
base <- sample(1e6)
x <- base[1:100]
y <- vec_view(base, start = 1L, size = 100L)
is_altrep(y)
#> [1] TRUE
view_is_materialized(y)
#> [1] FALSE
# One element
bench::mark(
x[[25L]],
y[[25L]],
iterations = 1000000
)
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 x[[25L]] 0 41ns 23280892. 0B 69.8
#> 2 y[[25L]] 0 41ns 18932329. 0B 56.8
# A slice
bench::mark(
x[10:60],
y[10:60],
iterations = 1000000
)
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 x[10:60] 123ns 246ns 3228295. 512B 25.8
#> 2 y[10:60] 287ns 369ns 2216498. 512B 15.5
# Still not materialized!
view_is_materialized(y)
#> [1] FALSE
# Duplication currently forces materialization, I think this is something
# we can fix in base R. They use `INTEGER()` where I think they should use
# `INTEGER_RO()`. Returned object is not ALTREP.
z <- duplicate(y)
view_is_materialized(y)
#> [1] TRUE
is_altrep(z)
#> [1] FALSE
# A slice after materialization - still takes the same amount of time as before materialization
# (in both cases we use a cached readonly pointer to the data)
bench::mark(
x[10:60],
y[10:60],
iterations = 1000000
)
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 x[10:60] 123ns 246ns 3180695. 512B 22.3
#> 2 y[10:60] 287ns 369ns 2222617. 512B 13.3
# Large view, 1,000,000 elements
base <- sample(1e7)
x <- base[seq(from = 1000, length.out = 1000000)]
y <- vec_view(base, start = 1000, size = 1000000)
# Large slice (still roughly 2x slower, same as with small slice)
bench::mark(
x[1:500000],
y[1:500000],
iterations = 1000
)
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 x[1:5e+05] 909.54µs 1.17ms 822. 3.81MB 24.5
#> 2 y[1:5e+05] 2.19ms 2.42ms 413. 3.81MB 10.2
# Large slice, random index
# The difference goes away as cache misses tend to dominate instead
idx <- sample(1e7, 1e6)
bench::mark(
x[idx],
y[idx],
iterations = 1000
)
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 x[idx] 2.19ms 2.4ms 408. 3.81MB 10.5
#> 2 y[idx] 2.47ms 2.69ms 357. 3.81MB 8.78
# Still not materialized!
view_is_materialized(y)
#> [1] FALSE
DATAPTR()
usage in character implementationExtract_subset
implementation? Or is the fallback good enough? (New approach is much faster even withoutExtract_subset
due to fasterElt
access with cached pointers, we really want base R to handle all the intricacies of this.)#define
s to section off code that doesn't work in older R versions, possibly bump minimal rlang R versionr_init_library_with_dll()
helper with LionelThen
r_vec_resize()
tor_vec_grow()
and remove support for shrinkingr_dyn_unwrap()
in this PRAdds support for ALTREP contiguous views, i.e.
vec_view(x, start, size)
, for the major R typesA few notes:
x
gets marked as not mutable on the way inx
right nowr_init_library_with_dll()
helper that goes in yourR_init_<pkg>()
function (good to know if you are utilize the rlang C API). The ALTREP classes are registered with thepackage
name, so if vctrs uses the rlang C API then it will register its own ALTREP view classes under the"vctrs"
package name. I think this is a good thing.Here are important notes on the internal structure:
Benchmarking - The most notable thing in the benchmarks is that slicing with a large slice can be up to 2x slower than native code due to
ExtractSubset()
in base R using the*_Elt()
ALTREP method to extract out each element. I think it could be much more efficient by usingDataptr_or_null()
to first see if it could get a cheap readonly dataptr to the altrep data (it can here, it's a view) and then using that directly. I plan to submit that patch or talk to Luke about it, then this difference would poof go away entirely.Also note that the difference is mostly apparent only when a contiguous slice is made with
ExtractSubset()
, where native objects probably have 0 cache misses, but we have to go through an ALTREP method each time. When you extract a random slice of the same size, the timings actually tend to converge and cache misses become more important.Otherwise I think it's pretty darn good!