wlandau opened 5 years ago
The current solution from this post spends a ton of time slicing the serialized data. Ideally, we should work around this so the speed approaches what we would expect from writeBin(). The following profiling study requires jointprof, profile, and pprof.
write_bin <- function(
  value,
  filename,
  chunk_size = 2L^20L,
  compress = FALSE
) {
  total_size <- length(value)
  start_byte <- 0L
  while (start_byte < total_size) {
    end_byte <- min(start_byte + chunk_size, total_size) - 1L
    this_chunk <- value[seq(start_byte, end_byte, by = 1)]
    con <- con_bin(filename, compress, "a+b")
    writeBin(this_chunk, con)
    close(con)
    start_byte <- start_byte + chunk_size
  }
}

con_bin <- function(filename, compress, open) {
  if (compress) {
    gzfile(filename, open)
  } else {
    file(filename, open)
  }
}
nrow <- 1e7
ncol <- 30
data <- as.data.frame(matrix(runif(nrow * ncol), nrow = nrow, ncol = ncol))
obj <- serialize(data, NULL, ascii = FALSE, xdr = TRUE)
library(profile)
library(jointprof)
rprof_file <- tempfile()
proto_file <- tempfile()
Rprof(filename = rprof_file)
write_bin(obj, tempfile())
Rprof(NULL)
data <- read_rprof(rprof_file)
write_pprof(data, proto_file)
file.copy(proto_file, "~/Downloads/profile.proto")
system2(
  find_pprof(),
  c(
    "-http",
    "0.0.0.0:8888",
    proto_file
  )
)
Unsurprisingly, things seem to speed up if we increase the chunk size. Maybe that's good enough for a first implementation. Later, we might think about diving into https://github.com/wch/r-source/blob/81b9a6700d78ab73a5c948bf6d96d1701ab48039/src/main/connections.c#L4387 and adding start and stop bytes to do_writebin().
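A quick way to compare chunk sizes (my own sketch, not the benchmark above; it reuses write_bin() and obj as defined earlier):

# Rough chunk-size comparison: larger chunks mean fewer slices and fewer
# connection open/close cycles.
for (power in c(10, 15, 20, 24)) {
  path <- tempfile()
  cat("chunk_size = 2 ^", power, "\n")
  print(system.time(write_bin(obj, path, chunk_size = 2^power)))
  unlink(path)
}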
If we decrease the chunk size to 2^10, we spend a lot of time opening and closing connections. The StackOverflow post reported that a new connection was required for each append step, but I have not attempted to replicate this myself.
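It may be worth re-testing that claim. A sketch (mine, untested against the reported behavior, and using 1-based slicing) that keeps a single connection open for all chunks:

# Open the connection once and append every chunk to it, instead of
# reopening the file in "a+b" mode on each iteration.
write_bin_one_con <- function(value, filename, chunk_size = 2L^20L) {
  con <- file(filename, "wb")
  on.exit(close(con))
  total_size <- length(value)
  start_byte <- 1
  while (start_byte <= total_size) {
    end_byte <- min(start_byte + chunk_size - 1, total_size)
    writeBin(value[seq(start_byte, end_byte)], con)
    start_byte <- end_byte + 1
  }
}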
...which makes me think we should make our slices as large as possible and avoid creating large index vectors. We may be able to use utils::head(), utils::tail(), and/or the size argument of base::writeBin().
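For concreteness, the idea would look something like this (my sketch):

# Peel off leading bytes without building an explicit index vector at the
# call site. Note that rest still copies the whole remainder of obj.
chunk <- utils::head(obj, n = 2L^20L)      # first chunk
rest  <- utils::tail(obj, n = -(2L^20L))   # everything after it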
Nope: head() and tail() use seq_len() and seq.int(), respectively, and they do not get around ordinary slicing. Performance reflects this (benchmarks not shown). That probably means we need to either

1. stick with chunked calls to writeBin(), or
2. write our own version of writeBin() that allows users to choose the start and end bytes.

cc @gmbecker, @clarkfitzg.
Well, (2) seems unlikely to work because the restriction here is the maximum value of an R integer (2^31 - 1).
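For the object above, that limit is already binding:

.Machine$integer.max
#> 2147483647
# length(obj) is roughly 1e7 * 30 * 8 bytes of data plus a small header,
# i.e. about 2.4e9 bytes, so it exceeds .Machine$integer.max.
length(obj) > .Machine$integer.max
#> TRUE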
I wonder if we could write compiled code to slice the raw vector more efficiently. If we know the start and end bytes, it seems like this should just be constant-time pointer arithmetic. Then, we could use base::writeBin() on the slices as before.
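A sketch of that idea using Rcpp (my own, not from this thread; slice_raw() is a hypothetical helper and does no bounds checking):

# Compute the offset with pointer arithmetic and copy the chunk with a
# single memcpy instead of allocating an R index vector.
library(Rcpp)

cppFunction(includes = "#include <cstring>", code = '
SEXP slice_raw(SEXP x, double start, double end) {
  // start and end are 1-based, passed as doubles so they can exceed INT_MAX
  R_xlen_t from = (R_xlen_t) start - 1;
  R_xlen_t n = (R_xlen_t) end - from;
  SEXP out = PROTECT(Rf_allocVector(RAWSXP, n));
  std::memcpy(RAW(out), RAW(x) + from, (size_t) n);
  UNPROTECT(1);
  return out;
}
')

chunk <- slice_raw(obj, 1, 2^20)  # first mebibyte, no index vector allocated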
I also encountered the long vector issue in writeBin() and am using your write_bin() function. Thanks! Two minor comments in case you're interested:

First, I think the start_byte and end_byte values might be wrong (of course this doesn't matter if you're only after benchmarking). I'm using
start_byte <- 1L
end_byte <- min(start_byte + chunk_size - 1L, total_size)
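Assembled into a complete function, the fix might look like this (my sketch, not the commenter's exact code; write_bin_fixed() is a hypothetical name):

# Chunked writer with 1-based indexing. total_size is a double for long
# vectors, so the index arithmetic stays in doubles and cannot overflow.
write_bin_fixed <- function(value, filename, chunk_size = 2L^20L, compress = FALSE) {
  total_size <- length(value)
  start_byte <- 1
  while (start_byte <= total_size) {
    end_byte <- min(start_byte + chunk_size - 1, total_size)
    this_chunk <- value[seq(start_byte, end_byte)]
    con <- con_bin(filename, compress, "a+b")
    writeBin(this_chunk, con)
    close(con)
    start_byte <- end_byte + 1
  }
}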
Second, looking at the first benchmark, quite a bit of time is being spent in seq(). To me, this is a bit of a surprise, as compact ALTREP sequences should be blazing fast, no? I thought maybe seq() did not generate a compact sequence, and used seq.int() myself instead. But it turns out that
x <- seq(2^10, 2^12)
.Internal(inspect(x))
#> @7f86765999e8 13 INTSXP g1c0 [MARK,NAM(3)] 1024 : 4096 (compact)
Did you use R < 3.5.0 for your benchmarks? Do you by any chance know how benchmarking ALTREP objects works? I mean, seq()/seq.int() should return quickly, even for long sequences, but a costly memory allocation might occur at a later stage. Is the time spent on that allocation then somehow assigned to the function that created the ALTREP object?
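One way to probe this (my own sketch, assuming seq() yields a compact sequence here): time creation and the first expensive use separately, so any deferred allocation shows up in the second measurement rather than in seq() itself.

# Creating the compact sequence should be near-instant; the allocation cost
# should land on the operation that materializes a full-length result.
system.time(ix <- seq(1, 1e8))   # compact ALTREP sequence
system.time(y <- ix + 0L)        # allocates a full 1e8-element result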
Feel free to take write_bin() and run with it. Now that we have https://github.com/fstpackage/fst/issues/201, I am no longer pursuing it. I agree that the time spent in seq() is counterintuitive given that I was using R >= 3.5.0 (I do not remember the patch version).
By the way, there is a chance that long vector support for writeBin() may arrive in R 4.0.0 (https://github.com/HenrikBengtsson/Wishlist-for-R/issues/97#issuecomment-563441539). If that happens, we can avoid repacking large objects.
https://github.com/richfitz/storr/blob/0c64f1efb0574b46059a962de9d264a74d448843/R/hash.R#L97-L98
To preserve the old behavior in R 3.x.x, I propose we make a decision in .onLoad() about what to do with raw vectors and then cache the result as a logical in a local package environment (as done similarly here).
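Something along these lines (a sketch with hypothetical names; the real decision logic would live in the package):

# Decide once at load time whether writeBin() needs the chunked fallback,
# and cache the answer in a package-local environment.
.storr_env <- new.env(parent = emptyenv())

.onLoad <- function(libname, pkgname) {
  # R >= 4.0.0 is expected to handle long raw vectors in writeBin() directly.
  .storr_env$chunk_raw <- getRversion() < "4.0.0"
}

write_raw <- function(value, filename) {
  if (.storr_env$chunk_raw && length(value) > .Machine$integer.max) {
    write_bin(value, filename)  # chunked fallback from this thread
  } else {
    con <- file(filename, "wb")
    on.exit(close(con))
    writeBin(value, con)
  }
}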
Continuing from #106. I propose we try this solution and append to RDS files in chunks using writeBin(). From the benchmarks in https://github.com/richfitz/storr/issues/103#issuecomment-502097274, I think it will be well worth the trouble. I will work on a PR.
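For the PR, a minimal round-trip check might look like this (my sketch, using the write_bin_fixed() helper from the 1-based indexing fix above):

# Round-trip sanity check for the chunked writer on a small object.
path <- tempfile(fileext = ".rds")
bytes <- serialize(mtcars, NULL, ascii = FALSE, xdr = TRUE)
write_bin_fixed(bytes, path)
stopifnot(identical(unserialize(readBin(path, raw(), n = length(bytes))), mtcars))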