richfitz / storr

:package: Object cacher for R
http://richfitz.github.io/storr

writeBin() in chunks #107

Open wlandau opened 5 years ago

wlandau commented 5 years ago

Continuing from #106. I propose we try this solution and append to RDS files in chunks using writeBin(). From the benchmarks in https://github.com/richfitz/storr/issues/103#issuecomment-502097274, I think it will be well worth the trouble. I will work on a PR.
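For context, the limitation being worked around is that writeBin() cannot write a raw vector longer than 2^31 - 1 bytes in a single call. A minimal way to see the failure is roughly the following (it needs a bit over 2 GB of memory to run):

x <- raw(2^31)            # one byte past the 2^31 - 1 limit
writeBin(x, tempfile())   # errors on R versions without long vector support in writeBin()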

wlandau commented 5 years ago

The current solution from this post spends a ton of time slicing the serialized data. Ideally, we should work around this so the speed approaches what we would expect from writeBin(). The following profiling study requires jointprof, profile, and pprof.

write_bin <- function(
  value,
  filename,
  chunk_size = 2L^20L,
  compress = FALSE
) {
  total_size <- length(value)
  start_byte <- 0L
  while (start_byte < total_size) {
    # Take bytes start_byte + 1 through end_byte (1-based indexing).
    end_byte <- min(start_byte + chunk_size, total_size)
    this_chunk <- value[seq(start_byte + 1, end_byte, by = 1)]
    # Reopen the connection in append mode for every chunk.
    con <- con_bin(filename, compress, "a+b")
    writeBin(this_chunk, con)
    close(con)
    start_byte <- start_byte + chunk_size
  }
}

con_bin <- function(filename, compress, open) {
  # Open either a gzip-compressed or a plain binary file connection.
  if (compress) {
    gzfile(filename, open)
  } else {
    file(filename, open)
  }
}
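As a quick sanity check (separate from the profiling below), a small object can be written in several chunks and read back to confirm the bytes survive intact; the variable names here are just illustrative:

small <- serialize(runif(1e5), NULL)
path <- tempfile()
write_bin(small, path, chunk_size = 2L^16L)
identical(readBin(path, what = "raw", n = length(small)), small)  # should be TRUE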

nrow <- 1e7
ncol <- 30
data <- as.data.frame(matrix(runif(nrow * ncol), nrow = nrow, ncol = ncol))
obj <- serialize(data, NULL, ascii = FALSE, xdr = TRUE)

library(profile)
library(jointprof)

rprof_file <- tempfile()
proto_file <- tempfile()
Rprof(filename = rprof_file)
write_bin(obj, tempfile())
Rprof(NULL)
data <- read_rprof(rprof_file)
write_pprof(data, proto_file)
file.copy(proto_file, "~/Downloads/profile.proto")
system2(
  find_pprof(),
  c(
    "-http",
    "0.0.0.0:8888",
    proto_file
  )
)

(Screenshot: pprof visualization of the write_bin() profile with the default chunk_size of 2^20.)

Unsurprisingly, things seem to speed up as the chunks get larger. Maybe that's good enough for a first implementation. Later, we might think about diving into https://github.com/wch/r-source/blob/81b9a6700d78ab73a5c948bf6d96d1701ab48039/src/main/connections.c#L4387 and adding start and stop byte arguments to do_writebin().

wlandau commented 5 years ago

If we decrease the chunk size to 2^10, we spend a lot of time opening and closing connections. The StackOverflow post reported that a new connection is required for each append step, but I have not attempted to replicate that myself.

(Screenshot: pprof visualization of the profile with chunk_size = 2^10.)
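A quick way to check the single-connection question would be something like the sketch below (not something I have run as part of these benchmarks), which appends twice to one open connection and reads the bytes back:

f <- tempfile()
con <- file(f, "a+b")
writeBin(as.raw(1:10), con)     # first append
writeBin(as.raw(11:20), con)    # second append on the same open connection
close(con)
length(readBin(f, what = "raw", n = 20))  # 20 if a single connection is enough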

wlandau commented 5 years ago

...which makes me think we should make our slices as large as possible and avoid creating large index vectors. We may be able to use utils::head(), utils::tail(), and/or the size argument of base::writeBin().

wlandau commented 5 years ago

Nope: head() and tail() use seq_len() and seq.int() under the hood, so they do not get around ordinary slicing, and performance reflects this (benchmarks not shown). That probably means we need to either

  1. Bring long vector support to writeBin(), or
  2. Implement a version of writeBin() that allows users to choose the start and end bytes.

cc @gmbecker, @clarkfitzg.

wlandau commented 5 years ago

Well, (2) seems unlikely to work because the restriction here is the maximum value of an integer.
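To make the limit concrete, the serialized data frame from the profiling script above is already larger than anything a single integer can index (rough arithmetic, not a measured figure):

.Machine$integer.max   # 2147483647, roughly 2 GB worth of raw bytes
length(obj)            # about 1e7 * 30 * 8 = 2.4e9 bytes, already a long vector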

wlandau commented 5 years ago

I wonder if we could write compiled code to slice the raw vector more efficiently. If we know the start and end bytes, it seems like this should just be constant-time pointer arithmetic. Then, we could use base::writeBin() on the slices as before.
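A compiled slice could look roughly like the sketch below. slice_raw() is a hypothetical helper (not part of storr), it assumes Rcpp, and start/end are passed as doubles so that offsets past 2^31 - 1 stay representable while each chunk itself stays below that size:

library(Rcpp)

cppFunction(
  includes = "#include <cstring>",
  code = "
    RawVector slice_raw(RawVector x, double start, double end) {
      // 1-based start and end; offsets are held in R_xlen_t so they can
      // exceed 2^31 - 1 even though the returned chunk never does.
      R_xlen_t from = (R_xlen_t) start - 1;
      R_xlen_t len  = (R_xlen_t) end - from;
      RawVector out(len);
      std::memcpy(RAW(out), RAW(x) + from, len);
      return out;
    }
  "
)

chunk <- slice_raw(obj, 1, 2^20)  # first 2^20 bytes of the serialized object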

wlandau commented 5 years ago

Post: https://stackoverflow.com/questions/56614592/faster-way-to-slice-a-raw-vector

nbenn commented 5 years ago

I also encountered the long vector issue in writeBin() and am using your write_bin() function. Thanks! Two minor comments in case you're interested:

wlandau commented 5 years ago

Feel free to take write_bin() and run with it. Now that we have https://github.com/fstpackage/fst/issues/201, I am no longer pursuing it. I agree that the time spent in seq() is counterintuitive given that I was using R >= 3.5.0 (I do not remember the patch version).

wlandau commented 4 years ago

By the way, there is a chance that long vector support for writeBin() may arrive in R 4.0.0 (https://github.com/HenrikBengtsson/Wishlist-for-R/issues/97#issuecomment-563441539). If that happens, we can avoid repacking large objects.

https://github.com/richfitz/storr/blob/0c64f1efb0574b46059a962de9d264a74d448843/R/hash.R#L97-L98

To preserve the old behavior in R 3.x.x, I propose we make a decision in .onLoad() about what to do with raw vectors and then cache the result as a logical in a local package environment (as done similarly here).
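A rough sketch of that idea, with placeholder names only (none of this is storr's actual API), might look like:

# Package-local environment used as a cache; `cache` and the flag name are illustrative.
cache <- new.env(parent = emptyenv())

.onLoad <- function(libname, pkgname) {
  # Decide once at load time whether writeBin() handles long raw vectors,
  # assuming (per the wishlist thread above) the feature lands in R 4.0.0.
  cache$long_writebin <- getRversion() >= "4.0.0"
}

The write path could then branch on cache$long_writebin instead of re-checking the R version on every call.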