richfitz / storr

:package: Object cacher for R
http://richfitz.github.io/storr

writeBin() in chunks #107

Open wlandau opened 5 years ago

wlandau commented 5 years ago

Continuing from #106. I propose we try this solution and append to RDS files in chunks using writeBin(). From the benchmarks in https://github.com/richfitz/storr/issues/103#issuecomment-502097274, I think it will be well worth the trouble. I will work on a PR.
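For context, the limitation being worked around is that writeBin() cannot write a raw vector longer than 2^31 - 1 bytes in a single call. A minimal way to see the failure is roughly the following (it needs a bit over 2 GB of memory to run):

x <- raw(2^31)            # one byte past the 2^31 - 1 limit
writeBin(x, tempfile())   # errors on R versions without long vector support in writeBin()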

wlandau commented 5 years ago

The current solution from this post spends a ton of time slicing the serialized data. Ideally, we should work around this so the speed approaches what we would expect from writeBin(). The following profiling study requires jointprof, profile, and pprof.

write_bin <- function(
  value,
  filename,
  chunk_size = 2L^20L,
  compress = FALSE
) {
  total_size <- length(value)
  start_byte <- 0L
  while (start_byte < total_size) {
    # Take bytes start_byte + 1 through end_byte (1-based indexing).
    end_byte <- min(start_byte + chunk_size, total_size)
    this_chunk <- value[seq(start_byte + 1, end_byte, by = 1)]
    # Reopen the connection in append mode for every chunk.
    con <- con_bin(filename, compress, "a+b")
    writeBin(this_chunk, con)
    close(con)
    start_byte <- start_byte + chunk_size
  }
}

con_bin <- function(filename, compress, open) {
  # Open either a gzip-compressed or a plain binary file connection.
  if (compress) {
    gzfile(filename, open)
  } else {
    file(filename, open)
  }
}
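As a quick sanity check (separate from the profiling below), a small object can be written in several chunks and read back to confirm the bytes survive intact; the variable names here are just illustrative:

small <- serialize(runif(1e5), NULL)
path <- tempfile()
write_bin(small, path, chunk_size = 2L^16L)
identical(readBin(path, what = "raw", n = length(small)), small)  # should be TRUE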

nrow <- 1e7
ncol <- 30
data <- as.data.frame(matrix(runif(nrow * ncol), nrow = nrow, ncol = ncol))
obj <- serialize(data, NULL, ascii = FALSE, xdr = TRUE)

library(profile)
library(jointprof)

rprof_file <- tempfile()
proto_file <- tempfile()
Rprof(filename = rprof_file)
write_bin(obj, tempfile())
Rprof(NULL)
data <- read_rprof(rprof_file)
write_pprof(data, proto_file)
file.copy(proto_file, "~/Downloads/profile.proto")
system2(
  find_pprof(),
  c(
    "-http",
    "0.0.0.0:8888",
    proto_file
  )
)

(Screenshot: pprof visualization of the write_bin() profile with the default chunk_size of 2^20.)

Unsurprisingly, things seem to speed up as the chunks get larger. Maybe that's good enough for a first implementation. Later, we might think about diving into https://github.com/wch/r-source/blob/81b9a6700d78ab73a5c948bf6d96d1701ab48039/src/main/connections.c#L4387 and adding start and stop byte arguments to do_writebin().

wlandau commented 5 years ago

If we decrease the chunk size to 2^10, we spend a lot of time opening and closing connections. The StackOverflow post reported that a new connection is required for each append step, but I have not attempted to replicate that myself.

(Screenshot: pprof visualization of the profile with chunk_size = 2^10.)
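A quick way to check the single-connection question would be something like the sketch below (not something I have run as part of these benchmarks), which appends twice to one open connection and reads the bytes back:

f <- tempfile()
con <- file(f, "a+b")
writeBin(as.raw(1:10), con)     # first append
writeBin(as.raw(11:20), con)    # second append on the same open connection
close(con)
length(readBin(f, what = "raw", n = 20))  # 20 if a single connection is enough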

wlandau commented 5 years ago

...which makes me think we should make our slices as large as possible and avoid creating large index vectors. We may be able to use utils::head(), utils::tail(), and/or the size argument of base::writeBin().

wlandau commented 5 years ago

Nope: head() and tail() use seq_len() and seq.int() under the hood, so they do not get around ordinary slicing, and performance reflects this (benchmarks not shown). That probably means we need to either

  1. Bring long vector support to writeBin(), or
  2. Implement a version of writeBin() that allows users to choose the start and end bytes.

cc @gmbecker, @clarkfitzg.

wlandau commented 5 years ago

Well, (2) seems unlikely to work because the restriction here is the maximum value of an integer.
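To make the limit concrete, the serialized data frame from the profiling script above is already larger than anything a single integer can index (rough arithmetic, not a measured figure):

.Machine$integer.max   # 2147483647, roughly 2 GB worth of raw bytes
length(obj)            # about 1e7 * 30 * 8 = 2.4e9 bytes, already a long vector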

wlandau commented 5 years ago

I wonder if we could write compiled code to slice the raw vector more efficiently. If we know the start and end bytes, it seems like this should just be constant-time pointer arithmetic. Then, we could use base::writeBin() on the slices as before.
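A compiled slice could look roughly like the sketch below. slice_raw() is a hypothetical helper (not part of storr), it assumes Rcpp, and start/end are passed as doubles so that offsets past 2^31 - 1 stay representable while each chunk itself stays below that size:

library(Rcpp)

cppFunction(
  includes = "#include <cstring>",
  code = "
    RawVector slice_raw(RawVector x, double start, double end) {
      // 1-based start and end; offsets are held in R_xlen_t so they can
      // exceed 2^31 - 1 even though the returned chunk never does.
      R_xlen_t from = (R_xlen_t) start - 1;
      R_xlen_t len  = (R_xlen_t) end - from;
      RawVector out(len);
      std::memcpy(RAW(out), RAW(x) + from, len);
      return out;
    }
  "
)

chunk <- slice_raw(obj, 1, 2^20)  # first 2^20 bytes of the serialized object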

wlandau commented 5 years ago

Post: https://stackoverflow.com/questions/56614592/faster-way-to-slice-a-raw-vector

nbenn commented 5 years ago

I also encountered the long vector issue in writeBin() and am using your write_bin() function. Thanks! Two minor comments in case you're interested:

wlandau commented 5 years ago

Feel free to take write_bin() and run with it. Now that we have https://github.com/fstpackage/fst/issues/201, I am no longer pursuing it. I agree that the time spent in seq() is counterintuitive given that I was using R >= 3.5.0 (I do not remember the patch version).

wlandau commented 4 years ago

By the way, there is a chance that long vector support for writeBin() may arrive in R 4.0.0 (https://github.com/HenrikBengtsson/Wishlist-for-R/issues/97#issuecomment-563441539). If that happens, we can avoid repacking large objects.

https://github.com/richfitz/storr/blob/0c64f1efb0574b46059a962de9d264a74d448843/R/hash.R#L97-L98

To preserve the old behavior in R 3.x.x, I propose we make a decision in .onLoad() about what to do with raw vectors and then cache the result as a logical in a local package environment (as done similarly here).
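A rough sketch of that idea, with placeholder names only (none of this is storr's actual API), might look like:

# Package-local environment used as a cache; `cache` and the flag name are illustrative.
cache <- new.env(parent = emptyenv())

.onLoad <- function(libname, pkgname) {
  # Decide once at load time whether writeBin() handles long raw vectors,
  # assuming (per the wishlist thread above) the feature lands in R 4.0.0.
  cache$long_writebin <- getRversion() >= "4.0.0"
}

The write path could then branch on cache$long_writebin instead of re-checking the R version on every call.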