richfitz / storr

:package: Object cacher for R
http://richfitz.github.io/storr

Sketch fst driver #109

Open wlandau opened 5 years ago

wlandau commented 5 years ago

This PR is a sketch of the fst driver I suggested in #108. It is based entirely on the RDS driver: instead of saving the serialized raw vector with writeBin(), we wrap it in a data frame and save it with write_fst(). Optional compression is powered by compress_fst(). If this works out, I plan to add functionality to set the serialization method based on the class/type of the object, re https://github.com/richfitz/storr/issues/77#issuecomment-476275570 (I am thinking of Keras models).
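For context, here is a minimal sketch of the save/load path that the existing RDS driver uses, and which this PR swaps out for write_fst(). Variable and path names are illustrative; the actual driver code differs.

```r
# Sketch of the RDS driver's storage scheme: serialize the value to a
# raw vector, then write the raw bytes straight to disk with writeBin().
value <- runif(10)
path <- tempfile()

raw_vec <- serialize(value, NULL)
writeBin(raw_vec, path)

# Read path: pull the raw bytes back and unserialize them.
roundtrip <- unserialize(readBin(path, what = raw(), n = file.size(path)))
stopifnot(identical(value, roundtrip))

# The fst driver in this PR instead wraps raw_vec in a data frame and
# saves it with fst::write_fst(), optionally compressing via compress_fst().
```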

A quick glance at speed is encouraging (with 4 OpenMP threads on a mid-tier Kubuntu machine). cc @MarcusKlik.

library(microbenchmark)
library(storr)
data <- runif(2.5e7)
pryr::object_size(data)
#> Registered S3 method overwritten by 'pryr':
#>   method      from
#>   print.bytes Rcpp
#> 200 MB
s1 <- storr_fst(tempfile())
s2 <- storr_rds(tempfile())
microbenchmark(
  fst = s1$set("x", data),
  rds = s2$set("x", data),
  times = 1
)
#> Unit: seconds
#>  expr       min        lq      mean    median        uq       max neval
#>   fst  1.568269  1.568269  1.568269  1.568269  1.568269  1.568269     1
#>   rds 22.373394 22.373394 22.373394 22.373394 22.373394 22.373394     1

Created on 2019-06-17 by the reprex package (v0.3.0)

Issues and questions

MarcusKlik commented 5 years ago

Hi @wlandau, if I understand correctly, you will always be serializing the complete raw vector to disk as a whole (just like saveRDS()).

Because you do not need random access to the stored vector, the metadata stored in the fst file is a bit of overkill. You might also test a setup where you compress the raw vector using compress_fst() and then store the result using saveRDS(raw_vec, compress = FALSE).

Perhaps you can add this option to the benchmark and see what happens!

best
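The suggested round trip, sketched end to end with illustrative variable names (the choice of ZSTD as the compressor is my assumption; compress_fst() defaults to LZ4):

```r
library(fst)

data <- runif(1e5)

# Serialize to a raw vector, compress it with fst's multithreaded
# compressor, then store the already-compressed blob via saveRDS()
# with compress = FALSE so R does not compress it a second time.
raw_vec <- serialize(data, NULL)
cmp <- compress_fst(raw_vec, compressor = "ZSTD")
path <- tempfile()
saveRDS(cmp, path, compress = FALSE)

# Read path: readRDS() the blob, decompress, and unserialize.
out <- unserialize(decompress_fst(readRDS(path)))
stopifnot(identical(data, out))
```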

wlandau commented 5 years ago

Thanks for the useful advice, @MarcusKlik! Looks like we might be able to use fst compression in the existing RDS storr and reap most/all of the benefits. Really exciting!

library(fst)
library(magrittr)
library(microbenchmark)
library(pryr)
#> Registered S3 method overwritten by 'pryr':
#>   method      from
#>   print.bytes Rcpp
library(storr)
data <- runif(2.5e8)
object_size(data)
#> 2 GB
serialize(data, NULL, ascii = FALSE) %>%
  compress_fst() %>%
  object_size()
#> 1.28 GB
s1 <- storr_fst(tempfile())
s2 <- storr_fst(tempfile(), compress = TRUE)
s3 <- storr_rds(tempfile())
f <- function(data) {
  raw <- serialize(data, NULL, ascii = FALSE)
  cmp <- compress_fst(raw)
  saveRDS(cmp, tempfile(), compress = FALSE)
}
microbenchmark(
  fst = s1$set("x", data),
  fst_cmp = s2$set("x", data),
  rds = s3$set("x", data),
  f = f(data),
  times = 1
)
#> Unit: seconds
#>     expr        min         lq       mean     median         uq        max
#>      fst  12.043528  12.043528  12.043528  12.043528  12.043528  12.043528
#>  fst_cmp  12.056381  12.056381  12.056381  12.056381  12.056381  12.056381
#>      rds 206.944071 206.944071 206.944071 206.944071 206.944071 206.944071
#>        f   8.960977   8.960977   8.960977   8.960977   8.960977   8.960977
#>  neval
#>      1
#>      1
#>      1
#>      1

Created on 2019-06-18 by the reprex package (v0.3.0)

wlandau commented 5 years ago

@MarcusKlik, I think https://github.com/richfitz/storr/pull/109#issuecomment-503054001 is the best answer to https://stackoverflow.com/questions/56614592/faster-way-to-slice-a-raw-vector. If you post it, I will accept and grant you the reputation points.

wlandau commented 5 years ago

After https://github.com/fstpackage/fst/commit/c8bfde98eab20da9224e934254ae2bff12378867, https://github.com/richfitz/storr/pull/109/commits/519c38de8adb8e038ab17d08834405eee6adf43b, and https://github.com/richfitz/storr/pull/109/commits/d06e6f7e40a119636bf06850bc26cc4d3e7195ba, storr_fst() can handle data over 2 GB.

library(digest)
library(fst)
library(pryr)
#> Registered S3 method overwritten by 'pryr':
#>   method      from
#>   print.bytes Rcpp
library(storr)
x <- runif(3e8)
object_size(x)
#> 2.4 GB
s <- storr_fst(tempfile())
s$set(key = "x", value = x, use_cache = FALSE)
y <- s$get("x", use_cache = FALSE)
digest(x) == digest(y)
#> [1] TRUE

Created on 2019-06-18 by the reprex package (v0.3.0)

wlandau commented 5 years ago

After looking at https://github.com/fstpackage/fst/commit/c8bfde98eab20da9224e934254ae2bff12378867, I merged #109 (storr_fst(), powered by write_fst() and read_fst()) and #111 (compress_fst() within RDS storrs) into https://github.com/wlandau/storr/tree/fst-compare for the sake of head-to-head benchmarking. Below, the most useful comparison is fst_none (#109) versus rds_cmpr (#111). It looks like we can save large-ish data faster with #109, but we can read small data faster with #111. Overall, I would say #109 wins in terms of speed, but not by much, at least compared to the current default behavior of storr.

So do we go with #109 or #111? Maybe both? #109 is a bit faster for large data, but the implementation is more complicated, and it requires the development version of fst at the moment. #111 keeps the storr internals simpler, and it opens up the compress argument to allow more types of compression options in future development.

2+ GB data benchmarks:

library(digest)
library(fst)
library(microbenchmark)
library(pryr)
#> Registered S3 method overwritten by 'pryr':
#>   method      from
#>   print.bytes Rcpp
library(storr)
x <- runif(3e8)
object_size(x)
#> 2.4 GB
fst_none <- storr_fst(tempfile(), compress = "none")
fst_cmpr <- storr_fst(tempfile(), compress = "fst")
rds_none <- storr_rds(tempfile(), compress = "none")
rds_cmpr <- storr_rds(tempfile(), compress = "fst")
rds_default <- storr_rds(tempfile(), compress = "gzfile")
microbenchmark::microbenchmark(
  fst_none = fst_none$set(key = "x", value = x, use_cache = FALSE),
  fst_cmpr = fst_cmpr$set(key = "x", value = x, use_cache = FALSE),
  rds_none = rds_none$set(key = "x", value = x, use_cache = FALSE),
  rds_cmpr = rds_cmpr$set(key = "x", value = x, use_cache = FALSE),
  rds_default = rds_default$set(key = "x", value = x, use_cache = FALSE),
  times = 1
)
#> Repacking large object
#> Repacking large object
#> Unit: seconds
#>         expr        min         lq       mean     median         uq
#>     fst_none   9.943442   9.943442   9.943442   9.943442   9.943442
#>     fst_cmpr  16.299916  16.299916  16.299916  16.299916  16.299916
#>     rds_none  13.164031  13.164031  13.164031  13.164031  13.164031
#>     rds_cmpr  16.212105  16.212105  16.212105  16.212105  16.212105
#>  rds_default 263.741792 263.741792 263.741792 263.741792 263.741792
#>         max neval
#>    9.943442     1
#>   16.299916     1
#>   13.164031     1
#>   16.212105     1
#>  263.741792     1
microbenchmark::microbenchmark(
  fst_none = fst_none$get(key = "x", use_cache = FALSE),
  fst_cmpr = fst_cmpr$get(key = "x", use_cache = FALSE),
  rds_none = rds_none$get(key = "x", use_cache = FALSE),
  rds_cmpr = rds_cmpr$get(key = "x", use_cache = FALSE),
  rds_default = rds_default$get(key = "x", use_cache = FALSE),
  times = 1
)
#> Unit: seconds
#>         expr       min        lq      mean    median        uq       max
#>     fst_none  3.997610  3.997610  3.997610  3.997610  3.997610  3.997610
#>     fst_cmpr  9.683951  9.683951  9.683951  9.683951  9.683951  9.683951
#>     rds_none  2.891874  2.891874  2.891874  2.891874  2.891874  2.891874
#>     rds_cmpr  4.645483  4.645483  4.645483  4.645483  4.645483  4.645483
#>  rds_default 12.585657 12.585657 12.585657 12.585657 12.585657 12.585657
#>  neval
#>      1
#>      1
#>      1
#>      1
#>      1

Created on 2019-06-18 by the reprex package (v0.3.0)

Small data benchmarks:

library(digest)
library(fst)
library(microbenchmark)
library(pryr)
#> Registered S3 method overwritten by 'pryr':
#>   method      from
#>   print.bytes Rcpp
library(storr)
fst_none <- storr_fst(tempfile(), compress = "none")
fst_cmpr <- storr_fst(tempfile(), compress = "fst")
rds_none <- storr_rds(tempfile(), compress = "none")
rds_cmpr <- storr_rds(tempfile(), compress = "fst")
rds_default <- storr_rds(tempfile(), compress = "gzfile")
microbenchmark::microbenchmark(
  fst_none = fst_none$set(key = "x", value = runif(1), use_cache = FALSE),
  fst_cmpr = fst_cmpr$set(key = "x", value = runif(1), use_cache = FALSE),
  rds_none = rds_none$set(key = "x", value = runif(1), use_cache = FALSE),
  rds_cmpr = rds_cmpr$set(key = "x", value = runif(1), use_cache = FALSE),
  rds_default = rds_default$set(key = "x", value = runif(1), use_cache = FALSE)
)
#> Unit: microseconds
#>         expr     min       lq     mean   median       uq      max neval
#>     fst_none 264.506 273.3300 291.2796 287.7015 304.0750  488.368   100
#>     fst_cmpr 284.696 292.3405 333.9626 312.5445 326.6185 1240.974   100
#>     rds_none 231.126 240.5605 256.2129 252.6815 266.9450  438.347   100
#>     rds_cmpr 249.724 257.7000 301.7618 267.2055 285.0685 2301.713   100
#>  rds_default 239.317 253.3230 295.7192 300.4380 319.8265  815.004   100
microbenchmark::microbenchmark(
  fst_none = fst_none$get(key = "x", use_cache = FALSE),
  fst_cmpr = fst_cmpr$get(key = "x", use_cache = FALSE),
  rds_none = rds_none$get(key = "x", use_cache = FALSE),
  rds_cmpr = rds_cmpr$get(key = "x", use_cache = FALSE),
  rds_default = rds_default$get(key = "x", use_cache = FALSE)
)
#> Unit: microseconds
#>         expr     min       lq      mean   median       uq       max neval
#>     fst_none 119.854 127.4390 484.00473 131.7765 144.9625 34477.090   100
#>     fst_cmpr 125.987 131.3730 144.91864 135.6095 143.7470   237.292   100
#>     rds_none  76.280  81.6835 115.04263  85.8315  93.7595  2406.733   100
#>     rds_cmpr  91.058  95.7800 111.86227  99.9195 106.4455   513.389   100
#>  rds_default  76.906  81.3260  90.50064  86.5245  93.3555   156.133   100

Created on 2019-06-18 by the reprex package (v0.3.0)