wlandau opened 5 years ago
Hi @wlandau, if I understand correctly, you will always be serializing the complete raw vector to disk as a whole (just like saveRDS()).
Because you do not need random access to the stored vector, the meta-data stored in the fst file is a bit of overkill. You might also test a setup where you compress the raw vector using compress_fst() and then store the result using saveRDS(raw_vec, compress = FALSE). That setup does not go through write_fst(), so it might be a little slower. Perhaps you can add this option to the benchmark and see what happens!
Best
Thanks for the useful advice, @MarcusKlik! Looks like we might be able to use fst compression in the existing RDS storr and reap most/all of the benefits. Really exciting!
library(fst)
library(magrittr)
library(microbenchmark)
library(pryr)
#> Registered S3 method overwritten by 'pryr':
#> method from
#> print.bytes Rcpp
library(storr)
data <- runif(2.5e8)
object_size(data)
#> 2 GB
serialize(data, NULL, ascii = FALSE) %>%
compress_fst() %>%
object_size()
#> 1.28 GB
s1 <- storr_fst(tempfile())
s2 <- storr_fst(tempfile(), compress = TRUE)
s3 <- storr_rds(tempfile())
f <- function(data) {
  # MarcusKlik's suggested setup: serialize, compress the raw vector with
  # compress_fst(), then save it with saveRDS(compress = FALSE).
  raw <- serialize(data, NULL, ascii = FALSE)
  cmp <- compress_fst(raw)
  saveRDS(cmp, tempfile(), compress = FALSE)
}
microbenchmark(
fst = s1$set("x", data),
fst_cmp = s2$set("x", data),
rds = s3$set("x", data),
f = f(data),
times = 1
)
#> Unit: seconds
#> expr min lq mean median uq max
#> fst 12.043528 12.043528 12.043528 12.043528 12.043528 12.043528
#> fst_cmp 12.056381 12.056381 12.056381 12.056381 12.056381 12.056381
#> rds 206.944071 206.944071 206.944071 206.944071 206.944071 206.944071
#> f 8.960977 8.960977 8.960977 8.960977 8.960977 8.960977
#> neval
#> 1
#> 1
#> 1
#> 1
Created on 2019-06-18 by the reprex package (v0.3.0)
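For completeness, here is a sketch (not from the thread) of the read path that would pair with f() above. It assumes you keep the file path that f() currently discards, and it simply reverses the two steps with readRDS(), decompress_fst(), and unserialize().

g <- function(path) {
  # path: the file written by saveRDS(cmp, path, compress = FALSE) in f()
  cmp <- readRDS(path)        # raw vector of compressed bytes
  raw <- decompress_fst(cmp)  # reverse compress_fst()
  unserialize(raw)            # rebuild the original R object
}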
@MarcusKlik, I think https://github.com/richfitz/storr/pull/109#issuecomment-503054001 is the best answer to https://stackoverflow.com/questions/56614592/faster-way-to-slice-a-raw-vector. If you post it, I will accept it and grant you the reputation points.
After https://github.com/fstpackage/fst/commit/c8bfde98eab20da9224e934254ae2bff12378867, https://github.com/richfitz/storr/pull/109/commits/519c38de8adb8e038ab17d08834405eee6adf43b, and https://github.com/richfitz/storr/pull/109/commits/d06e6f7e40a119636bf06850bc26cc4d3e7195ba, storr_fst()
can handle data over 2 GB.
library(digest)
library(fst)
library(pryr)
#> Registered S3 method overwritten by 'pryr':
#> method from
#> print.bytes Rcpp
library(storr)
x <- runif(3e8)
object_size(x)
#> 2.4 GB
s <- storr_fst(tempfile())
s$set(key = "x", value = x, use_cache = FALSE)
y <- s$get("x", use_cache = FALSE)
digest(x) == digest(y)
#> [1] TRUE
Created on 2019-06-18 by the reprex package (v0.3.0)
After looking at https://github.com/fstpackage/fst/commit/c8bfde98eab20da9224e934254ae2bff12378867, I merged #109 (storr_fst(), powered by write_fst() and read_fst()) and #111 (compress_fst() within RDS storrs) into https://github.com/wlandau/storr/tree/fst-compare for the sake of head-to-head benchmarking. Below, the most useful comparison is fst_none (storr_fst(), powered by write_fst() and read_fst()) versus rds_cmpr (compress_fst() within RDS storrs). It looks like we can save large-ish data faster with #109, but we can read small data faster with #111. Overall, I would say #109 wins on speed, but not by much, especially relative to how much both options improve on storr's current default behavior.
So do we go with #109 or #111? Maybe both? #109 is a bit faster for large data, but the implementation is more complicated, and it requires the development version of fst at the moment. #111 keeps the storr internals simpler, and it opens up the compress argument to allow more types of compression options in future development.
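For concreteness, here is a rough sketch of how an RDS-style save path could dispatch on a compress string the way #111 does conceptually, which is what makes it easy to add more compressors later. This is not storr's actual internals; the function and argument names are illustrative, and it assumes fst is attached.

save_serialized <- function(value, path, compress = c("none", "fst", "gzfile")) {
  # Serialize once, then pick a byte-level compression strategy by name.
  compress <- match.arg(compress)
  raw <- serialize(value, NULL, ascii = FALSE)
  if (compress == "fst") {
    raw <- compress_fst(raw)  # compress the raw vector before writing
  }
  con <- if (compress == "gzfile") gzfile(path, "wb") else file(path, "wb")
  on.exit(close(con))
  writeBin(raw, con)
}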
2+ GB data benchmarks:
library(digest)
library(fst)
library(microbenchmark)
library(pryr)
#> Registered S3 method overwritten by 'pryr':
#> method from
#> print.bytes Rcpp
library(storr)
x <- runif(3e8)
object_size(x)
#> 2.4 GB
fst_none <- storr_fst(tempfile(), compress = "none")
fst_cmpr <- storr_fst(tempfile(), compress = "fst")
rds_none <- storr_rds(tempfile(), compress = "none")
rds_cmpr <- storr_rds(tempfile(), compress = "fst")
rds_default <- storr_rds(tempfile(), compress = "gzfile")
microbenchmark::microbenchmark(
fst_none = fst_none$set(key = "x", value = x, use_cache = FALSE),
fst_cmpr = fst_cmpr$set(key = "x", value = x, use_cache = FALSE),
rds_none = rds_none$set(key = "x", value = x, use_cache = FALSE),
rds_cmpr = rds_cmpr$set(key = "x", value = x, use_cache = FALSE),
rds_default = rds_default$set(key = "x", value = x, use_cache = FALSE),
times = 1
)
#> Repacking large object
#> Repacking large object
#> Unit: seconds
#> expr min lq mean median uq
#> fst_none 9.943442 9.943442 9.943442 9.943442 9.943442
#> fst_cmpr 16.299916 16.299916 16.299916 16.299916 16.299916
#> rds_none 13.164031 13.164031 13.164031 13.164031 13.164031
#> rds_cmpr 16.212105 16.212105 16.212105 16.212105 16.212105
#> rds_default 263.741792 263.741792 263.741792 263.741792 263.741792
#> max neval
#> 9.943442 1
#> 16.299916 1
#> 13.164031 1
#> 16.212105 1
#> 263.741792 1
microbenchmark::microbenchmark(
fst_none = fst_none$get(key = "x", use_cache = FALSE),
fst_cmpr = fst_cmpr$get(key = "x", use_cache = FALSE),
rds_none = rds_none$get(key = "x", use_cache = FALSE),
rds_cmpr = rds_cmpr$get(key = "x", use_cache = FALSE),
rds_default = rds_default$get(key = "x", use_cache = FALSE),
times = 1
)
#> Unit: seconds
#> expr min lq mean median uq max
#> fst_none 3.997610 3.997610 3.997610 3.997610 3.997610 3.997610
#> fst_cmpr 9.683951 9.683951 9.683951 9.683951 9.683951 9.683951
#> rds_none 2.891874 2.891874 2.891874 2.891874 2.891874 2.891874
#> rds_cmpr 4.645483 4.645483 4.645483 4.645483 4.645483 4.645483
#> rds_default 12.585657 12.585657 12.585657 12.585657 12.585657 12.585657
#> neval
#> 1
#> 1
#> 1
#> 1
#> 1
Created on 2019-06-18 by the reprex package (v0.3.0)
Small data benchmarks:
library(digest)
library(fst)
library(microbenchmark)
library(pryr)
#> Registered S3 method overwritten by 'pryr':
#> method from
#> print.bytes Rcpp
library(storr)
fst_none <- storr_fst(tempfile(), compress = "none")
fst_cmpr <- storr_fst(tempfile(), compress = "fst")
rds_none <- storr_rds(tempfile(), compress = "none")
rds_cmpr <- storr_rds(tempfile(), compress = "fst")
rds_default <- storr_rds(tempfile(), compress = "gzfile")
microbenchmark::microbenchmark(
fst_none = fst_none$set(key = "x", value = runif(1), use_cache = FALSE),
fst_cmpr = fst_cmpr$set(key = "x", value = runif(1), use_cache = FALSE),
rds_none = rds_none$set(key = "x", value = runif(1), use_cache = FALSE),
rds_cmpr = rds_cmpr$set(key = "x", value = runif(1), use_cache = FALSE),
rds_default = rds_default$set(key = "x", value = runif(1), use_cache = FALSE)
)
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> fst_none 264.506 273.3300 291.2796 287.7015 304.0750 488.368 100
#> fst_cmpr 284.696 292.3405 333.9626 312.5445 326.6185 1240.974 100
#> rds_none 231.126 240.5605 256.2129 252.6815 266.9450 438.347 100
#> rds_cmpr 249.724 257.7000 301.7618 267.2055 285.0685 2301.713 100
#> rds_default 239.317 253.3230 295.7192 300.4380 319.8265 815.004 100
microbenchmark::microbenchmark(
fst_none = fst_none$get(key = "x", use_cache = FALSE),
fst_cmpr = fst_cmpr$get(key = "x", use_cache = FALSE),
rds_none = rds_none$get(key = "x", use_cache = FALSE),
rds_cmpr = rds_cmpr$get(key = "x", use_cache = FALSE),
rds_default = rds_default$get(key = "x", use_cache = FALSE)
)
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> fst_none 119.854 127.4390 484.00473 131.7765 144.9625 34477.090 100
#> fst_cmpr 125.987 131.3730 144.91864 135.6095 143.7470 237.292 100
#> rds_none 76.280 81.6835 115.04263 85.8315 93.7595 2406.733 100
#> rds_cmpr 91.058 95.7800 111.86227 99.9195 106.4455 513.389 100
#> rds_default 76.906 81.3260 90.50064 86.5245 93.3555 156.133 100
Created on 2019-06-18 by the reprex package (v0.3.0)
This PR is a sketch of the fst driver I suggested in #108. It is completely based on the RDS driver. Instead of saving the serialized raw vector with writeBin(), we wrap it in a data frame and save it with write_fst(). Optional compression is powered by compress_fst(). If this works out, I plan to add functionality to set the serialization method based on the class/type of the object, re https://github.com/richfitz/storr/issues/77#issuecomment-476275570 (thinking of Keras models).
A quick glance at speed is encouraging (with 4 OpenMP threads on a mid-tier Kubuntu machine). cc @MarcusKlik.
Created on 2019-06-17 by the reprex package (v0.3.0)
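To make the description above concrete, here is a minimal sketch of the save/load idea. The function names are illustrative rather than the PR's actual driver code, it assumes a version of fst that can store raw columns (which is part of why #109 currently needs development fst), and it glosses over the >2 GB handling discussed earlier in the thread.

fst_save_value <- function(value, path, compress = 0) {
  # Wrap the serialized raw vector in a one-column data frame and let
  # write_fst() handle the write (and optional compression).
  raw <- serialize(value, NULL, ascii = FALSE)
  write_fst(data.frame(value = raw), path, compress = compress)
}
fst_read_value <- function(path) {
  # Reverse the steps: read the data frame back and unserialize the column.
  unserialize(read_fst(path)$value)
}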
Issues and questions
R6_driver_fst currently inherits from R6_driver_rds. Should it? A lot of code needs to be copied anyway because of hard-coded references to RDS, and I am not sure how I feel about the current one-foot-in-and-one-foot-out approach.
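For reference, the kind of inheritance in question looks roughly like this. It is a hypothetical illustration only: the method names follow storr's driver interface loosely and are assumptions here, and R6_driver_rds is assumed to be available as an R6 generator.

library(R6)
R6_driver_fst <- R6Class(
  "driver_fst",
  inherit = R6_driver_rds,  # reuse the RDS driver, overriding selected methods
  public = list(
    set_object = function(hash, value) {
      # override only the value write path to use write_fst()
    },
    get_object = function(hash) {
      # override the value read path to use read_fst()
    }
  )
)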