Closed kendonB closed 4 years ago
Implementation would be easy (perhaps too easy). But in the interest of controlling the number of formats we add so things do not get out of hand, I would first like to understand more about the behavior of qs in practice. The benchmarks in the README do look promising, but they are only on one kind of data and on a very narrow size range. If we can cast a wider net and clearly describe the situations in which qs excels the most, we can support qs and offer a clear recommendation in the docs.
My own benchmarks so far are not that impressive. Perhaps we need larger and more complicated data for qs to really shine. That's the sort of thing I would like to know.
library(microbenchmark)
library(qs)
#> qs v0.20.1: better serialization of S4 objects, see 'ChangeLog'
x <- 1
microbenchmark(
  wb = writeBin(x, tempfile()),
  rf = saveRDS(x, tempfile(), compress = FALSE),
  qs = qsave(x, tempfile())
)
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> wb 28.759 30.8775 32.60451 32.296 33.7700 47.332 100
#> rf 29.086 30.6370 33.01477 31.796 32.6530 116.532 100
#> qs 45.168 47.2715 53.92454 48.364 50.1735 541.282 100
x <- runif(1e8)
microbenchmark(
  wb = writeBin(x, tempfile()),
  rf = saveRDS(x, tempfile(), compress = FALSE),
  qs = qsave(x, tempfile()),
  times = 1
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> wb 623.1701 623.1701 623.1701 623.1701 623.1701 623.1701 1
#> rf 850.3506 850.3506 850.3506 850.3506 850.3506 850.3506 1
#> qs 1827.9655 1827.9655 1827.9655 1827.9655 1827.9655 1827.9655 1
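One way to cast the wider net mentioned above is a small harness that runs several writers over several kinds of data and records both elapsed time and file size. This is only a sketch: bench_formats(), the dataset names, and the sizes are made up for illustration, and qs is included only if it happens to be installed.

```r
# Hypothetical helper (not part of drake, storr, or qs): time each writer on
# each dataset and record the size of the file it produces.
bench_formats <- function(datasets, writers) {
  rows <- list()
  for (d in names(datasets)) {
    for (w in names(writers)) {
      path <- tempfile()
      elapsed <- system.time(writers[[w]](datasets[[d]], path))[["elapsed"]]
      rows[[length(rows) + 1]] <- data.frame(
        data = d,
        writer = w,
        seconds = elapsed,
        megabytes = file.size(path) / 1e6,
        stringsAsFactors = FALSE
      )
      unlink(path)
    }
  }
  do.call(rbind, rows)
}

set.seed(1)
# Illustrative datasets only; real tests should use larger, messier objects.
datasets <- list(
  numeric = runif(1e5),
  integer = sample.int(100L, 1e5, replace = TRUE),
  character = sample(month.name, 1e5, replace = TRUE),
  data_frame = data.frame(
    x = runif(1e4),
    y = sample(letters, 1e4, replace = TRUE),
    stringsAsFactors = FALSE
  )
)

writers <- list(
  rds_raw = function(x, path) saveRDS(x, path, compress = FALSE),
  rds_gzip = function(x, path) saveRDS(x, path, compress = TRUE)
)
# Add qs only when the package is available.
if (requireNamespace("qs", quietly = TRUE)) {
  writers$qs <- function(x, path) qs::qsave(x, path)
}

results <- bench_formats(datasets, writers)
print(results)
```

Scaling the dataset sizes up and adding nested lists or spatial objects would show where each writer starts to pull ahead.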
I will note that part of the benefit of qs is its fast compression, so I don't think saveRDS(x, tempfile(), compress = FALSE) is the right comparison. My use of qs in the wild has been quite impressive, for example when saving spatial formats.
Perhaps we could step back and consider an option that leverages rio before going into a format rabbit hole.
Yeah, compression matters. With saveRDS(), you get either a massive runtime or a massive file. qsave() seems to avoid both extremes. Perhaps a qs backend is easy to justify after all.
library(qs)
#> qs v0.20.1: better serialization of S4 objects, see 'ChangeLog'
library(pryr)
#> Registered S3 method overwritten by 'pryr':
#> method from
#> print.bytes Rcpp
x <- runif(1e7)
object_size(x)
#> 80 MB
rf <- tempfile()
rt <- tempfile()
ql <- tempfile()
qz <- tempfile()
system.time(saveRDS(x, rf, compress = FALSE))
#> user system elapsed
#> 0.101 0.032 0.132
system.time(saveRDS(x, rt, compress = TRUE))
#> user system elapsed
#> 10.290 0.008 10.339
system.time(qsave(x, ql, algorithm = "lz4"))
#> user system elapsed
#> 0.187 0.052 0.239
system.time(qsave(x, qz, algorithm = "zstd"))
#> user system elapsed
#> 0.193 0.040 0.233
file.size(rf) / 1e6
#> [1] 80.00003
file.size(rt) / 1e6
#> [1] 53.25956
file.size(ql) / 1e6
#> [1] 41.80436
file.size(qz) / 1e6
#> [1] 41.80436
Created on 2019-12-21 by the reprex package (v0.3.0)
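The lz4 and zstd files above come out the same size. A possible explanation, which is an assumption based on my reading of the qs documentation rather than something verified in this thread, is that qsave() ignores the algorithm argument unless preset = "custom" is set. A minimal sketch of a comparison that requests each algorithm explicitly (the vector size is arbitrary):

```r
# Assumption: per the qs docs, `algorithm` only takes effect with
# preset = "custom"; otherwise the preset decides the algorithm.
if (requireNamespace("qs", quietly = TRUE)) {
  x <- runif(1e6)
  ql <- tempfile()
  qz <- tempfile()
  qs::qsave(x, ql, preset = "custom", algorithm = "lz4")
  qs::qsave(x, qz, preset = "custom", algorithm = "zstd")
  # File sizes in MB for each algorithm.
  sizes <- c(lz4 = file.size(ql), zstd = file.size(qz)) / 1e6
  print(sizes)
  # Round-trip check: both files should restore the original vector.
  stopifnot(identical(qs::qread(ql), x), identical(qs::qread(qz), x))
}
```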
Prework
drake's code of conduct.
Proposal
This may or may not get implemented in storr as a backend for all files, but it may be worth just doing this directly in drake as an option format = "qs" if it's easy enough? https://github.com/richfitz/storr/issues/104
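For concreteness, here is roughly what the proposed option might look like from the user's side, assuming it is exposed through the existing format argument of target() in drake_plan(). The target name and command are placeholders, and whether "qs" is accepted depends on the installed drake version.

```r
# Sketch of the proposed usage; `big_vector` and its command are placeholders.
if (requireNamespace("drake", quietly = TRUE)) {
  plan <- drake::drake_plan(
    big_vector = target(runif(1e7), format = "qs")
  )
  # The plan should carry a `format` column that drake consults when it
  # stores the target.
  print(plan)
}
```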