ropensci / drake

An R-focused pipeline toolkit for reproducibility and high-performance computing
https://docs.ropensci.org/drake
GNU General Public License v3.0

Add qs as a format option #1121

Closed: kendonB closed this issue 4 years ago

kendonB commented 4 years ago


Proposal

This may or may not get implemented in storr as a general backend, but it may be worth adding qs directly to drake as a format = "qs" option if it is easy enough?

https://github.com/richfitz/storr/issues/104
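To make the proposal concrete, the idea is that a qs format would slot into the existing per-target format interface, something like this (hypothetical usage, since format = "qs" does not exist yet; produce_large_data() is just a placeholder):

library(drake)

plan <- drake_plan(
  large_data = target(
    produce_large_data(), # placeholder for an expensive step
    format = "qs"         # proposed option, analogous to the existing format = "fst"
  )
)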

wlandau commented 4 years ago

Implementation would be easy (perhaps too easy). But in the interest of keeping the number of supported formats from getting out of hand, I would first like to understand more about how qs behaves in practice. The benchmarks in the README do look promising, but they only cover one kind of data over a narrow size range. If we can cast a wider net and clearly describe the situations where qs excels most, we can support qs and offer a clear recommendation in the docs.

My own benchmarks so far are not that impressive. Perhaps we need larger and more complicated data for qs to really shine. That's the sort of thing I would like to know.

library(microbenchmark)
library(qs)
#> qs v0.20.1: better serialization of S4 objects, see 'ChangeLog'
x <- 1
microbenchmark(
  wb = writeBin(x, tempfile()),
  rf = saveRDS(x, tempfile(), compress = FALSE),
  qs = qsave(x, tempfile())
)
#> Unit: microseconds
#>  expr    min      lq     mean median      uq     max neval
#>    wb 28.759 30.8775 32.60451 32.296 33.7700  47.332   100
#>    rf 29.086 30.6370 33.01477 31.796 32.6530 116.532   100
#>    qs 45.168 47.2715 53.92454 48.364 50.1735 541.282   100
x <- runif(1e8)
microbenchmark(
  wb = writeBin(x, tempfile()),
  rf = saveRDS(x, tempfile(), compress = FALSE),
  qs = qsave(x, tempfile()),
  times = 1
)
#> Unit: milliseconds
#>  expr       min        lq      mean    median        uq       max neval
#>    wb  623.1701  623.1701  623.1701  623.1701  623.1701  623.1701     1
#>    rf  850.3506  850.3506  850.3506  850.3506  850.3506  850.3506     1
#>    qs 1827.9655 1827.9655 1827.9655 1827.9655 1827.9655 1827.9655     1
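A broader benchmark would look something like the sketch below: a mixed-type data frame instead of a plain numeric vector, with compressed saveRDS() included for comparison (the columns are invented purely for illustration):

library(microbenchmark)
library(qs)

# A larger, less uniform object than a bare numeric vector:
# mixed integer, character, and double columns.
n <- 1e6
df <- data.frame(
  id    = seq_len(n),
  group = sample(letters, n, replace = TRUE),
  value = rnorm(n),
  label = sample(c("low", "medium", "high"), n, replace = TRUE),
  stringsAsFactors = FALSE
)

microbenchmark(
  rds_uncompressed = saveRDS(df, tempfile(), compress = FALSE),
  rds_compressed   = saveRDS(df, tempfile()),
  qs               = qsave(df, tempfile()),
  times = 5
)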
kendonB commented 4 years ago

I will note that part of the benefit of qs is its fast compression, so I don't think saveRDS(x, tempfile(), compress = FALSE) is the right comparison. My use of qs in the wild has been quite impressive, for example when saving spatial objects.

Perhaps we could step back and consider an option that leverages rio before going down a format rabbit hole.
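For instance, rio dispatches on the file extension, so a single option could piggyback on it instead of drake hard-coding each backend. A rough sketch of the idea:

library(rio)

x <- data.frame(a = 1:3, b = letters[1:3])

# rio picks the serialization method from the file extension,
# so one interface covers many formats; other extensions
# (xlsx, json, feather, ...) work the same way when the
# backing packages are installed.
export(x, "x.rds")
export(x, "x.csv")
y <- import("x.csv")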

wlandau commented 4 years ago

Yeah, compression matters. With saveRDS(), you get either a massive runtime or a massive file, and qsave() seems to avoid both extremes. Perhaps a qs backend is easy to justify after all.

library(qs)
#> qs v0.20.1: better serialization of S4 objects, see 'ChangeLog'
library(pryr)
#> Registered S3 method overwritten by 'pryr':
#>   method      from
#>   print.bytes Rcpp

x <- runif(1e7)
object_size(x)
#> 80 MB

rf <- tempfile()
rt <- tempfile()
ql <- tempfile()
qz <- tempfile()

system.time(saveRDS(x, rf, compress = FALSE))
#>    user  system elapsed 
#>   0.101   0.032   0.132

system.time(saveRDS(x, rt, compress = TRUE))
#>    user  system elapsed 
#>  10.290   0.008  10.339

system.time(qsave(x, ql, algorithm = "lz4"))
#>    user  system elapsed 
#>   0.187   0.052   0.239

system.time(qsave(x, qz, algorithm = "zstd"))
#>    user  system elapsed 
#>   0.193   0.040   0.233

file.size(rf) / 1e6
#> [1] 80.00003

file.size(rt) / 1e6
#> [1] 53.25956

file.size(ql) / 1e6
#> [1] 41.80436

file.size(qz) / 1e6
#> [1] 41.80436

Created on 2019-12-21 by the reprex package (v0.3.0)