qsbase / qs

Quick serialization of R objects

qs is less efficient than rds in terms of IO for ggplot objects #21

Closed moldach closed 5 years ago

moldach commented 5 years ago

I'm trying to include some large ggplot2 objects in a Shiny application, and some folks suggested I try the qs package instead of reading/writing with base R's .rds.

When I tried benchmarking, I ran into unexpected results. It looks like qs is worse in terms of IO than .rds files (especially with compress = FALSE): link to benchmarking figure

The second issue is that the bench::mark() call causes RStudio to crash. I tried this over the weekend on my 8 GB laptop and again today on my 32 GB workstation at work. I wasn't sure whether this is a bug that should be reported to your package, their package, or both.

Here's the code I ran:

library(bench)
library(qs)
library(sf)
library(cowplot)
library(dplyr)    # needed for %>% used in the plotting code below
library(ggplot2)  # cowplot 1.0.0 no longer attaches ggplot2

# load ggplot
download.file("https://www.dropbox.com/s/ao0827vayr5u3vx/hawaii_agriculture_100m_basemap.rds?raw=1" , "hawaii_agriculture_100m_basemap.rds")
hawaii <- readRDS("hawaii_agriculture_100m_basemap.rds")

# bench mark saving
save_compressed <- bench::mark(saveRDS(hawaii, "hawaii_compressed.rds"), iterations = 50)
save_uncompressed <- bench::mark(saveRDS(hawaii, "hawaii_uncompressed.rds", compress = FALSE), iterations = 50)
save_qs <- bench::mark(qsave(hawaii, "hawaii.qs"), iterations = 50)

# bench mark reading
read_compressed <- bench::mark(hawaii <- readRDS("hawaii_compressed.rds"), iterations = 50)
read_uncompressed <- bench::mark(hawaii <- readRDS("hawaii_uncompressed.rds"), iterations = 50)
read_qs <- bench::mark(hawaii <- qread("hawaii.qs"), iterations = 50)

# combine
bench_save <- rbind(save_compressed, save_uncompressed, save_qs)
bench_read <- rbind(read_compressed, read_uncompressed, read_qs)

# resort to system.time

# bench mark saving
system.time(saveRDS(hawaii, "hawaii_compressed.rds"))
system.time(saveRDS(hawaii, "hawaii_uncompressed.rds", compress = FALSE))
system.time(qsave(hawaii, "hawaii.qs"))  # this freezes my 8 GB laptop and 32 GB workstation

# bench mark reading
system.time(hawaii <- readRDS("hawaii_compressed.rds"))
system.time(hawaii <- readRDS("hawaii_uncompressed.rds"))
system.time(hawaii <- qread("hawaii.qs"))
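
# `df` used in the plots below was never defined in the original post;
# here is one assumed reconstruction, re-timing each call and keeping
# only the elapsed seconds (the names and structure are guesses)
elapsed <- function(expr) unname(system.time(expr)["elapsed"])
df <- data.frame(
  method = rep(c("rds_compressed", "rds_uncompressed", "qs"), times = 2),
  io     = rep(c("write", "read"), each = 3),
  time   = c(
    elapsed(saveRDS(hawaii, "hawaii_compressed.rds")),
    elapsed(saveRDS(hawaii, "hawaii_uncompressed.rds", compress = FALSE)),
    elapsed(qsave(hawaii, "hawaii.qs")),
    elapsed(readRDS("hawaii_compressed.rds")),
    elapsed(readRDS("hawaii_uncompressed.rds")),
    elapsed(qread("hawaii.qs"))
  )
)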

plot_01 <- df %>% dplyr::filter(io == "write") %>% ggplot(., aes(x=method, y=time, fill=method)) + 
  geom_bar(stat = "identity") + ggtitle("write time in seconds") + scale_fill_brewer(palette = "Set1") + theme_light() + theme(legend.position="none")

plot_02 <- df %>% dplyr::filter(io == "read") %>% ggplot(., aes(x=method, y=time, fill=method)) + 
  geom_bar(stat = "identity") + ggtitle("read time in seconds") + scale_fill_brewer(palette = "Set1") + theme_light() + theme(legend.position="none")

plot_grid(plot_01, plot_02)

The system.time() function is what I resorted to, since I could not get any numbers for {qs} out of bench::mark(). Here's my sessionInfo():

R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /home/tsundoku/anaconda3/lib/libmkl_rt.so

locale:
 [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C               LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8    
 [5] LC_MONETARY=en_CA.UTF-8    LC_MESSAGES=en_CA.UTF-8    LC_PAPER=en_CA.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] cowplot_1.0.0  ggplot2_3.2.1  showtext_0.7   showtextdb_2.0 sysfonts_0.8   sf_0.8-0       qs_0.19.1      bench_1.0.4   

loaded via a namespace (and not attached):
  [1] nlme_3.1-141        tsne_0.1-3          bitops_1.0-6        RcppAnnoy_0.0.13    RColorBrewer_1.1-2  httr_1.4.1         
  [7] sctransform_0.2.0   tools_3.6.1         backports_1.1.5     R6_2.4.0            irlba_2.3.3         KernSmooth_2.23-16 
 [13] DBI_1.0.0           uwot_0.1.4          lazyeval_0.2.2      colorspace_1.4-1    withr_2.1.2         npsurv_0.4-0       
 [19] tidyselect_0.2.5    gridExtra_2.3       compiler_3.6.1      plotly_4.9.0        labeling_0.3        Seurat_3.1.1       
 [25] caTools_1.17.1.2    scales_1.0.0        classInt_0.4-2      lmtest_0.9-37       ggridges_0.5.1      pbapply_1.4-2      
 [31] stringr_1.4.0       digest_0.6.22       R.utils_2.9.0       pkgconfig_2.0.3     htmltools_0.4.0     bibtex_0.4.2       
 [37] htmlwidgets_1.5.1   rlang_0.4.1         rstudioapi_0.10     RApiSerialize_0.1.0 zoo_1.8-6           jsonlite_1.6       
 [43] ica_1.0-2           gtools_3.8.1        dplyr_0.8.3         R.oo_1.22.0         magrittr_1.5        Matrix_1.2-17      
 [49] Rcpp_1.0.2          munsell_0.5.0       ape_5.3             reticulate_1.13     lifecycle_0.1.0     R.methodsS3_1.7.1  
 [55] stringi_1.4.3       gbRd_0.4-11         MASS_7.3-51.4       gplots_3.0.1.1      Rtsne_0.15          plyr_1.8.4         
 [61] grid_3.6.1          parallel_3.6.1      gdata_2.18.0        listenv_0.7.0       ggrepel_0.8.1       crayon_1.3.4       
 [67] lattice_0.20-38     splines_3.6.1       SDMTools_1.1-221.1  zeallot_0.1.0       pillar_1.4.2        igraph_1.2.4.1     
 [73] future.apply_1.3.0  reshape2_1.4.3      codetools_0.2-16    leiden_0.3.1        glue_1.3.1          lsei_1.2-0         
 [79] metap_1.1           RcppParallel_4.4.4  data.table_1.12.6   vctrs_0.2.0         png_0.1-7           Rdpack_0.11-0      
 [85] gtable_0.3.0        RANN_2.6.1          purrr_0.3.3         tidyr_1.0.0         future_1.14.0       assertthat_0.2.1   
 [91] rsvd_1.0.2          e1071_1.7-2         class_7.3-15        survival_2.44-1.1   viridisLite_0.3.0   tibble_2.1.3       
 [97] units_0.6-5         cluster_2.1.0       globals_0.12.4      fitdistrplus_1.0-14 ROCR_1.0-7     
traversc commented 5 years ago

For S4 objects (e.g. a ggplot) qs just relies on default R serialization. See: https://github.com/traversc/qs/issues/6. I'm working on more efficiently serializing S4 objects for the next version.

So currently it is expected that R serialization without compression is faster than R serialization with compression (the qs default for S4 objects).

It is surprising that bench::mark crashes; I'll look into that. It could just be a matter of calling garbage collection with gc() after every iteration.

Edit: I haven't been able to reproduce the crash using R 3.5.3 on a 16 GB laptop.
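
If the crash does turn out to be tied to bench's memory profiling of such a large object, one workaround to try (an untested sketch, not a confirmed diagnosis) is to disable allocation tracking:

library(bench)
library(qs)

# memory = FALSE skips bench's allocation profiling, a plausible (but
# unconfirmed) culprit when benchmarking multi-hundred-MB objects
save_qs <- bench::mark(
  qsave(hawaii, "hawaii.qs"),
  iterations = 50,
  memory = FALSE
)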

moldach commented 5 years ago

Okay, so I took a look at #6. It would be great if qs had more efficient serialization in the future.

For example, the hawaii_agriculture_100m_basemap.rds file is 356 MB, but with qsave(hawaii, "hawaii_compressed.qs", preset = "high") the object is a whopping 1 GB! If I set preset = "uncompressed" it's 3.2 GB.
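
For reference, a minimal way to reproduce that size comparison (file names here are illustrative):

library(qs)

# write the same object under two presets and compare on-disk sizes
qsave(hawaii, "hawaii_high.qs", preset = "high")  # the qs default preset
qsave(hawaii, "hawaii_uncompressed.qs", preset = "uncompressed")
file.info(c("hawaii_high.qs", "hawaii_uncompressed.qs"))$size / 1e6  # sizes in MB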

traversc commented 5 years ago

I did a bit of digging into why the file size was so large. The ggplot object contains a reference to its own environment, so the serialization gets a bit recursive, even with base R serialization.

Note that if you just save the data, the file size is a lot smaller regardless of method:

x <- readRDS("~/N/hawaii_agriculture_100m_basemap.rds")
mydata <- lapply(names(x), function(n) {if (n != "basemap") x[[n]]})
saveRDS(mydata, file = "~/N/temp.rds")
qsave(mydata, file = "~/N/temp.qs")

> file.info("~/N/temp.rds")$size
[1] 73451268
> file.info("~/N/temp.qs")$size
[1] 72892924

So I would suggest saving just the data rather than the ggplot object, which will be larger by a factor of 3-4x.

A little more detail if you are interested: https://github.com/tidyverse/ggplot2/issues/3619

At any rate, I'm still working on more efficiently serializing these types of complex objects.

moldach commented 5 years ago

Hi thanks for digging into this.

Just to make sure I'm understanding you correctly: you're saying you would do the entire generation of the basemap via a ggplot() call in Shiny (in the global environment) rather than saving it as an .rds?

For me there are two issues with this approach.

Thanks again!

traversc commented 5 years ago

In the ggplot issue link I posted, they suggested another solution, which is to remove the plot_env directly. A hacky approach could go like this (sketched below):

Take everything you need in the plot_env and put it into a list, then save the list along with the base plot.

When you load the data back in, reconstruct the plot_env.

This might take a little trial and error to get right, but I believe it would drastically improve file size and speed (again regardless of serialization method).
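
A minimal sketch of that idea (what goes in the list, and what plot_env actually needs, are assumptions that depend on the plot):

library(ggplot2)
library(qs)

p <- ggplot(mtcars, aes(wt, mpg)) + geom_point()

# collect whatever the plot actually needs from plot_env into a plain list
env_data <- list(mtcars = mtcars)

# drop the heavy environment before saving
p$plot_env <- emptyenv()
qsave(list(plot = p, env_data = env_data), "plot_stripped.qs")

# on load, rebuild plot_env from the saved list
obj <- qread("plot_stripped.qs")
p2 <- obj$plot
p2$plot_env <- list2env(obj$env_data, new.env(parent = globalenv()))
p2  # print to check the plot still renders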

traversc commented 5 years ago

Hopefully this is now more efficient in version 0.20.1 (on GitHub) -- it should be on CRAN soon too. Check out the numbers below:

Write:

library(microbenchmark)
library(qs)

hawaii <- readRDS("~/N/hawaii.rds")
microbenchmark(saveRDS(hawaii, "/tmp/test.rds"), qsave(hawaii, "/tmp/test.qs"), times=3)

Unit: seconds
                             expr       min        lq      mean    median       uq       max neval
 saveRDS(hawaii, "/tmp/test.rds") 39.893819 39.925774 39.963411 39.957728 39.99821 40.038684     3
    qsave(hawaii, "/tmp/test.qs")  6.532814  6.686664  6.742798  6.840515  6.84779  6.855064     3

Read:

microbenchmark(x1 <- readRDS("/tmp/test.rds"), x2 <- qread("/tmp/test.qs"), times=3)

Unit: seconds
                           expr       min        lq      mean    median       uq       max neval
 x1 <- readRDS("/tmp/test.rds") 11.352488 12.952269 14.231726 14.552049 15.67134 16.790640     3
    x2 <- qread("/tmp/test.qs")  4.984628  5.635155  6.124476  6.285683  6.69440  7.103118     3

File sizes:

file.info("/tmp/test.rds")$size / 1e6 # size in Mb
[1] 373.6634

file.info("/tmp/test.qs")$size / 1e6 # size in Mb
[1] 369.7444