For S4 objects (e.g. a ggplot), qs just relies on default R serialization (see https://github.com/traversc/qs/issues/6). I'm working on more efficiently serializing S4 objects for the next version.
So currently it is expected that R serialization without compression is faster than R serialization with compression (the qs default for S4 objects).
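As a rough illustration of that point (the "Demo" class below is a made-up toy, not the user's data):
library(methods)
library(qs)
setClass("Demo", slots = c(x = "numeric"))
obj <- new("Demo", x = rnorm(5e6))
system.time(saveRDS(obj, "/tmp/demo.rds", compress = FALSE))  # base R serialization, no compression
system.time(saveRDS(obj, "/tmp/demo_gz.rds"))                 # base R serialization + gzip compression
system.time(qsave(obj, "/tmp/demo.qs"))                       # qs: falls back to R serialization for S4, then compresses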
It is surprising that bench::mark crashes; I'll look into that. It could just be a matter of calling garbage collection with gc() after every iteration.
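A rough, untested sketch of that idea (hawaii is the object used in the benchmarks later in this thread):
library(bench)
library(qs)
hawaii <- readRDS("~/N/hawaii.rds")
res <- bench::mark(
  rds = { saveRDS(hawaii, "/tmp/test.rds"); gc(); NULL },
  qs  = { qsave(hawaii, "/tmp/test.qs"); gc(); NULL },
  check = FALSE,      # only timing matters here, skip result checking
  filter_gc = FALSE,  # keep the iterations that triggered gc in the summary
  iterations = 3
)
res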
Edit: I haven't been able to reproduce the crash using R 3.5.3 on a 16 GB laptop.
Okay, so I took a look at #6 - it would be great if qs had more efficient serialization in the future.
For example, the hawaii_agriculture_100m_basemap.rds file is 356 MB, but with qsave(hawaii, preset = "high", "hawaii_compressed.qs") the object is a whopping 1 GB! If I set preset = "uncompressed", it's 3.2 GB.
I did a bit of digging into why the file size was so large. The ggplot object contains a reference to its own environment, so the serialization gets a bit recursive, even with base R serialization.
Note that if you just save the data, the file size is a lot smaller regardless of method:
x <- readRDS("~/N/hawaii_agriculture_100m_basemap.rds")
# replace the basemap element with NULL, keeping just the underlying data
mydata <- lapply(names(x), function(n) { if (n != "basemap") x[[n]] })
saveRDS(mydata, file = "~/N/temp.rds")
qsave(mydata, file = "~/N/temp.qs")
> file.info("~/N/temp.rds")$size
[1] 73451268
> file.info("~/N/temp.qs")$size
[1] 72892924
So I would suggest saving just the data rather than the full ggplot object, which is 3-4x larger.
A little more detail if you are interested: https://github.com/tidyverse/ggplot2/issues/3619
At any rate, I'm still working on more efficiently serializing these types of complex objects.
Hi, thanks for digging into this.
Just to make sure I'm understanding you correctly: you're saying you would do the entire generation of the basemap via a ggplot() call in Shiny (in the global environment) rather than saving it as a .Rds?
For me there are two issues with this approach.
The first is that using a canvas speeds up rendering of plots (I've benchmarked this on my data too).
The second is that the spatial data required to make these ~15 basemaps is considerably large. I'm already at 80% of my Git Large File Storage quota without any of the data for these basemaps. I suppose I will need to find alternative ways to store data with a project (without paying...). Not sure what those are at the moment, and it's beyond this issue.
Thanks again!
In the ggplot2 issue I linked above, they suggested another solution, which was to remove the plot_env directly. A hacky approach could look like this (sketched in code below):
1. Take everything you need in the plot_env and put it into a list, then save the list along with the base plot.
2. When you load the data back in, reconstruct the plot_env.
This might take a little trial and error to get right, but I believe it would drastically improve file size and speed (again, regardless of serialization method).
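A minimal sketch of those two steps (the object names are illustrative, and how much it helps depends on what your plot's environment actually captures):
library(ggplot2)
library(qs)

p <- ggplot(mtcars, aes(wt, mpg)) + geom_point()

## 1. Copy what the plot needs out of plot_env into a plain list, then
##    replace plot_env with a fresh environment before saving.
env_data <- list(example_var = 1)              # e.g. mget(c("var1", "var2"), envir = p$plot_env)
p$plot_env <- new.env(parent = globalenv())
qsave(list(plot = p, env = env_data), "/tmp/basemap_stripped.qs")

## 2. On load, rebuild plot_env from the saved list before printing.
obj <- qread("/tmp/basemap_stripped.qs")
p2 <- obj$plot
p2$plot_env <- list2env(obj$env, envir = new.env(parent = globalenv()))
p2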
Hopefully this is now more efficient in version 0.20.1 (on GitHub) -- it should be on CRAN soon too. Check out the numbers below:
Write:
library(microbenchmark)
library(qs)
hawaii <- readRDS("~/N/hawaii.rds")
microbenchmark(saveRDS(hawaii, "/tmp/test.rds"), qsave(hawaii, "/tmp/test.qs"), times=3)
Unit: seconds
                             expr       min        lq      mean    median       uq       max neval
 saveRDS(hawaii, "/tmp/test.rds") 39.893819 39.925774 39.963411 39.957728 39.99821 40.038684     3
    qsave(hawaii, "/tmp/test.qs")  6.532814  6.686664  6.742798  6.840515  6.84779  6.855064     3
Read:
microbenchmark(x1 <- readRDS("/tmp/test.rds"), x2 <- qread("/tmp/test.qs"), times=3)
Unit: seconds
                           expr       min        lq      mean    median       uq       max neval
 x1 <- readRDS("/tmp/test.rds") 11.352488 12.952269 14.231726 14.552049 15.67134 16.790640     3
    x2 <- qread("/tmp/test.qs")  4.984628  5.635155  6.124476  6.285683  6.69440  7.103118     3
File sizes:
file.info("/tmp/test.rds")$size / 1e6 # size in Mb
[1] 373.6634
file.info("/tmp/test.qs")$size / 1e6 # size in Mb
[1] 369.7444
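To try the GitHub version mentioned above before it reaches CRAN, something like this should work (assuming the remotes package is installed):
# install.packages("remotes")
remotes::install_github("traversc/qs")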
I'm trying to include some large ggplot2 objects in a Shiny application, and some folks suggested that I try out the qs package instead of reading/writing with base R's .rds. When I tried benchmarking I ran into unexpected results. It looks like qs is worse in terms of IO than .rds files (especially when compression = FALSE): link to benchmarking figure

The second issue is that the bench::mark() function causes RStudio to crash - I tried this over the weekend on my 8 GB laptop and then on my 32 GB workstation at work today. I wasn't sure if this is a bug that should be reported to your package, their package, or both. Here's the code I ran:

The system.time() function is what I resorted to, since I could not get information on {qs} from bench::mark(). Here's my sessionInfo():