traversc / qs

Quick serialization of R objects

Memory bloat on .q file load #10

Closed. leungi closed this issue 5 years ago.

leungi commented 5 years ago

Hi,

I was promoting your qs package to another user when he realized that, upon loading .q files into an R session, the object size may be bloated by 2-3x.

I validated his findings with my own .q files.

The size shown in the RStudio Environment pane may be misleading (it reflects a size similar to the saved object), but inspecting the object via object.size() and checking memory use in the Windows Resource Monitor confirmed the bloating issue.

traversc commented 5 years ago

Please try the use_alt_rep = FALSE parameter in qread; that should reduce memory usage. However, object.size() should not show any difference, regardless of whether you use qs or not.

R also has garbage collection, which may not run right away; you can trigger it manually with gc() (although that is an R behavior, not a qs one).

Memory usage may look higher than it really is if R hasn't run a garbage collection in the background for a while.
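
In code, the suggestion amounts to something like this (a minimal sketch; the file name is a placeholder):

library(qs)
x <- qread("test.q", use_alt_rep = FALSE)  # read without the ALTREP representation for character vectors
gc()                                       # trigger a garbage collection pass manually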

leungi commented 5 years ago

@traversc: thanks for the prompt reply.

object.size() did return equal sizes; it must've been lingering variables in the environment.

However, the Windows memory usage bloat persists.

> Sys.getpid()
[1] 29440
> 
> data_FALSE <- qs::qread('test.q', use_alt_rep = FALSE)
> 
> data_TRUE <- qs::qread('test.q', use_alt_rep = TRUE)
> 
> identical(data_FALSE, data_TRUE)
[1] TRUE
> 
> object.size(data_FALSE)
2770498608 bytes
> object.size(data_TRUE)
2770498608 bytes

Running gc() has no effect.

(screenshot: qs_bloat)

traversc commented 5 years ago

Does the memory issue persist if you start a new session and only use use_alt_rep=F?

leungi commented 5 years ago

Same issue (screenshot attached).

> Sys.getpid()
[1] 29500
> data_FALSE <- qs::qread('test.q', use_alt_rep = FALSE)
> object.size(data_FALSE)
2770498608 bytes
> gc()
            used   (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells   1424532   76.1    2213915  118.3   2213915  118.3
Vcells 340659266 2599.1  582678634 4445.5 447116858 3411.3

(screenshot: qs_bloat)

traversc commented 5 years ago

I believe that is in line with expectations.

I ran the following test (on a Mac right now, but I will try it on Windows later).

Generate data: a data frame of about 2.1 GB according to object.size().

library(dplyr)
library(qs)
z <- starnames %>% dplyr::sample_n(3e7, replace=T)
qsave(z, file="/tmp/test.z")
saveRDS(z, file="/tmp/test.rds")

New session:

library(qs)
z <- readRDS("/tmp/test.rds")
gc()

Memory usage according to the ps command: 2737700 KB (1 KB = 1024 bytes)

New session:

library(qs)
z <- qread("/tmp/test.z", use_alt_rep = F)
gc()

Memory usage according to the ps command: 2475848 KB (1 KB = 1024 bytes)
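
For anyone reproducing this, the resident set size can also be queried from inside the R session; this is an assumption about the measurement, not necessarily the exact command used:

as.integer(system2("ps", c("-o", "rss=", "-p", Sys.getpid()), stdout = TRUE))  # resident set size in KB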

So there is no difference in memory usage (in fact, qs uses less than readRDS), but both are significantly higher than what object.size() reports.

I suspect this is just how R works. I'm speaking outside of my expertise here, but I suspect R reserves more memory than it is currently using so that it can quickly provision memory for any new object.
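
A minimal sketch of that effect, assuming a Unix-like system where ps is available (the rss_kb helper is hypothetical):

rss_kb <- function() as.integer(system2("ps", c("-o", "rss=", "-p", Sys.getpid()), stdout = TRUE))
before <- rss_kb()
x <- lapply(1:1e6, function(i) runif(10))  # many small allocations
rm(x); gc()                                # free the object and collect
after <- rss_kb()
c(before = before, after = after)          # 'after' often stays well above 'before'

How much memory is actually returned to the OS depends on the platform allocator, so the exact numbers will vary.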

Regardless, I believe it isn't a qs issue. Let me know if you disagree or if there is anything else.

leungi commented 5 years ago

Using the use_alt_rep = FALSE option does resolve this issue, as you suggested.

Closing this; appreciate your investigation!