qsbase / qs

Quick serialization of R objects

when using a for-loop and creating a variable by reference (data.table), no variable is created #22

Closed by emilBeBri 5 years ago

emilBeBri commented 5 years ago

Hi, when I load a data.table that was saved as a qs file and then create variables by reference inside a for-loop, the variables are not created, like so:

library(data.table)
library(qs)

dat1 <- data.table(id=c(1,1,1,2,2,2), runif(6,0,5))
dat2 <- data.table(id=c(3,3,3,4,4,4), runif(6,0,5))
qsave(dat1, './dat1.qs', preset='high')
dat1 <- qread('./dat1.qs')
qsave(dat2, './dat2.qs', preset='high')
dat2 <- qread('./dat2.qs')

for (DT in c('dat1', 'dat2')) {
    get(DT)[, newcol := 1]
}

If, however, I do not save the data.table as a qs object, or if I create the variable for each individual data.table rather than in a loop, everything is fine. I also tried the argument use_alt_rep=F, but that does not help. Calling copy() right after loading the qs objects also works, but that seems very inefficient for big data.tables.

greetings, Emil

sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 19.04

Matrix products: default
BLAS/LAPACK: /opt/intel/compilers_and_libraries_2019.2.187/linux/mkl/lib/intel64_lin/libmkl_rt.so

locale:
 [1] LC_CTYPE=en_DK.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_DK.UTF-8        LC_COLLATE=en_DK.UTF-8    
 [5] LC_MONETARY=en_DK.UTF-8    LC_MESSAGES=en_DK.UTF-8   
 [7] LC_PAPER=en_DK.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_DK.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.12.6 qs_0.19.1        

loaded via a namespace (and not attached):
[1] colorspace_1.4-1    bit_1.1-14          compiler_3.6.1     
[4] Rcpp_1.0.2          bit64_0.9-7         RApiSerialize_0.1.0

traversc commented 5 years ago

Hi @emilBeBri, this is an issue with data.table and saving to disk, not with qs. You get the same result with other serialization methods; in the example below, newcol is missing after the loop:

library(data.table)

dat1 <- data.table(id=c(1,1,1,2,2,2), runif(6,0,5), z = letters[1:6])
dat2 <- data.table(id=c(3,3,3,4,4,4), runif(6,0,5), z = letters[1:6])
saveRDS(dat1, file = "/tmp/dat1.rds")
dat1 <- readRDS("/tmp/dat1.rds")
saveRDS(dat2, file = "/tmp/dat2.rds")
dat2 <- readRDS("/tmp/dat2.rds")

for (DT in c('dat1', 'dat2')) {
  get(DT)[, newcol := 1]
}

dat1
   id        V2 z
1:  1 2.2551426 a
2:  1 0.3937463 b
3:  1 1.4704248 c
4:  2 4.7696833 d
5:  2 3.8110676 e
6:  2 4.4503739 f

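The reason is that data.table keeps an internal self-reference, a C-level pointer to the object's location in memory. A pointer cannot survive serialization, so after a round trip through disk it comes back as a null pointer: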

> attributes(dat1)
...
$.internal.selfref
<pointer: 0x0>
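
The effect of the round trip can be inspected directly. The following is a minimal sketch, assuming base saveRDS()/readRDS() as the serializer and a tempfile() path chosen only for illustration; the variable names (fresh, loaded, tl_fresh, tl_loaded) are made up for this example. It compares data.table's truelength() before and after the round trip and pulls out the self-reference attribute.

```r
library(data.table)

# A freshly created data.table is over-allocated so that columns
# can be appended in place.
fresh <- data.table(id = c(1, 1, 2), x = c(0.1, 0.2, 0.3))
tl_fresh <- truelength(fresh)

# Round trip through disk (the path is illustrative only).
path <- tempfile(fileext = ".rds")
saveRDS(fresh, path)
loaded <- readRDS(path)
tl_loaded <- truelength(loaded)

# The attribute is still there, but it is an external pointer that
# no longer refers to the object.
selfref <- attr(loaded, ".internal.selfref")

print(c(fresh = tl_fresh, loaded = tl_loaded))
print(typeof(selfref))
```

As far as I understand it, when `:=` meets a table in this state it has to re-allocate first and then rebind the result to the variable name in the calling scope. With a plain name such as dat1 that rebinding is possible; with get(DT) there is no name to rebind, which would explain why only the loop version loses the new column.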

A quick fix would be to re-wrap the data.table:

library(data.table)
library(qs)

dat1 <- data.table(id=c(1,1,1,2,2,2), runif(6,0,5))
dat2 <- data.table(id=c(3,3,3,4,4,4), runif(6,0,5))
qsave(dat1, './dat1.qs', preset='high')
dat1 <- data.table(qread('./dat1.qs'))
qsave(dat2, './dat2.qs', preset='high')
dat2 <- data.table(qread('./dat2.qs'))

for (DT in c('dat1', 'dat2')) {
  get(DT)[, newcol := 1]
}

> dat1
   id         V2 newcol
1:  1 0.08454065      1
2:  1 4.36837604      1
3:  1 3.49920527      1
4:  2 0.02507492      1
5:  2 1.65069644      1
6:  2 1.29881559      1

Hopefully that answers your question. If it does, please feel free to close :)

emilBeBri commented 5 years ago

Nice! Wrapping it in data.table() is a neat trick to circumvent this. I just checked on the example data: you can also do it with setDT(), which avoids making any copy at all, so this is probably the most efficient solution (although perhaps more error-prone?).
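
For reference, the setDT() variant might look roughly like the sketch below. To keep it self-contained it uses base saveRDS()/readRDS() with tempfile() paths as a stand-in for qsave()/qread(), which is an assumption of this example; the thread above shows the same behaviour with both. The names f1 and f2 are illustrative.

```r
library(data.table)

dat1 <- data.table(id = c(1, 1, 1, 2, 2, 2), V2 = runif(6, 0, 5))
dat2 <- data.table(id = c(3, 3, 3, 4, 4, 4), V2 = runif(6, 0, 5))

# Illustrative temp files; swap in qsave()/qread() when using qs.
f1 <- tempfile(fileext = ".rds")
f2 <- tempfile(fileext = ".rds")
saveRDS(dat1, f1)
saveRDS(dat2, f2)
dat1 <- readRDS(f1)
dat2 <- readRDS(f2)

# Repair the self-reference in place; the column data is not copied.
# setDT() is called on each variable name directly, not via get().
setDT(dat1)
setDT(dat2)

# By-reference assignment inside the loop now reaches the originals.
for (DT in c("dat1", "dat2")) {
  get(DT)[, newcol := 1]
}

print(names(dat1))
print(names(dat2))
```

One thing to watch: setDT() needs to be given the variable name itself so it can rebind the repaired table, so it is called once per table here rather than inside the get() loop.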