hannes / MonetDBLite

MonetDB reconfigured as a library

union_all() fills up disk when passed many tables #197

Open nalimilan opened 6 years ago

nalimilan commented 6 years ago

I need to concatenate about 160 tables stored in a MonetDBLite database into a single table. I used to do this with the SQLite backend to dplyr, like this: Reduce(union_all, tables_list). Unfortunately, this doesn't work with MonetDBLite: the database grows from 3.5GB to 70GB, and the merge fails once there is no free disk space left.

I've eventually found a workaround by splitting the operation into smaller parts:

indiv1 <- compute(Reduce(union_all, indl[1:50]))
indiv2 <- compute(Reduce(union_all, indl[51:100]))
indiv3 <- compute(Reduce(union_all, indl[101:150]))
indiv4 <- compute(Reduce(union_all, indl[151:length(indl)]))
indiv <- Reduce(union_all, list(indiv1, indiv2, indiv3, indiv4))

In the end, the database only takes 8.5GB (including original small tables and the big concatenated table).

Is this expected? FWIW, I've checked that the SQL commands generated by dplyr are very clean, i.e. a series of (SELECT * FROM TABLE1) UNION ALL .... I wonder whether some temporary files are not freed as they should be in the middle of the operation.
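For reference, the shape of the generated SQL can be inspected without MonetDBLite, e.g. against dplyr's in-memory SQLite backend (a sketch; the tables t1/t2/t3 are made up for illustration):

```r
library(dplyr)
library(dbplyr)

# Sketch: inspect the SQL dbplyr generates for a chained union_all,
# using dplyr's in-memory SQLite source instead of MonetDBLite.
con <- src_memdb()
t1 <- copy_to(con, data.frame(x = 1), "t1", overwrite = TRUE)
t2 <- copy_to(con, data.frame(x = 2), "t2", overwrite = TRUE)
t3 <- copy_to(con, data.frame(x = 3), "t3", overwrite = TRUE)

# show_query() prints the query without running it; the result is a
# flat chain of (SELECT ...) UNION ALL (SELECT ...) UNION ALL ...
show_query(Reduce(union_all, list(t1, t2, t3)))
```

So the query itself carries no nesting or duplication that would obviously explain the disk blow-up.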

hannes commented 6 years ago

Could you perhaps provide me with a reproducible example? It's possible there is an issue here.

nalimilan commented 6 years ago

Unfortunately, I'm not allowed to share the dataset. As I said, it's made of about 160 individual files with 10,000 to 600,000 rows, for a total of 19M rows. There are 50 variables, most of them strings with a small number of unique values. I can try to generate random data to reproduce the problem, but maybe you already have toy datasets for that?

hannes commented 6 years ago

I don't have a toy dataset, so generated files that provoke the issue would be welcome!

nalimilan commented 6 years ago

Here's a simple example where the database takes 500MB before calling union_all and 1.1GB after, but up to 28GB during the operation:

library(dplyr)
library(dbplyr)
library(MonetDBLite)
x <- sample(LETTERS[1:10], 100000, replace=TRUE)
df <- data.frame(x=x)
# Add 50 more string columns with few unique values, as in the real data
for(i in 1:50)
    df[paste0("x", i)] <- x

db <- src_monetdblite("test.monetdb", create=TRUE)

# Copy the same data frame into 100 persistent tables
tables <- list()
for(i in 1:100)
    tables[[i]] <- copy_to(db, df, paste0("table", i), temporary=FALSE)

# Concatenate all 100 tables into one materialized table
total <- compute(Reduce(union_all, tables))

hannes commented 6 years ago

I'm getting

> total <- compute(Reduce(union_all, tables))
Error in UseMethod("compute") : 
  no applicable method for 'compute' applied to an object of class "list"

nalimilan commented 6 years ago

Hmm, weird. Is that with the latest released dplyr/dbplyr? I'm using dplyr_0.7.4 and dbplyr_1.1.0. What's the traceback()?

hannes commented 6 years ago

> traceback()
1: compute(Reduce(union_all, tables))
> sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] MonetDBLite_0.5.0 dbplyr_1.1.0      dplyr_0.7.4      

nalimilan commented 6 years ago

Actually, that was just due to a silly mistake: I should have used tables[[i]]. Not sure why I posted different code from the one I ran. Please try with the updated code above.

hannes commented 6 years ago

Confirmed, there is something really fishy going on here. Thanks for creating the example!