Open nalimilan opened 6 years ago
Could you perhaps provide me with a reproducible example? It's possible there is an issue here.
Unfortunately, I'm not allowed to share the dataset. As I said, it's made of about 160 individual files with 10,000 to 600,000 rows, for a total of 19M rows. There are 50 variables, most of them strings with a small number of unique values. I can try to generate random data to reproduce the problem, but maybe you already have toy datasets for that?
I don't have a toy dataset, so generated files that provoke the issue would be welcome!
Here's a simple example where the database takes 500MB before calling union_all and 1.1GB after, but up to 28GB during the operation:
library(dplyr)
library(dbplyr)
library(MonetDBLite)
x <- sample(LETTERS[1:10], 100000, replace=TRUE)
df <- data.frame(x=x)
for (i in 1:50)
  df[paste0("x", i)] <- x
db <- src_monetdblite("test.monetdb", create=TRUE)
tables <- list()
for (i in 1:100)
  tables[[i]] <- copy_to(db, df, paste0("table", i), temporary = FALSE)
total <- compute(Reduce(union_all, tables))
I'm getting
> total <- compute(Reduce(union_all, tables))
Error in UseMethod("compute") :
no applicable method for 'compute' applied to an object of class "list"
Hmm, weird. Is that with the latest released dplyr/dbplyr? I'm using dplyr_0.7.4 and dbplyr_1.1.0. What's the traceback()?
> traceback()
1: compute(Reduce(union_all, tables))
> sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] MonetDBLite_0.5.0 dbplyr_1.1.0 dplyr_0.7.4
Actually that was just because of a silly mistake: I should have used tables[[i]]. Not sure why I posted different code from the one I ran. Please try with the updated code above.
Confirmed, there is something really fishy going on here. Thanks for creating the example!
I need to concatenate about 160 tables stored in a MonetDBLite database into a single table. I used to do so with the SQLite backend to dplyr, like this: Reduce(union_all, tables_list). Unfortunately, this doesn't work with MonetDBLite: the database grows from 3.5GB to 70GB, and the merge fails once there's no free disk space left.

I've eventually found a workaround by splitting the operation into smaller parts.
In the end, the database only takes 8.5GB (including the original small tables and the big concatenated table).
Is this expected? FWIW, I've checked that the SQL commands generated by dplyr are very clean, i.e. a series of (SELECT * FROM TABLE1) UNION ALL .... I was wondering whether some temporary files are not being freed as they should be in the middle of the operation.