hannes / MonetDBLite

MonetDB reconfigured as a library

fast CSV export #188

Closed inkrement closed 7 years ago

inkrement commented 7 years ago

I want to export a table (2 columns, ~100 GB, 700M rows) to a CSV file. The official documentation recommends using the COPY command, so I tried it like this:

DBI::dbSendQuery(con, "COPY SELECT * FROM comments_en INTO '/home/chris/long/path/comments_en.csv'")

But the query has now been running for more than a day, the output file is still empty, and the R session is using a single core at 100% (the machine has 64 cores and the HDD sustains 600 MB/s for reads and writes, so this should take minutes). Is there a better way to export a table to a file, or how can I optimise this?
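For reference, a minimal variant of the same export with explicit delimiters, per MonetDB's COPY INTO syntax (the \\n and \" are R string escapes for the SQL literals; table and path as above):

DBI::dbSendQuery(con, "COPY SELECT * FROM comments_en INTO '/home/chris/long/path/comments_en.csv' USING DELIMITERS ',', '\\n', '\"'")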

PedroTadim commented 7 years ago

The CSV export functionality is not working properly on MonetDBLite. I found a possible problem and I am fixing it for the Java version. Later on we can try to merge it with the other versions.

inkrement commented 7 years ago

Ok, thanks. Then I'll try to load the whole table into a data.frame (we should have enough memory for that) and then dump it to a CSV file using the readr package... let's see if it works.
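A minimal sketch of that approach, assuming the table fits in memory and that con is the open MonetDBLite connection:

library(DBI)
library(readr)

# pull the full table into a data.frame (needs RAM for the whole result set)
comments <- dbGetQuery(con, "SELECT * FROM comments_en")

# stream the data.frame to disk as CSV
write_csv(comments, "/home/chris/long/path/comments_en.csv")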

inkrement commented 7 years ago

@PedroTadim I tried to read and process the whole table in chunks like this:

rs <- dbSendQuery(con, "SELECT * FROM comments_en")

i <- 0
while (!dbHasCompleted(rs)) {
  print(paste0('load next batch: ', i))
  chunk <- dbFetch(rs, 5000000)

  # do some fancy stuff

  i <- i + 1
}
dbClearResult(rs)

But the dbSendQuery call is also really slow (it has again been running for more than 2 hours now). Is there another option to read a big table in batches?

PedroTadim commented 7 years ago

You can use LIMIT and OFFSET clauses in the select query to fetch a portion of the table at a time. Also avoid using the * operator, so the compilation of the query goes slightly faster.
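A minimal sketch of that suggestion, paging with LIMIT/OFFSET and appending each page to the CSV; the column names col1 and col2 are placeholders for the table's actual columns:

library(DBI)
library(readr)

batch_size <- 5000000
offset <- 0
out_file <- "/home/chris/long/path/comments_en.csv"

repeat {
  # explicit column list instead of *, paged with LIMIT/OFFSET
  q <- sprintf("SELECT col1, col2 FROM comments_en LIMIT %d OFFSET %d",
               batch_size, offset)
  chunk <- dbGetQuery(con, q)
  if (nrow(chunk) == 0) break

  # write the header only for the first page, then append
  write_csv(chunk, out_file, append = offset > 0)
  offset <- offset + batch_size
}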

inkrement commented 7 years ago

Do you know if it slows down as the offset becomes higher, or does it utilise indices to jump over those blocks? Or are there any other options to export the data?

PedroTadim commented 7 years ago

I am studying the MonetDB kernel every day, but I still don't know it very deeply yet :( As far as I know, MonetDB uses a very efficient row indexing scheme; however, it relies on memory mapping to do so. Which platform are you using? The memory mapping functionality has some performance issues on Windows, which has a large impact on MonetDB. In MonetDB there are other export options such as msqldump, but in MonetDBLite we are constrained to the SQL parser, so as far as I know the CSV export is the only option, besides manipulating the farm's files directly. I also fixed the CSV export functionality in the Java version, so @hannesmuehleisen and @Mytherin can merge the changes.

inkrement commented 7 years ago

I tested it, and it looks like the execution time of a query with increasing offset becomes quite stable after some time:

[attached plot: query execution time vs. OFFSET]

I'll close this issue for now, but CSV export would be a nice feature anyway.