h2oai / db-benchmark

reproducible benchmark of database-like ops
https://h2oai.github.io/db-benchmark
Mozilla Public License 2.0

try arrow/feather format for speeding up data load #105

Closed (jangorecki closed this issue 3 years ago)

jangorecki commented 4 years ago

The R arrow package is now on CRAN. When we used feather in the past we were getting segfaults: in R it didn't even work for 1e7 rows, and in Python it failed at 1e9. We should try the new arrow format, which might speed up reading data. Related issues: https://github.com/h2oai/db-benchmark/issues/47 https://github.com/h2oai/db-benchmark/issues/99

jangorecki commented 4 years ago

A first try gives quite a nice speed-up. The problem is the installation process, which seems to be very problematic, see https://github.com/apache/arrow/issues/6435. It is easy to install using conda:

```
conda create --name arrow
conda activate arrow
conda install --name arrow -c conda-forge r-arrow
```

but that interferes with our current virtualenv/condaenv/R libraries, so it is not the way to go.
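
A possible non-conda route (my own sketch, not from this thread) would be to install the CRAN package into a project-local R library and let it download a prebuilt libarrow C++ binary; the NOT_CRAN/LIBARROW_BINARY variables come from the arrow installation vignette, and the "./library" path is a hypothetical project-local location:

```r
# install arrow from CRAN into an isolated library, avoiding the conda environment
Sys.setenv(NOT_CRAN = "true", LIBARROW_BINARY = "true")  # full-featured build / prebuilt C++ binary, if available
install.packages("arrow", lib = "./library")             # "./library" is a hypothetical project-local lib path
```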

jangorecki commented 3 years ago

Looking at the arrow docs now, it seems that writing/reading an arrow file will use the feather format. Interestingly, we already tried this format in the past and it didn't work very well: https://github.com/h2oai/db-benchmark/issues/47#issuecomment-443438216 Now feather has a v2 option (the Apache Arrow IPC file format), which is worth checking; maybe the segfaults have been fixed and we can use it.
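
As a reference point, a minimal sketch (mine, not from the thread) of what checking v2 could look like; write_feather() in the arrow R package takes a version argument, and version 2 (the Arrow IPC file format) is the default in recent releases:

```r
library(arrow)
df = data.frame(id = 1:5, v = runif(5))    # toy data
write_feather(df, "v2.fea", version = 2)   # feather v2 = Arrow IPC file format
write_feather(df, "v1.fea", version = 1)   # legacy feather v1, the format that segfaulted before
d = read_feather("v2.fea")
```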

jangorecki commented 3 years ago

From R it seems to work fine. If it is good from Python as well, then we can incorporate it.

```r
## first session: read the csv with data.table and write a feather file
R -q
library(data.table)
setDTthreads(0L)
system.time(df <- fread("data/G1_1e9_1e2_0_0.csv", showProgress=TRUE, stringsAsFactors=TRUE, data.table=FALSE))
library(arrow)
write_feather(df, "G1_1e9_1e2_0_0.fea")
str(df)
q("no")

## second session: read the feather file back
R -q
library(arrow)
system.time(d <- read_feather("G1_1e9_1e2_0_0.fea"))
str(d)
q("no")
```

jangorecki commented 3 years ago

Code to create feather files for all the csv files currently in use:

```r
library(arrow)
library(data.table)
setDTthreads(0L)
# names of the groupby datasets currently in use
g1 = sprintf("G1_%s_%s_0_%s", rep(c("1e7","1e8","1e9"), each=4L), rep(c("1e2","1e1","2e0","1e2"), times=3L), rep(c("0","0","0","1"), times=3L))
# names of the join datasets currently in use (the NA entry is the LHS table)
j1 = sprintf("J1_%s_%s_0_0", rep(c("1e7","1e8","1e9"), each=4L), trimws(gsub("e+0","e", format(c(sapply(c(1e7,1e8,1e9), `/`, c(NA,1e6,1e3,1e0))), digits=1), fixed=TRUE)))
# read one csv and write it back out as a feather file
csv2fea = function(dn) {
  cat("csv2fea:", dn, "\n")
  df = fread(sprintf("data/%s.csv", dn), showProgress=FALSE, stringsAsFactors=TRUE, data.table=FALSE)
  arrow::write_feather(df, sprintf("data/%s.fea", dn))
  rm(df)
  TRUE
}
sapply(c(g1, j1), csv2fea)
```

jangorecki commented 3 years ago

This change is blocked for now by arrow's Capacity error "array cannot contain more than 2147483646 bytes", described in https://github.com/apache/arrow/issues/8732 (it is a limitation in the R arrow package which is planned to be addressed in 3.0.0). The error happens when trying to write the G1_1e9_2e0_0_0 dataset.
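
A rough way to see why G1_1e9_2e0_0_0 in particular hits the limit (my own back-of-the-envelope sketch; the ~5e8 distinct id3 values and the "id0000000001" format are assumptions about the data generator): estimate how many bytes a string column needs as a single contiguous array and compare it against the 2147483646-byte cap.

```r
# hypothetical helper: total bytes of string data in a character/factor column
str_bytes = function(x) sum(nchar(as.character(x), type = "bytes"))
str_bytes(sprintf("id%010d", 1:1000))   # toy check: 1000 values of 12 bytes each
# for 1e9 rows with only 2 groups, id3 would have roughly 5e8 distinct 12-character
# values, i.e. about 6e9 bytes of string data, far above the 2147483646-byte cap
5e8 * nchar("id0000000001")
```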

Moreover, there are some issues in other software (e.g. cudf and clickhouse) around reading arrow files.

Code related to that change will stay in https://github.com/h2oai/db-benchmark/tree/arrow branch for now.

nealrichardson commented 3 years ago

Have you considered using Parquet files instead of Feather? Parquet files are generally smaller on disk and also very fast to load. See https://ursalabs.org/blog/2020-feather-v2/ and https://arrow.apache.org/faq/ for some discussion of the tradeoffs. (Arrow and Parquet are closely related, and the C++/Python/R implementations of Parquet are part of the Arrow libraries.)

I'm not sure about the cudf and clickhouse issues you note above, but Dask and Spark can read Parquet files directly. Regarding the Arrow R package issue you reported, Parquet doesn't help there because the data goes through the Arrow format on its way to Parquet.
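
For completeness, a small sketch (mine, not from this comment) of the Parquet route using the same arrow R package; write_parquet()/read_parquet() are the Parquet counterparts of write_feather()/read_feather():

```r
library(arrow)
df = data.frame(id1 = sprintf("id%03d", sample(100, 1e5, replace=TRUE)), v1 = runif(1e5))  # toy data
write_parquet(df, "toy.parquet")
write_feather(df, "toy.fea")
file.size(c("toy.parquet", "toy.fea"))   # parquet is typically the smaller file on disk
d = read_parquet("toy.parquet")
```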

jangorecki commented 3 years ago

@nealrichardson Thank you for your comment and the useful links. I did use parquet before. Bigger parquet files (corresponding to 50GB csv) written from Python could not be read in Spark, so in that sense they were not very portable. Portability is a plus because I can use the same file with multiple different tools. Although Spark doesn't seem to read Arrow directly, the Arrow format is meant to be a "cross platform" exchange format, so in theory Arrow is the way to go in my use case. Loading speed is not a big concern as long as it is faster than reading csv. Moreover, I have many datasets and it is easier to maintain them with a 1 dataset = 1 file structure, rather than directories like parquet. Aside from having data files to read into memory, there is another aspect: being able to mmap the data and compute out-of-memory queries. Arrow seems to fit well here. I had parquet out-of-memory computation working in Dask, but in the recent version 2.30.0 it does not complete even a single query anymore.
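
To illustrate the mmap / out-of-memory angle, here is a sketch of mine (assuming a reasonably recent arrow release): read_feather() can return an Arrow Table instead of an R data.frame, and open_dataset() can scan a feather file lazily so a query does not need the whole file in memory.

```r
library(arrow)
library(dplyr)
# Arrow Table backed by the file, without converting everything to an R data.frame
tbl = read_feather("data/G1_1e9_1e2_0_0.fea", as_data_frame = FALSE)
# lazy dataset; filter/select are applied during the scan and only the result is collected
ds = open_dataset("data/G1_1e9_1e2_0_0.fea", format = "feather")
ds %>% filter(id1 == "id001") %>% select(id1, v1, v3) %>% collect()
```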

jangorecki commented 3 years ago

There is not much to do in this issue, at least until R arrow 3.0.0 is released. For details see https://github.com/apache/arrow/issues/8732. As for the suggested alternative:

> You could also generate the data from Python (pyarrow) and I believe it would handle the chunking correctly.

Unfortunately it is not possible to read the 50GB csv data on a 128GB memory machine due to the memory required by Python, so I cannot generate arrow files with pyarrow for the 1e9 data size cases. Note that this is the second attempt to use arrow/feather files; the previous attempt (feather v1) failed due to segfaults.
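
If the blocker is the one-shot in-memory read, one possibility would be to stream the conversion instead of materializing the full table. This is a sketch under the assumption that a newer arrow release supports csv inputs in open_dataset() and feather output in write_dataset():

```r
library(arrow)
# scan the csv lazily and write it back out as feather/IPC without holding
# the whole 50GB table in memory at once
ds = open_dataset("data/G1_1e9_2e0_0_0.csv", format = "csv")
write_dataset(ds, "data/G1_1e9_2e0_0_0_fea", format = "feather")
```

Note that write_dataset() produces a directory of files rather than a single file, so it trades away the 1 dataset = 1 file property mentioned above.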