h2oai / db-benchmark

reproducible benchmark of database-like ops
https://h2oai.github.io/db-benchmark
Mozilla Public License 2.0

Use binary load for Dask and Pandas grouping tests #47

Closed · mattdowle closed 5 years ago

mattdowle commented 5 years ago

Currently the Dask and Pandas grouping tests are shown as failing at the 50GB size, where other products work. But that is only because reading the csv file fails, which has nothing to do with grouping per se. These grouping tests could instead use pickle or feather to load the dataset; the time to load the data is not included in the test anyway. Running the grouping tests would also be faster, since the csv read would no longer need to happen first. Reading data from csv is due to be added to db-bench as a separate set of tests, where the csv fail point would be fairly represented on its own.

Similarly, pydatatable could read the test data from its memory map before grouping, and data.table could read from fst. So long as the result of reading from these binary formats is exactly the same as if the data were read from csv (so no pre-computed data like indexes or similar allowed (**)), db-bench would be faster to run, and we would also get a timing for Dask and Pandas, which probably do in fact work at this size on this machine.

(**) separate tests to be added in future where pre-computed indexes and similar are allowed.

With this done, https://github.com/h2oai/db-benchmark/issues/45 could be enabled again.
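For illustration, a hypothetical minimal sketch (not the benchmark's actual harness; the file name follows the dataset naming used later in this thread, and the grouping question shown is illustrative) of what the swap would look like for the pandas test:

import pandas as pd

# instead of: x = pd.read_csv("G1_1e9_1e2_0_0.csv")  <- the step that fails at 50GB
x = pd.read_pickle("G1_1e9_1e2_0_0.pkl")  # binary load; not part of the timed grouping

# the timed grouping itself is unchanged, e.g.:
ans = x.groupby("id1")["v1"].sum()  # column names are illustrative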

st-pasha commented 5 years ago

pandas_dataframe = dt.fread(srcfile).to_pandas()

jangorecki commented 5 years ago

@st-pasha's approach is actually what we use for dplyr: using R data.table to read the data. In python it is a little more complicated, because each python solution has its own virtualenv. Will check binary formats first.

st-pasha commented 5 years ago

@jangorecki This is actually something I was wondering about. Is it necessary to have a separate virtualenv for each python library? Are there any that cannot coexist in a single virtualenv? If we could drop the requirement of a separate virtualenv per solution, it would greatly simplify the benchmarking process.

jangorecki commented 5 years ago

Agreed, but dask (and eventually other solutions) requires a particular version of pandas. Initially I used a single virtualenv for everything, but then I ran into version conflicts.

jangorecki commented 5 years ago

@st-pasha any suggestion on a binary format to use for python? There doesn't seem to be any single one that is clearly good: http://matthewrocklin.com/blog/work/2015/03/16/Fast-Serialization In R we have the built-in RDS and it works great.

st-pasha commented 5 years ago

I'm looking at Dask's setup.py, and they specify only minimum required versions for all packages, no maximums. So it should work with the most recent version of pandas. In fact, I just tried installing the most recent Dask (0.20.1), and it works fine with the most recent pandas (0.23.4). Do you have a different experience with the bleeding-edge versions of either pandas or dask?

If there are any incompatibilities, then it is something worth raising an issue about upstream.

Regarding a binary format, there is no single best solution (as is evident from the link you posted). pickle is the standard-library solution, but it works best for python-only objects; datatable uses the Jay format to save binary files (frame.save(filename, format="jay")); pandas prefers feather files (or arrow). It all depends on what your goal is.

For example, if you want to avoid reading a huge CSV file with pandas, one possibility is to say datatable.fread(file).to_pandas(). Or you can read the file beforehand with datatable, save it as Jay, then later open it with datatable and convert to pandas. Or you can read the file with datatable, convert to pandas, save as arrow, and later open the arrow file with pandas.
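A minimal sketch of those three routes, using exactly the calls named above (fread, frame.save(..., format="jay"), dt.open, plus pandas' feather reader/writer, which assumes the feather dependency is installed); file paths are illustrative:

import datatable as dt
import pandas as pd

csvfile = "G1_1e9_1e2_0_0.csv"

# route 1: read the csv with datatable, convert straight to pandas
df = dt.fread(csvfile).to_pandas()

# route 2: read once, persist as Jay, later reopen and convert
fr = dt.fread(csvfile)
fr.save("G1_1e9_1e2_0_0.jay", format="jay")
df = dt.open("G1_1e9_1e2_0_0.jay").to_pandas()

# route 3: persist from pandas as feather, later reload with pandas
df.to_feather("G1_1e9_1e2_0_0.fea")
df = pd.read_feather("G1_1e9_1e2_0_0.fea")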

jangorecki commented 5 years ago

@st-pasha OK, I found it was not dask but modin: https://github.com/h2oai/db-benchmark/commit/19cc60c5c42efd99ca9a52b922ab2a44ebe75b29. Such an incompatibility can arise at any point for any python solution. We now also use pyspark, and soon hopefully rapids cudf (which currently requires conda envs). In R there is CRAN, which prevents such issues. The goal is to not run out of memory when reading the 1e9-row files, and ideally to read them fast. So feather seems to be a good option; it will work for dask as well, and those are the two solutions currently failing to read the 1e9 data.

st-pasha commented 5 years ago

@jangorecki Ah, I see. Indeed, they specify an exact version of pandas in their setup.py. Luckily, it is the latest version (0.23.4). Or does it not work if you install the development version of pandas?

In any case, we could probably ask them to change the version to pandas>=0.23.4.
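A hypothetical excerpt of what that relaxed pin could look like in their setup.py (illustrative only, not modin's actual file):

from setuptools import setup

setup(
    name="modin",
    install_requires=[
        "pandas>=0.23.4",  # minimum bound instead of an exact pin
    ],
)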

jangorecki commented 5 years ago

Previously it required 0.22 when 0.23.2 was already out.

jangorecki commented 5 years ago

For reference, the scripts used to produce the binary formats (R/fst, python/pickle, R/feather):

library(data.table)
library(fst)
files = outer(c("1e7","1e8","1e9"), c("1e2","1e1","2e0"), paste, sep="_")
files = sprintf("G1_%s.csv", c(
  paste(files, "0", "0", sep="_"),
  paste(files[,1], "0", "1", sep="_")
))
# for python: cat(paste0("files=[",paste(paste0("'",files,"'"), collapse=","), "]"))

for (file in files) {
  cat("fread", file, "\n")
  print(system.time(df<-fread(file, stringsAsFactors=TRUE, data.table=FALSE, showProgress=FALSE)))
  ofile = gsub("csv","fst",file,fixed=TRUE)
  cat("write.fst", ofile, "\n")
  write.fst(df, ofile)
}
cat("done\n")
if (!interactive()) q("no")

import datatable as dt
import pandas as pd
import pickle
import re

files=['G1_1e7_1e2_0_0.csv','G1_1e8_1e2_0_0.csv','G1_1e9_1e2_0_0.csv','G1_1e7_1e1_0_0.csv','G1_1e8_1e1_0_0.csv','G1_1e9_1e1_0_0.csv','G1_1e7_2e0_0_0.csv','G1_1e8_2e0_0_0.csv','G1_1e9_2e0_0_0.csv','G1_1e7_1e2_0_1.csv','G1_1e8_1e2_0_1.csv','G1_1e9_1e2_0_1.csv']

for file in files:
    print("fread %s" % file)
    x = dt.fread(file).to_pandas()
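    # store the id columns as categoricals, analogous to stringsAsFactors=TRUE in the R scripts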
    x['id1'] = x['id1'].astype('category')
    x['id2'] = x['id2'].astype('category')
    x['id3'] = x['id3'].astype('category')
    ofile = re.sub("csv", "pkl", file)
    print("write %s" % ofile)
    pd.to_pickle(x, ofile)

print("done")
library(data.table)
library(feather)

files = outer(c("1e7","1e8"), c("1e2","1e1","2e0"), paste, sep="_") # "1e9" is on another machine
#files = outer(c("1e9"), c("1e2","1e1","2e0"), paste, sep="_")
files = sprintf("G1_%s.csv", c(
  paste(files, "0", "0", sep="_"),
  paste(files[,1], "0", "1", sep="_")
))

for (file in files) {
  cat("fread", file, "\n")
  print(system.time(df<-fread(file, stringsAsFactors=TRUE, data.table=FALSE, showProgress=FALSE)))
  ofile = gsub("csv","fea",file,fixed=TRUE)
  cat("write_feather", ofile, "\n")
  write_feather(df, ofile)
}
cat("done\n")
jangorecki commented 5 years ago

Surprisingly, there is no direct API for loading feather/arrow in most of the solutions. It has now been added for data.table, dplyr and pandas. Other tools will follow; the status for each can be looked up in https://github.com/h2oai/db-benchmark/blob/9d59592688a62aecedf4f2b695cd552747e6d131/launcher.R#L58-L65. Closing this issue so it does not sit stale, blocked by other projects. What could be accomplished now was done in https://github.com/h2oai/db-benchmark/commit/2b4d3bd8b91868d1fd66d674c6c13ee20fa39322.

jangorecki commented 5 years ago

For both dplyr and data.table, where the feather R package is used, I am getting Error: C stack usage 7971012 is too close to the limit. Will try the fst package instead: https://github.com/h2oai/db-benchmark/commit/324724bf135a6322dacadbde599b0135a17be5db. Feather still works for pandas in python; not sure yet whether it does for 1e9 too.

mattdowle commented 5 years ago

Frustrating! Aside: I don't see open issues for reading from binary for the other tools. OK to close this one, but then others need to be opened for those, right?

jangorecki commented 5 years ago

Feather doesn't seem to be a good idea... in R it doesn't even work for 1e7-row data, and in python it segfaults on 1e9:

tmp = feather.read_dataframe('data/G1_1e9_1e2_0_0.fea')
#Segmentation fault (core dumped)

Reading feather into pandas directly used to work, but it was broken by a dependency update: https://github.com/pandas-dev/pandas/issues/23053

Re-opening this issue; I will close it when no further work is planned. For now R will use fst, and python is TODO.

st-pasha commented 5 years ago

For pandas you can use

import datatable as dt
pandas_frame = dt.open("data/G1_1e9_1e2_0_0.jay").to_pandas()

If virtual environments need to be kept separate, you can always pip-install the latest stable datatable version (0.7.0) into the pandas env.

jangorecki commented 5 years ago

@st-pasha yes, this is what I will have to do. But the plan was to use a format I could re-use in other tools without going through pandas. So now loading data for dask will be datatable → pandas → dask.

jangorecki commented 5 years ago

@st-pasha unfortunately datatable will not help much

>>> x = dt.open(os.path.join("data", src_grp)).to_pandas() # src_grp="G1_1e9_1e2_0_0.jay"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/datatable/frame.py", line 450, in to_pandas
    x = srcdt.window(0, self.nrows, i, i + 1).data[0]
MemoryError

So the next option is to try pickle...

st-pasha commented 5 years ago

@jangorecki pickle won't help: the column simply doesn't fit into available memory (at least when it is represented as pyobjects, which is what pandas uses for string columns).
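A rough back-of-the-envelope check of why (per-object sizes are CPython implementation details and vary by version, so the numbers are approximate; it also assumes one distinct str object per row, the worst case):

import sys

# each element of an object-dtype column is a full Python str object,
# plus an 8-byte pointer in the underlying numpy array
per_str = sys.getsizeof("id123")            # ~54 bytes for a short str on CPython 3.6
approx_gib = 10**9 * (per_str + 8) / 2**30
print("~%.0f GiB for one 1e9-row string column" % approx_gib)  # roughly 58 GiB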

jangorecki commented 5 years ago

It appears that binary formats will not help with the memory errors of pandas and dask on the 50GB input data. Pandas now uses the Jay binary format from py datatable. Dask still uses csv; we could go jay → pandas → dask, but importing from pandas into dask requires providing the number of partitions (unlike when reading from csv), and a good value is, I believe, data dependent. I would prefer to stay away from that kind of data investigation, so dask stays as it is. Closing the issue as there are no actions defined any more. Feel free to re-open and provide scenarios to be checked.
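For reference, a minimal sketch of what that jay → pandas → dask route would look like (using the dt.open API mentioned earlier; the npartitions value is the arbitrary, data-dependent choice referred to above):

import datatable as dt
import dask.dataframe as dd

x = dt.open("data/G1_1e9_1e2_0_0.jay").to_pandas()
# unlike dd.read_csv, from_pandas requires an explicit partition count,
# and a good value depends on the data and the machine
dx = dd.from_pandas(x, npartitions=128)  # 128 is illustrative only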