h2oai / db-benchmark

reproducible benchmark of database-like ops
https://h2oai.github.io/db-benchmark
Mozilla Public License 2.0

polars solution #178

Closed ritchie46 closed 3 years ago

ritchie46 commented 3 years ago

See #163.

I'm not sure whether this is enough to make it work; let me know what needs to change. I have tested both the groupby and join solutions on the 0.5 GB and 5 GB data.

jangorecki commented 3 years ago

Thanks Ritchie, it looks great. I see multiple to_numpy() calls, which I assume add some overhead. Is it necessary to use Python for Polars? Maybe we could use Rust directly and avoid the overhead of the Python layer on top of it. How does Polars handle missing values (both in grouping columns and aggregated measures)?

ritchie46 commented 3 years ago

The to_numpy() isn't expensive in this context. It transforms a single-row DataFrame to a numpy array.
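
For context, the pattern under discussion looks roughly like this (a minimal sketch using the modern polars package name rather than the pypolars of this thread; the frame contents are made up for illustration):

import polars as pl

# toy stand-in for a benchmark answer table
ans = pl.DataFrame({"v1": [1.0, 2.0, 3.0], "v2": [4.0, 5.0, 6.0]})

# summing collapses the frame to a single row; converting that one row
# to numpy copies only a handful of scalars, so the cost is negligible
chk = ans.sum().to_numpy()
print(chk)  # e.g. [[ 6. 15.]]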

Perhaps I could replace it later with a Rust version? For now, the Python wrappers allow for easier prototyping (and I can reuse the existing logging logic available in your repo), but they will have some overhead indeed.

How does Polars handle missing values (both in grouping columns and aggregated measures)?

The missing values are stored in a separate bitmask array next to the values array.

In the grouping operation they should be ignored, and the same goes for the aggregation context (unless the number of nulls per group is queried, of course). Is that what you meant?
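
As an aside, the layout can be sketched with plain numpy (toy arrays for illustration; the real validity bitmask is bit-packed Arrow-style, not a boolean array):

import numpy as np

values = np.array([1.0, 2.0, 0.0, 4.0])      # dense values buffer
valid = np.array([True, True, False, True])  # validity mask: third slot is null

# aggregations consult the mask instead of a sentinel value,
# so the 0.0 placeholder never contributes to the result
total = values[valid].sum()  # 7.0
count = int(valid.sum())     # 3 non-null values
print(total, count)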

jangorecki commented 3 years ago

The to_numpy() isn't expensive in this context. It transforms a single-row DataFrame to a numpy array.

Oh, right, it is only used for chk, not the actual queries.

In the grouping operation they should be ignored, and the same goes for the aggregation context (unless the number of nulls per group is queried, of course). Is that what you meant?

That sounds like the general Python style of handling it. What about a missing value in a column that we are grouping on? Is that unknown group preserved or removed? This is actually the problematic part, at least in pandas and dask.
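
(For reference, a minimal sketch of the pandas behavior in question, assuming pandas >= 1.1, where groupby gained the dropna flag:)

import numpy as np
import pandas as pd

df = pd.DataFrame({"g": ["a", "a", np.nan], "v": [1, 2, 3]})

# default: the NaN key silently disappears from the result
print(df.groupby("g")["v"].sum())                # only group "a"

# dropna=False preserves the unknown group
print(df.groupby("g", dropna=False)["v"].sum())  # groups "a" and NaN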

Perhaps I could replace it later with a Rust version? For now, the Python wrappers allow for easier prototyping (and I can reuse the existing logging logic available in your repo), but they will have some overhead indeed.

Definitely agree that Python is easier for prototyping. All existing solutions can run interactively, and a solution that requires compilation introduces the challenge of integrating it well. I think over time there will be other solutions that require compilation, so resolving this now will be beneficial for the future. Did you have the opportunity to compare py-polars timings with any other solution on your machine? Could you check whether py-polars vs. Rust Polars makes a noticeable difference?

As for the design of the benchmark script: it could be Rust source code with a main that runs the queries, compiled and then run to produce output similar to the other existing scripts.

ritchie46 commented 3 years ago

That sounds like the general Python style of handling it. What about a missing value in a column that we are grouping on? Is that unknown group preserved or removed? This is actually the problematic part, at least in pandas and dask.

To be honest, I haven't put much thought into that behavior yet. I agree with what you are saying, and I will make missing data groupable in the next release. EDIT: Sorry, I was wrong. The current behavior does treat missing values as groups!
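
Today this is easy to verify directly (a sketch against the modern polars group_by API, not the pypolars spelling used elsewhere in this thread):

import polars as pl

df = pl.DataFrame({"g": ["a", "a", None], "v": [1, 2, 3]})

# the null key is kept as a group of its own rather than being dropped
print(df.group_by("g").agg(pl.col("v").sum()))
# shape: (2, 2) -- one row for "a", one for the null group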

Could you check whether py-polars vs. Rust Polars makes a noticeable difference?

I didn't see much difference in local benchmarks, as the Python bindings are merely thin wrappers. Maybe in some cases LLVM can optimize more?
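
(The usual intuition, illustrated here with numpy rather than Polars: the per-call wrapper overhead is a fixed cost, while the O(n) work runs in compiled code, so the wrapper disappears in the noise for large inputs:)

import timeit
import numpy as np

small = np.random.rand(10)
large = np.random.rand(10_000_000)

# identical Python-call overhead in both cases; only the compiled work grows
print(timeit.timeit(lambda: small.sum(), number=100))  # dominated by call overhead
print(timeit.timeit(lambda: large.sum(), number=100))  # dominated by the actual sum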

Did you have the opportunity to compare py-polars timings with any other solution on your machine?

Yes, I only have the groupby logs at the moment. These were run on a GCE n1-highmem-8 (8 vCPUs, 52 GB memory). https://gist.github.com/ritchie46/4366135d5bdc61bdf69307001a249a99

jangorecki commented 3 years ago

Thanks, merging for now, but it will take a few more follow-up commits to get it into the production pipeline.

jangorecki commented 3 years ago

@ritchie46 I am getting the following warnings when importing pypolars:

Python 3.6.12 (default, Aug 18 2020, 02:08:22) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pypolars as pl
/home/jan/git/db-benchmark/polars/py-polars/lib/python3.6/site-packages/pypolars/series.py:6: UserWarning: binary files missing
  warnings.warn("binary files missing")
/home/jan/git/db-benchmark/polars/py-polars/lib/python3.6/site-packages/pypolars/frame.py:12: UserWarning: binary files missing
  warnings.warn("binary files missing")
/home/jan/git/db-benchmark/polars/py-polars/lib/python3.6/site-packages/pypolars/lazy/__init__.py:27: UserWarning: binary files missing
  warnings.warn("binary files missing")

Any idea what might have gone wrong?

Installation output:

Collecting psutil
  Downloading psutil-5.8.0-cp36-cp36m-manylinux2010_x86_64.whl (291 kB)
     |████████████████████████████████| 291 kB 7.1 MB/s 
Collecting py-polars
  Downloading py_polars-0.4.1-cp36-cp36m-manylinux1_x86_64.whl (6.6 MB)
     |████████████████████████████████| 6.6 MB 11.5 MB/s 
Collecting numpy
  Using cached numpy-1.19.5-cp36-cp36m-manylinux2010_x86_64.whl (14.8 MB)
Installing collected packages: numpy, py-polars, psutil
Successfully installed numpy-1.19.5 psutil-5.8.0 py-polars-0.4.1

ritchie46 commented 3 years ago

Hmm, very strange. Those warnings are there to indicate that the compiled binary is missing from the wheel. Could you tell me a bit more about your environment?

I tried to reproduce it in conda on Ubuntu 18.04:

(base) ritchie46:~/code/polars:  (bytesIO_parquet)$ conda create -n py36 python=3.6
/opt/miniconda3/lib/python3.7/site-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.26.2) or chardet (3.0.4) doesn't match a supported version!
  RequestsDependencyWarning)
Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /opt/miniconda3/envs/py36

  added / updated specs:
    - python=3.6

The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    certifi-2020.12.5          |   py36h5fab9bb_1         143 KB  conda-forge
    python-3.6.12              |hffdb5ce_0_cpython        38.4 MB  conda-forge
    setuptools-49.6.0          |   py36h5fab9bb_3         936 KB  conda-forge
    ------------------------------------------------------------
                                           Total:        39.4 MB

The following NEW packages will be INSTALLED:

  _libgcc_mutex      conda-forge/linux-64::_libgcc_mutex-0.1-conda_forge
  _openmp_mutex      conda-forge/linux-64::_openmp_mutex-4.5-1_gnu
  ca-certificates    conda-forge/linux-64::ca-certificates-2020.12.5-ha878542_0
  certifi            conda-forge/linux-64::certifi-2020.12.5-py36h5fab9bb_1
  ld_impl_linux-64   conda-forge/linux-64::ld_impl_linux-64-2.35.1-hea4e1c9_1
  libffi             conda-forge/linux-64::libffi-3.3-h58526e2_2
  libgcc-ng          conda-forge/linux-64::libgcc-ng-9.3.0-h5dbcf3e_17
  libgomp            conda-forge/linux-64::libgomp-9.3.0-h5dbcf3e_17
  libstdcxx-ng       conda-forge/linux-64::libstdcxx-ng-9.3.0-h2ae2ef3_17
  ncurses            conda-forge/linux-64::ncurses-6.2-h58526e2_4
  openssl            conda-forge/linux-64::openssl-1.1.1i-h7f98852_0
  pip                conda-forge/noarch::pip-20.3.3-pyhd8ed1ab_0
  python             conda-forge/linux-64::python-3.6.12-hffdb5ce_0_cpython
  python_abi         conda-forge/linux-64::python_abi-3.6-1_cp36m
  readline           conda-forge/linux-64::readline-8.0-he28a2e2_2
  setuptools         conda-forge/linux-64::setuptools-49.6.0-py36h5fab9bb_3
  sqlite             conda-forge/linux-64::sqlite-3.34.0-h74cdb3f_0
  tk                 conda-forge/linux-64::tk-8.6.10-h21135ba_1
  wheel              conda-forge/noarch::wheel-0.36.2-pyhd3deb0d_0
  xz                 conda-forge/linux-64::xz-5.2.5-h516909a_1
  zlib               conda-forge/linux-64::zlib-1.2.11-h516909a_1010

Proceed ([y]/n)? y

Downloading and Extracting Packages
python-3.6.12        | 38.4 MB   | ################################################################# | 100% 
certifi-2020.12.5    | 143 KB    | ################################################################# | 100% 
setuptools-49.6.0    | 936 KB    | ################################################################# | 100% 
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
#
# To activate this environment, use
#
#     $ conda activate py36
#
# To deactivate an active environment, use
#
#     $ conda deactivate

(base) ritchie46:~/code/polars:  (bytesIO_parquet)$ conda activate py36
(py36) ritchie46:~/code/polars:  (bytesIO_parquet)$ pip install py-polars
Collecting py-polars
  Downloading py_polars-0.4.1-cp36-cp36m-manylinux1_x86_64.whl (6.6 MB)
     |████████████████████████████████| 6.6 MB 5.8 MB/s 
Collecting numpy
  Downloading numpy-1.19.5-cp36-cp36m-manylinux2010_x86_64.whl (14.8 MB)
     |████████████████████████████████| 14.8 MB 5.9 MB/s 
Installing collected packages: numpy, py-polars
Successfully installed numpy-1.19.5 py-polars-0.4.1
(py36) ritchie46:~/code/polars:  (bytesIO_parquet)$ python
Python 3.6.12 | packaged by conda-forge | (default, Dec  9 2020, 00:36:02) 
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pypolars as pl
>>> 

Thanks, merging for now, but it will take a few more follow-up commits to get it into the production pipeline.

Great! Let me know what I can do. :+1:

jangorecki commented 3 years ago

The problem occurs on Ubuntu 16.04 but not on Ubuntu 18.04.

docker run -it --rm ubuntu:16.04 /bin/bash
apt-get install software-properties-common
add-apt-repository ppa:deadsnakes/ppa 
apt-get update
apt-get install python3.6-dev virtualenv

virtualenv polars/py-polars --python=/usr/bin/python3.6
source polars/py-polars/bin/activate

python -m pip install --upgrade psutil py-polars
Collecting psutil
  Downloading psutil-5.8.0-cp36-cp36m-manylinux2010_x86_64.whl (291 kB)
     |################################| 291 kB 6.5 MB/s 
Collecting py-polars
  Downloading py_polars-0.4.1-cp36-cp36m-manylinux1_x86_64.whl (6.6 MB)
     |################################| 6.6 MB 8.3 MB/s 
Collecting numpy
  Downloading numpy-1.19.5-cp36-cp36m-manylinux2010_x86_64.whl (14.8 MB)
     |################################| 14.8 MB 28.6 MB/s 
Installing collected packages: numpy, py-polars, psutil
Successfully installed numpy-1.19.5 psutil-5.8.0 py-polars-0.4.1
(py-polars) root@18dd3226dd1b:/# python 
Python 3.6.12 (default, Aug 18 2020, 02:08:22) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pypolars as pl
/polars/py-polars/lib/python3.6/site-packages/pypolars/series.py:6: UserWarning: binary files missing
  warnings.warn("binary files missing")
/polars/py-polars/lib/python3.6/site-packages/pypolars/frame.py:12: UserWarning: binary files missing
  warnings.warn("binary files missing")
/polars/py-polars/lib/python3.6/site-packages/pypolars/lazy/__init__.py:27: UserWarning: binary files missing
  warnings.warn("binary files missing")

Is there a way to make Polars support 16.04?

ritchie46 commented 3 years ago

Is there a way to make Polars support 16.04?

Thanks for pointing this out. It turns out I need to use the manylinux Docker image to build for more Linux distros; I will fix this ASAP.

EDIT: I believe this is fixed now.

jangorecki commented 3 years ago

Polars is on the report already. It has very competitive timings, congratulations @ritchie46!

Be sure to check https://h2oai.github.io/db-benchmark/#explore-more-data-cases at the bottom of the report. More benchmark plots are linked there, covering different cardinalities of the id columns, missing values, and pre-sorted data.

We also have https://h2oai.github.io/db-benchmark/history.html for tracking performance regressions, but it is an internal-only report, not really meant to be published, because many factors that affect timings cannot be reliably presented. Recent examples:

Other factors could be the kernel version, compiler version, etc. All of these may change over time, and we do not track which versions were in place when query timings were recorded. So for public consumption only the "latest" timings are meant to be presented on the main report, but for developers it can be very useful to look at the history plots when they suspect a performance regression.

Thanks for joining this project!

ritchie46 commented 3 years ago

Polars is on the report already. It has very competitive timings, congratulations @ritchie46!

Very nice! Better than I expected. :smile:

I will first solve the 50 GB segfault issue and make the join algorithm parallel. After that I will come back to you about the Rust-native solution.