Closed ritchie46 closed 3 years ago
Thanks Ritchie, it looks great.
I see multiple to_numpy()
calls which, I assume, add some overhead. Is it necessary to use Python for Polars? Maybe we could use Rust directly and avoid all the overhead related to the Python layer on top of it?
How does Polars handle missing values? (both in grouping columns and aggregated measures)
The to_numpy()
isn't expensive in this context. It transforms a single-row DataFrame to a numpy array.
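To illustrate why that conversion is cheap: the chk value comes from a full-table aggregation, so the frame being converted has only one row. A minimal pure-Python sketch of the idea; `aggregate` and `to_row_array` are hypothetical stand-ins, not the actual pypolars API:

```python
# Converting the aggregated result is O(k) in the number of columns,
# independent of the n input rows, because aggregation already
# collapsed the table to a single row.
# `aggregate` / `to_row_array` stand in for DataFrame.sum() / to_numpy().

def aggregate(columns):
    """Collapse each column (a list of n values) into one value."""
    return [sum(col) for col in columns]

def to_row_array(row):
    """Copy out the single aggregated row; touches k values, not n."""
    return list(row)

n = 1_000_000
table = [[1] * n, [2] * n]  # two columns, one million rows each
chk = to_row_array(aggregate(table))
print(chk)  # -> [1000000, 2000000]
```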
Perhaps I could replace it later with a Rust version? For now, the Python wrappers allow for easier prototyping (and I can reuse the existing logging logic available in your repo), but they will have some overhead indeed.
How does Polars handle missing values? (both in grouping columns and aggregated measures)
The missing values are stored in a separate bitmask array next to the values array.
In the grouping operation they should be ignored and the same counts for the aggregation context (unless the number of nulls per group is queried of course). Is that what you meant?
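The layout described above can be sketched in plain Python. This only illustrates the Arrow-style validity-bitmask idea (values in one buffer, one validity bit per value), not the actual Polars internals:

```python
# Sketch of Arrow-style null handling: values live in one buffer,
# validity in a separate bitmask (1 bit per value, 1 = valid).
# Names and layout here are illustrative, not the real implementation.

def null_aware_sum(values, validity):
    """Sum only the entries whose validity bit is set."""
    total = 0
    for i, v in enumerate(values):
        # bit i of the bitmask lives in byte i // 8, position i % 8
        if validity[i // 8] >> (i % 8) & 1:
            total += v
    return total

values = [3, 7, 999, 2]          # slot 2 holds garbage; it is masked out
validity = bytearray([0b1011])   # bits 0, 1, 3 set -> index 2 is null

print(null_aware_sum(values, validity))  # -> 12
```

Aggregations can then skip masked slots without ever inspecting the garbage values themselves.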
The to_numpy() isn't expensive in this context. It transforms a single-row DataFrame to a numpy array.
Oh, right, it is only used for chk, not the actual queries.
In the grouping operation they should be ignored and the same counts for the aggregation context (unless the number of nulls per group is queried of course). Is that what you meant?
That sounds like the general Python style of handling it. What about a missing value in a column that we are grouping on? Is that unknown group preserved or removed? This is actually the problematic part, at least in pandas and dask.
Perhaps I could replace it later with a Rust version? For now, the Python wrappers allow for easier prototyping (and I can reuse the existing logging logic available in your repo), but they will have some overhead indeed.
I definitely agree that Python is easier for prototyping. All existing solutions can run interactively, and a solution that requires compilation introduces a challenge in how to integrate it well. I think over time there will be other solutions that require compilation, so resolving this now will be beneficial for the future. Did you have the opportunity to compare py-polars timings with any other solution on your machine? Could you check if py-polars vs Rust Polars makes a noticeable difference?
As for the design of the benchmark script: I think it could be Rust source code, with a main that runs the queries. That would be compiled and run to produce output similar to the other existing scripts.
That sounds like the general Python style of handling it. What about a missing value in a column that we are grouping on? Is that unknown group preserved or removed? This is actually the problematic part, at least in pandas and dask.
Tbh I haven't put much thought into that behavior yet. I agree with what you are saying, and I will make missing data groupable next release. EDIT: Sorry, I was wrong. The current behavior sees missing values as groups!
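The corrected behavior (missing values forming their own group rather than being dropped) can be sketched in plain Python; `group_sums` is a hypothetical helper, not a Polars API:

```python
from collections import defaultdict

def group_sums(keys, values):
    """Group `values` by `keys`; None keys form their own group,
    mirroring the behavior described above, where missing values
    are groupable rather than silently dropped (as pandas does)."""
    sums = defaultdict(int)
    for k, v in zip(keys, values):
        sums[k] += v  # None is a perfectly valid dict key
    return dict(sums)

keys = ["a", None, "a", None, "b"]
vals = [1, 2, 3, 4, 5]
print(group_sums(keys, vals))  # -> {'a': 4, None: 6, 'b': 5}
```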
Could you check if py-polars vs rust Polars makes a noticeable difference?
I didn't experience much difference in local benchmarks as the python wrappers are merely wrappers. Maybe in some cases LLVM can optimize more?
Did you have the opportunity to compare py-polars timings with any other solution on your machine?
Yes, I only have the groupby logs at the moment. These were run on a GCE n1-highmem-8 (8 vCPUs, 52 GB memory). https://gist.github.com/ritchie46/4366135d5bdc61bdf69307001a249a99
Thanks, merging for now, but it will take a few more follow-up commits to get it into the production pipeline.
@ritchie46 I am getting the following errors when importing pypolars:
Python 3.6.12 (default, Aug 18 2020, 02:08:22)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pypolars as pl
/home/jan/git/db-benchmark/polars/py-polars/lib/python3.6/site-packages/pypolars/series.py:6: UserWarning: binary files missing
warnings.warn("binary files missing")
/home/jan/git/db-benchmark/polars/py-polars/lib/python3.6/site-packages/pypolars/frame.py:12: UserWarning: binary files missing
warnings.warn("binary files missing")
/home/jan/git/db-benchmark/polars/py-polars/lib/python3.6/site-packages/pypolars/lazy/__init__.py:27: UserWarning: binary files missing
warnings.warn("binary files missing")
Any idea what might have gone wrong?
installation output
Collecting psutil
Downloading psutil-5.8.0-cp36-cp36m-manylinux2010_x86_64.whl (291 kB)
|████████████████████████████████| 291 kB 7.1 MB/s
Collecting py-polars
Downloading py_polars-0.4.1-cp36-cp36m-manylinux1_x86_64.whl (6.6 MB)
|████████████████████████████████| 6.6 MB 11.5 MB/s
Collecting numpy
Using cached numpy-1.19.5-cp36-cp36m-manylinux2010_x86_64.whl (14.8 MB)
Installing collected packages: numpy, py-polars, psutil
Successfully installed numpy-1.19.5 psutil-5.8.0 py-polars-0.4.1
Hmm, very strange. I added those warnings to indicate that the compiled binary is missing from the wheel. Could you tell me a bit more about your environment?
I tried to reproduce in conda on ubuntu 18.04:
(base) ritchie46:~/code/polars: (bytesIO_parquet)$ conda create -n py36 python=3.6
/opt/miniconda3/lib/python3.7/site-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.26.2) or chardet (3.0.4) doesn't match a supported version!
RequestsDependencyWarning)
Collecting package metadata (current_repodata.json): done
Solving environment: done
## Package Plan ##
environment location: /opt/miniconda3/envs/py36
added / updated specs:
- python=3.6
The following packages will be downloaded:
package | build
---------------------------|-----------------
certifi-2020.12.5 | py36h5fab9bb_1 143 KB conda-forge
python-3.6.12 |hffdb5ce_0_cpython 38.4 MB conda-forge
setuptools-49.6.0 | py36h5fab9bb_3 936 KB conda-forge
------------------------------------------------------------
Total: 39.4 MB
The following NEW packages will be INSTALLED:
_libgcc_mutex conda-forge/linux-64::_libgcc_mutex-0.1-conda_forge
_openmp_mutex conda-forge/linux-64::_openmp_mutex-4.5-1_gnu
ca-certificates conda-forge/linux-64::ca-certificates-2020.12.5-ha878542_0
certifi conda-forge/linux-64::certifi-2020.12.5-py36h5fab9bb_1
ld_impl_linux-64 conda-forge/linux-64::ld_impl_linux-64-2.35.1-hea4e1c9_1
libffi conda-forge/linux-64::libffi-3.3-h58526e2_2
libgcc-ng conda-forge/linux-64::libgcc-ng-9.3.0-h5dbcf3e_17
libgomp conda-forge/linux-64::libgomp-9.3.0-h5dbcf3e_17
libstdcxx-ng conda-forge/linux-64::libstdcxx-ng-9.3.0-h2ae2ef3_17
ncurses conda-forge/linux-64::ncurses-6.2-h58526e2_4
openssl conda-forge/linux-64::openssl-1.1.1i-h7f98852_0
pip conda-forge/noarch::pip-20.3.3-pyhd8ed1ab_0
python conda-forge/linux-64::python-3.6.12-hffdb5ce_0_cpython
python_abi conda-forge/linux-64::python_abi-3.6-1_cp36m
readline conda-forge/linux-64::readline-8.0-he28a2e2_2
setuptools conda-forge/linux-64::setuptools-49.6.0-py36h5fab9bb_3
sqlite conda-forge/linux-64::sqlite-3.34.0-h74cdb3f_0
tk conda-forge/linux-64::tk-8.6.10-h21135ba_1
wheel conda-forge/noarch::wheel-0.36.2-pyhd3deb0d_0
xz conda-forge/linux-64::xz-5.2.5-h516909a_1
zlib conda-forge/linux-64::zlib-1.2.11-h516909a_1010
Proceed ([y]/n)? y
Downloading and Extracting Packages
python-3.6.12 | 38.4 MB | ################################################################# | 100%
certifi-2020.12.5 | 143 KB | ################################################################# | 100%
setuptools-49.6.0 | 936 KB | ################################################################# | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
#
# To activate this environment, use
#
# $ conda activate py36
#
# To deactivate an active environment, use
#
# $ conda deactivate
(base) ritchie46:~/code/polars: (bytesIO_parquet)$ conda activate py36
(py36) ritchie46:~/code/polars: (bytesIO_parquet)$ pip install py-polars
Collecting py-polars
Downloading py_polars-0.4.1-cp36-cp36m-manylinux1_x86_64.whl (6.6 MB)
|████████████████████████████████| 6.6 MB 5.8 MB/s
Collecting numpy
Downloading numpy-1.19.5-cp36-cp36m-manylinux2010_x86_64.whl (14.8 MB)
|████████████████████████████████| 14.8 MB 5.9 MB/s
Installing collected packages: numpy, py-polars
Successfully installed numpy-1.19.5 py-polars-0.4.1
(py36) ritchie46:~/code/polars: (bytesIO_parquet)$ python
Python 3.6.12 | packaged by conda-forge | (default, Dec 9 2020, 00:36:02)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pypolars as pl
>>>
Thanks, merging for now, but it will take a few more follow-up commits to get it into the production pipeline.
Great! Let me know what I can do. :+1:
Problem occurs on Ubuntu 16.04 but not on Ubuntu 18.04.
docker run -it --rm ubuntu:16.04 /bin/bash
apt-get install software-properties-common
add-apt-repository ppa:deadsnakes/ppa
apt-get update
apt-get install python3.6-dev virtualenv
virtualenv polars/py-polars --python=/usr/bin/python3.6
source polars/py-polars/bin/activate
python -m pip install --upgrade psutil py-polars
Collecting psutil
Downloading psutil-5.8.0-cp36-cp36m-manylinux2010_x86_64.whl (291 kB)
|################################| 291 kB 6.5 MB/s
Collecting py-polars
Downloading py_polars-0.4.1-cp36-cp36m-manylinux1_x86_64.whl (6.6 MB)
|################################| 6.6 MB 8.3 MB/s
Collecting numpy
Downloading numpy-1.19.5-cp36-cp36m-manylinux2010_x86_64.whl (14.8 MB)
|################################| 14.8 MB 28.6 MB/s
Installing collected packages: numpy, py-polars, psutil
Successfully installed numpy-1.19.5 psutil-5.8.0 py-polars-0.4.1
(py-polars) root@18dd3226dd1b:/# python
Python 3.6.12 (default, Aug 18 2020, 02:08:22)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pypolars as pl
/polars/py-polars/lib/python3.6/site-packages/pypolars/series.py:6: UserWarning: binary files missing
warnings.warn("binary files missing")
/polars/py-polars/lib/python3.6/site-packages/pypolars/frame.py:12: UserWarning: binary files missing
warnings.warn("binary files missing")
/polars/py-polars/lib/python3.6/site-packages/pypolars/lazy/__init__.py:27: UserWarning: binary files missing
warnings.warn("binary files missing")
Is there a way to make Polars support 16.04?
Is there a way to make Polars support 16.04?
Thanks for pointing this out. It turns out I need to use the manylinux Docker image to build for more Linux distros; will fix this ASAP.
EDIT: I believe this is fixed now.
Polars is on the report already. It has very competitive timings, congratulations @ritchie46!
Be sure to check https://h2oai.github.io/db-benchmark/#explore-more-data-cases at the bottom of the report. There are more benchmark plots linked there, having different cardinality of id columns, missing values and being pre-sorted.
We also have https://h2oai.github.io/db-benchmark/history.html for tracking performance regressions, but it is an internal-only report, not really meant to be published. That is because many factors affect the timings and cannot be reliably presented. Recent examples
Other factors could be the kernel version, compiler version, etc. All of these might change over time, and we do not track which versions were in place when the query timings were recorded. So for public consumption only the "latest" timings are meant to be presented on the main report, but for developers it can be very useful to look at the history plots when they suspect a performance regression.
Thanks for joining this project!
Polars is on the report already. It has very competitive timings, congratulations @ritchie46!
Very nice! Better than I expected. :smile:
I will first solve the 50GB segfault issues and make the join algorithm parallel. After that I will come back to you about the Rust-native solution.
See #163.
I don't know if this is enough to make it work? Let me know what needs to change if anything is needed. I have tested both the groupby and join solutions on 0.5GB and 5GB.