h2oai / db-benchmark

reproducible benchmark of database-like ops
https://h2oai.github.io/db-benchmark
Mozilla Public License 2.0

Polars #179

Closed ritchie46 closed 3 years ago

ritchie46 commented 3 years ago

Change some syntax to be more consistent with earlier queries.

Also fix a bug in the join script. When casting to categoricals, a temporary global string cache keeps the categories consistent. This cache exists until the context manager is closed.

jangorecki commented 3 years ago

Thanks!

jangorecki commented 3 years ago

@ritchie46 With the recent update to 0.5.5, the first grouping query fails with:

thread '<unnamed>' panicked at 'called `Option::unwrap()` on a `None` value', polars/polars-core/src/chunked_array/mod.rs:422:47
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "./polars/groupby-polars.py", line 41, in <module>
    ans = x.groupby("id1").agg(pl.sum("v1")).collect()
  File "/home/jan/git/db-benchmark/polars/py-polars/lib/python3.6/site-packages/pypolars/lazy/__init__.py", line 213, in collect
    return wrap_df(ldf.collect())
pyo3_runtime.PanicException: called `Option::unwrap()` on a `None` value

This happens on all data sizes.

ritchie46 commented 3 years ago

You just got ahead of me. This issue is fixed in the new https://pypi.org/project/py-polars/0.6.0 release. I can confirm that the groupby runs locally on the 0.5 GB set. This release also fixes #188.