h2oai / db-benchmark

reproducible benchmark of database-like ops
https://h2oai.github.io/db-benchmark
Mozilla Public License 2.0

py-polars was renamed to polars; continue py-polars as polars #192

Closed · ritchie46 closed this 3 years ago

ritchie46 commented 3 years ago

We renamed py-polars to polars. This PR points to the new PyPI registry; the old one won't be updated.

ritchie46 commented 3 years ago

FYI, the old code will still work for the time being, as we will proxy to the new package name.

jangorecki commented 3 years ago

I will hopefully merge this week.

jangorecki commented 3 years ago

@ritchie46 all groupby scripts fail at question 2 with

Traceback (most recent call last):
  File "./polars/groupby-polars.py", line 60, in <module>
    print(ans.head(3), flush=True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 14-33: ordinal not in range(128)
ritchie46 commented 3 years ago

> @ritchie46 all groupby scripts fail at question 2 with
>
> Traceback (most recent call last):
>   File "./polars/groupby-polars.py", line 60, in <module>
>     print(ans.head(3), flush=True)
> UnicodeEncodeError: 'ascii' codec can't encode characters in position 14-33: ordinal not in range(128)

That's strange.

I tested locally on G1_1e7_1e2_5_0 and G1_1e7_1e2_5_0 with polars==0.7.9 and cannot reproduce this error.

Thinking aloud here: it seems the terminal does not support unicode characters. The library I use for printing tables uses unicode characters. Maybe I should consider an ASCII-compliant table format :thinking:.
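For reference, the failure mode is easy to reproduce without a terminal; a minimal sketch (not from the benchmark scripts) that writes unicode table characters to an ASCII-encoded stream:

import io

# Writing unicode box-drawing characters to an ASCII-encoded text stream
# raises the same UnicodeEncodeError as the benchmark run above.
stream = io.TextIOWrapper(io.BytesIO(), encoding="ascii")
try:
    print("╭─────╮", file=stream)
except UnicodeEncodeError as exc:
    print(exc)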

Does setting the locale at the start of the script make any difference?

import os
import locale

# Note: PYTHONIOENCODING is read at interpreter startup, so setting it here
# only affects child processes; the locale call below applies to this process.
os.environ["PYTHONIOENCODING"] = "utf-8"
my_locale = locale.setlocale(category=locale.LC_ALL, locale="en_GB.UTF-8")
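An alternative sketch, assuming Python 3.7+, is to reconfigure the already-created streams instead of relying on the locale:

import sys

# Re-encode stdout/stderr as UTF-8 so unicode table characters survive
# even when the terminal locale defaults to ASCII (Python 3.7+ only).
sys.stdout.reconfigure(encoding="utf-8")
sys.stderr.reconfigure(encoding="utf-8")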
jangorecki commented 3 years ago

You are right. I forgot that I had to "patch" the Python environment before for the py-polars package. Now I reinstalled it as polars into a new env where I haven't applied the patch. My patch was dirtier than the fix you are proposing: https://github.com/ritchie46/db-benchmark/blob/4a7618f962621984eb657406a603d3787ea0dc12/polars/setup-polars.sh#L25-L34 I will try it, and if it works there is no need to patch the environment anymore. Thanks

jangorecki commented 3 years ago

Your suggestion didn't work, so I hacked polars/py-polars/bin/activate again and it works now.
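For the record, a quick sketch to check from inside Python that the patched environment took effect (assuming the patch exports a UTF-8 locale):

import sys
import locale

# Both should report a UTF-8 encoding once the activate hack is in place.
print(sys.stdout.encoding)
print(locale.getpreferredencoding())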

ritchie46 commented 3 years ago

> Your suggestion didn't work, so I hacked polars/py-polars/bin/activate again and it works now.

Ok, but it does work? :)

jangorecki commented 3 years ago

Yes, it works, and the new polars is already running.

ritchie46 commented 3 years ago

> Yes, it works, and the new polars is already running.

Great, thanks for your effort!

jangorecki commented 3 years ago

The 1e9 groupby runs are being terminated by the timeout, so they take more than 3 hours. Tomorrow's benchmark run should finish, so we will have a bigger picture.

ritchie46 commented 3 years ago

> The 1e9 groupby runs are being terminated by the timeout, so they take more than 3 hours. Tomorrow's benchmark run should finish, so we will have a bigger picture.

If the questions have not run at all, I think I know what it might be. I was trying to parse the larger datasets from the benchmark and noticed that the CSV parser scales quadratically on some edge cases.
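A rough way to check for this, sketched below with a hypothetical file path and assuming a recent polars where read_csv accepts n_rows, is to parse growing prefixes of the file and see whether elapsed time grows faster than linearly:

import time
import polars as pl

# If parsing scales linearly, doubling n_rows should roughly double the time;
# a much steeper growth points at a quadratic edge case in the parser.
for n_rows in (1_000_000, 2_000_000, 4_000_000):
    start = time.perf_counter()
    pl.read_csv("G1_1e9_1e2_0_0.csv", n_rows=n_rows)  # hypothetical path
    print(n_rows, round(time.perf_counter() - start, 2), "s")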

jangorecki commented 3 years ago

For the moment the report has not been refreshed. Some checksums of the answers produced by polars have changed, and that causes the report workflow to raise an exception.

Error in model_time(clean_time(load_time(path = path))) : 
  Value of 'chk' varies for different runs for single solution+question
Calls: <Anonymous> ... withVisible -> eval -> eval -> time_logs -> model_time

I have to review those checksums and invalidate the previous ones if needed (or possibly report the changed behavior to you). It can take a little while. Further discussed in https://github.com/ritchie46/polars/issues/357

If you want to access the timings now, append /time.csv to the report URL.
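A sketch of the kind of consistency check involved (assuming time.csv has solution, question, and chk columns, as the error message suggests, and using a recent polars API):

import polars as pl

# Flag solution+question pairs whose answer checksum is not constant
# across runs; these are what the report workflow rejects.
logs = pl.read_csv("time.csv")
varying = (
    logs.group_by("solution", "question")
        .agg(pl.col("chk").n_unique().alias("n_chk"))
        .filter(pl.col("n_chk") > 1)
)
print(varying)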

jangorecki commented 3 years ago

@ritchie46 Using 0.7.11, the groupby 1e9 data sizes are being killed by OOM during data load.