Question about pandas join benchmark setup

h2oai / db-benchmark

reproducible benchmark of database-like ops

https://h2oai.github.io/db-benchmark

Mozilla Public License 2.0


Question about pandas join benchmark setup #156

Closed ivirshup closed 3 years ago

ivirshup commented 3 years ago

https://github.com/h2oai/db-benchmark/blob/7adc04352d458410536be1e08f509ad4dc06eb72/pandas/join-pandas.py#L30-L42

I'm a little confused about this code in the pandas join benchmark.

First, why are the same columns being converted to categorical repeatedly? Should this be happening to different dataframes?

Second, why convert integer values to categorical values? I'd understand this for string values, but doing this seems slower than just using the integer values.

jangorecki commented 3 years ago

Thanks for spotting that. That conversion should obviously be done on each table, not repeatedly on the x table.

id1:id3 are string, id4:id6 are integer.

ivirshup commented 3 years ago

No problem!

Is the data used for this generated by _data/join-datagen.R? For me, that generates data where id1:id3 are integers, id4:id6 are strings and v* columns are floats.

As an example, after running Rscript _data/join-datagen.R 1e7 0 0 0 the top of the file data/J1_1e7_NA_0_0.csv looks like:

id1,id2,id3,id4,id5,id6,v1
8,2149,7609766,id8,id2149,id7609766,89.031743
4,4831,9001786,id4,id4831,id9001786,83.712121
3,8157,8096754,id3,id8157,id8096754,33.983582
2,5816,8216251,id2,id5816,id8216251,88.726157
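For anyone who wants to check the column types themselves, here is a minimal sketch that parses just the rows quoted above with pandas and groups the columns by inferred dtype. It only assumes pandas is installed; the variable names are mine, not from the benchmark scripts.

```python
import io

import pandas as pd

# The header and first rows of data/J1_1e7_NA_0_0.csv, as quoted above.
sample = io.StringIO(
    "id1,id2,id3,id4,id5,id6,v1\n"
    "8,2149,7609766,id8,id2149,id7609766,89.031743\n"
    "4,4831,9001786,id4,id4831,id9001786,83.712121\n"
    "3,8157,8096754,id3,id8157,id8096754,33.983582\n"
    "2,5816,8216251,id2,id5816,id8216251,88.726157\n"
)
x = pd.read_csv(sample)

# Group column names by the dtype pandas inferred for them.
int_cols = [c for c in x.columns if pd.api.types.is_integer_dtype(x[c])]
float_cols = [c for c in x.columns if pd.api.types.is_float_dtype(x[c])]
str_cols = [c for c in x.columns if not pd.api.types.is_numeric_dtype(x[c])]

print("integer:", int_cols)
print("string: ", str_cols)
print("float:  ", float_cols)
```

On this sample the integer group is id1:id3, the string group is id4:id6, and v1 is the only float column.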

jangorecki commented 3 years ago

You are absolutely correct. I mixed that up with the groupby data, where the leading columns are of categorical type. Thank you very much.
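A rough sketch of what the per-table conversion could look like, assuming the string columns are the ones that should become categoricals. This is not the actual code from join-pandas.py: the helper name `categorize_string_keys`, the table name `small`, and the tiny made-up frames are purely illustrative.

```python
import pandas as pd


def categorize_string_keys(df):
    """Return a copy of df whose non-numeric id* columns are categoricals."""
    out = df.copy()
    for col in out.columns:
        if col.startswith("id") and not pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].astype("category")
    return out


# Tiny made-up tables that mimic the join data layout (integer id1, string id4).
x = pd.DataFrame(
    {"id1": [8, 4, 3], "id4": ["id8", "id4", "id3"], "v1": [89.0, 83.7, 34.0]}
)
small = pd.DataFrame({"id1": [8, 3], "id4": ["id8", "id3"], "v2": [1.5, 2.5]})

# Convert each table once, instead of converting x repeatedly.
x = categorize_string_keys(x)
small = categorize_string_keys(small)

# Integer keys stay integers, so the join on id1 uses plain integer values.
joined = x.merge(small, on="id1")
print(joined.dtypes)
```

Selecting columns by dtype rather than by name means the integer keys are left alone regardless of which id columns turn out to be strings.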

jangorecki commented 3 years ago

The report was rerun, but because this pandas version had already been benchmarked, its timings were not refreshed. They will be refreshed after the next pandas release, or sooner if I force a rerun, but I won't have access to a workstation for a couple of weeks.

ivirshup commented 3 years ago

Thanks for the quick responses on this!

jangorecki commented 3 years ago

Let's keep this issue open until the timings are updated in the report.

jangorecki commented 3 years ago

The pandas version was updated in the meantime, so the timings have now been refreshed with categoricals used properly. Thanks again.