Closed ivirshup closed 3 years ago
Thanks for spotting that. That conversion should obviously be applied to each table, not repeatedly to the x table.
id1:id3 are string, id4:id6 are integer.
No problem!
Is the data used for this generated by _data/join-datagen.R? For me, that generates data where id1:id3 are integers, id4:id6 are strings, and the v* columns are floats.
As an example, after running Rscript _data/join-datagen.R 1e7 0 0 0, the top of the file data/J1_1e7_NA_0_0.csv looks like:
id1,id2,id3,id4,id5,id6,v1
8,2149,7609766,id8,id2149,id7609766,89.031743
4,4831,9001786,id4,id4831,id9001786,83.712121
3,8157,8096754,id3,id8157,id8096754,33.983582
2,5816,8216251,id2,id5816,id8216251,88.726157
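A quick way to confirm the column types is to parse the rows above and inspect the inferred dtypes. This is only a sketch: it reads the four sample rows from an in-memory string rather than the full generated file, and the variable names `sample` and `df` are illustrative.

```python
import io

import pandas as pd

# The header and four rows quoted above, held in memory for illustration.
sample = """id1,id2,id3,id4,id5,id6,v1
8,2149,7609766,id8,id2149,id7609766,89.031743
4,4831,9001786,id4,id4831,id9001786,83.712121
3,8157,8096754,id3,id8157,id8096754,33.983582
2,5816,8216251,id2,id5816,id8216251,88.726157
"""

df = pd.read_csv(io.StringIO(sample))

# Print the dtype pandas inferred for each column.
print(df.dtypes)
```

With these rows, id1:id3 are inferred as integers, id4:id6 as strings, and v1 as a float.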
You are absolutely correct. I mixed that up with the groupby data, where the leading columns are of categorical type. Thank you very much.
The report was rerun, but because this pandas version had already been run, the timings were not refreshed. They will be refreshed after the next pandas release, or sooner if I force a rerun, but I will not have a workstation to do that for a couple of weeks.
Thanks for the quick responses on this!
Let's keep this issue open until the timings are updated in the report.
The pandas version got updated in the meantime, so the timings have now been refreshed with categoricals used properly. Thanks again.
https://github.com/h2oai/db-benchmark/blob/7adc04352d458410536be1e08f509ad4dc06eb72/pandas/join-pandas.py#L30-L42
I'm a little confused about this code in the pandas join benchmark.
First, why are the same columns being converted to categorical repeatedly? Should this be happening to different dataframes?
Second, why convert integer values to categorical values? I'd understand this for string values, but doing this seems slower than just using the integer values.
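To make the two questions concrete, here is a minimal sketch of what the conversion might look like if it were applied once per dataframe and only to string-valued key columns. It is not the benchmark's actual code: the helper `to_categorical`, the frames `x` and `small`, and the choice of id4:id6 as the string columns are all assumptions made for illustration.

```python
import pandas as pd


def to_categorical(df, columns):
    """Return a copy of df with the listed columns (where present) cast to category."""
    out = df.copy()
    for col in columns:
        if col in out.columns:
            out[col] = out[col].astype("category")
    return out


# Hypothetical stand-ins for the left table and one of the join tables.
x = pd.DataFrame({"id1": [1, 2], "id4": ["id1", "id2"], "v1": [0.5, 1.5]})
small = pd.DataFrame({"id1": [1, 2], "id4": ["id1", "id2"], "v2": [10.0, 20.0]})

# Assumed string-valued key columns; integer keys such as id1 are left as-is.
string_keys = ["id4", "id5", "id6"]

# Convert each table once, rather than converting x repeatedly.
x, small = (to_categorical(t, string_keys) for t in (x, small))

print(x.dtypes)
print(small.dtypes)
```

The integer key id1 keeps its integer dtype in both frames, while id4 becomes categorical in each.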