h2oai / db-benchmark

reproducible benchmark of database-like ops
https://h2oai.github.io/db-benchmark
Mozilla Public License 2.0

pandas and dask could return full dataframe as a result - add reset_index(inplace=True) #68

Closed jangorecki closed 5 years ago

jangorecki commented 5 years ago

Currently pandas and dask return "indexed" dataframes. An Index (or "MultiIndex") in pandas/dask is a feature corresponding to R data.frame row names. While it brings several useful features, it does not seem to be very beneficial for performance (see this SO). It also complicates operations on those fields, because they are no longer regular columns. Additionally, the answers do not match the dimensions of answers from other solutions, because fields in the index (the columns you are grouping by, or some extra ones, as when using the nlargest method) are not counted as columns. It therefore seems reasonable to produce answers that are more "complete" and aligned with the answers produced by other solutions. Adding ans.reset_index(inplace=True) should be a sufficient, low-overhead solution for that. @mattdowle @st-pasha what is your opinion on that?
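To illustrate the dimension mismatch described above, here is a minimal pandas sketch. The frame and the column names (id1, id2, v1) are made up for illustration and are not taken from the benchmark scripts.

```python
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "id1": ["a", "a", "b", "b"],
    "id2": ["x", "y", "x", "x"],
    "v1": [1, 2, 3, 4],
})

# Grouping by two columns moves them into a MultiIndex,
# so only the aggregated column counts towards the shape.
ans = df.groupby(["id1", "id2"]).agg({"v1": "sum"})
indexed_shape = ans.shape              # grouping keys are not columns here
indexed_columns = list(ans.columns)

# reset_index turns the index levels back into regular columns.
ans.reset_index(inplace=True)
flat_shape = ans.shape
flat_columns = list(ans.columns)

print(indexed_shape, indexed_columns)
print(flat_shape, flat_columns)
```

After the reset, the grouping keys are ordinary columns again, so the number of columns matches what a solution without an index concept would report.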

jangorecki commented 5 years ago

@st-pasha does pyDT have a "feature" like that? Does it make sense to proceed with the proposed change for pandas/dask?

jangorecki commented 5 years ago

I measured the time difference for pandas when using reset_index(inplace=True) and there was no difference; sometimes it was even faster. I will apply the change and check it on all tests.
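A rough way to repeat this kind of measurement is sketched below. It is not the benchmark's actual timing code: the data size, the column names, and the helper time_groupby are assumptions made for illustration, and a single run like this is noisy.

```python
import time

import numpy as np
import pandas as pd

# Hypothetical input; the real benchmark uses much larger data.
rng = np.random.default_rng(0)
n = 100_000
df = pd.DataFrame({
    "id1": rng.integers(0, 100, size=n),
    "v1": rng.random(n),
})


def time_groupby(reset):
    """Time one grouped sum, optionally followed by reset_index."""
    start = time.perf_counter()
    ans = df.groupby("id1").agg({"v1": "sum"})
    if reset:
        # Move the grouping key from the index back into a column.
        ans.reset_index(inplace=True)
    elapsed = time.perf_counter() - start
    return elapsed, ans


t_indexed, ans_indexed = time_groupby(reset=False)
t_flat, ans_flat = time_groupby(reset=True)
print(f"indexed: {t_indexed:.4f}s, with reset_index: {t_flat:.4f}s")
```

Repeating each variant several times and comparing averages would give a more reliable picture than one pair of runs.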

jangorecki commented 5 years ago

On average pandas is 0.1% slower with reset_index(inplace=True), which is roughly 5s vs 4.995s. For dask the difference is 0.7%, roughly 5s vs 4.965s.
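The percentages follow from the example timings quoted above. A small sketch of that arithmetic, assuming the slowdown is taken relative to the faster run (the helper name slowdown_pct is hypothetical):

```python
def slowdown_pct(slow_s, fast_s):
    """Relative slowdown of slow_s over fast_s, in percent."""
    return (slow_s - fast_s) / fast_s * 100


# Example timings from the comment above, in seconds.
pandas_pct = slowdown_pct(5.0, 4.995)
dask_pct = slowdown_pct(5.0, 4.965)
print(round(pandas_pct, 1), round(dask_pct, 1))
```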