h2oai / db-benchmark

reproducible benchmark of database-like ops
https://h2oai.github.io/db-benchmark
Mozilla Public License 2.0

pandas not able to solve q10 anymore #142

Closed jangorecki closed 4 years ago

jangorecki commented 4 years ago

This is not a recent regression, but it was only spotted now. q10 used to work on pandas 0.24.2 but now fails with a MemoryError. It affects both the 1e7 and 1e8 data sizes. Reported upstream in https://github.com/pandas-dev/pandas/issues/32918

jangorecki commented 4 years ago

The issue is still valid; pasting the full error:

Traceback (most recent call last):
  File "./pandas/groupby-pandas.py", line 290, in <module>
    ans = x.groupby(['id1','id2','id3','id4','id5','id6']).agg({'v3':'sum', 'v1':'count'})
  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/pandas/core/groupby/generic.py", line 928, in aggregate
    result, how = self._aggregate(func, *args, **kwargs)
  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/pandas/core/base.py", line 419, in _aggregate
    result = _agg(arg, _agg_1dim)
  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/pandas/core/base.py", line 386, in _agg
    result[fname] = func(fname, agg_how)
  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/pandas/core/base.py", line 370, in _agg_1dim
    return colg.aggregate(how)
  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/pandas/core/groupby/generic.py", line 247, in aggregate
    return getattr(self, func)(*args, **kwargs)
  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/pandas/core/groupby/groupby.py", line 1371, in f
    return self._cython_agg_general(alias, alt=npfunc, **kwargs)
  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/pandas/core/groupby/groupby.py", line 909, in _cython_agg_general
    return self._wrap_aggregated_output(output)
  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/pandas/core/groupby/generic.py", line 386, in _wrap_aggregated_output
    return self._reindex_output(result)._convert(datetime=True)
  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/pandas/core/groupby/groupby.py", line 2483, in _reindex_output
    levels_list, names=self.grouper.names
  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/pandas/core/indexes/multi.py", line 552, in from_product
    codes = cartesian_product(codes)
  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/pandas/core/reshape/util.py", line 58, in cartesian_product
    for i, x in enumerate(X)
  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/pandas/core/reshape/util.py", line 58, in <listcomp>
    for i, x in enumerate(X)
  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/numpy/core/fromnumeric.py", line 445, in repeat
    return _wrapfunc(a, 'repeat', repeats, axis=axis)
  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/numpy/core/fromnumeric.py", line 51, in _wrapfunc
    return getattr(obj, method)(*args, **kwds)
MemoryError
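
The traceback ends in `_reindex_output`, `MultiIndex.from_product` and `cartesian_product`, which suggests the result is being expanded to every combination of grouping levels rather than only the combinations present in the data. That is what pandas does when the grouping columns are categorical and `observed=False`. The sketch below reproduces the effect at toy scale. It assumes the `id` columns are categoricals (the traceback does not show their dtypes); the frame, its values and the two-key grouping are made up for illustration and are not the benchmark data.

```python
import pandas as pd

# Toy frame: each key has 3 categories, but only 3 of the 9 possible
# (id1, id2) pairs actually occur in the rows.
x = pd.DataFrame({
    "id1": pd.Categorical(["a", "b", "c", "a"]),
    "id2": pd.Categorical(["x", "y", "z", "x"]),
    "v1": [1, 2, 3, 4],
    "v3": [0.5, 1.5, 2.5, 3.5],
})

keys = ["id1", "id2"]
spec = {"v3": "sum", "v1": "count"}

# observed=False reindexes the result to the full cartesian product
# of the category levels, including pairs that never occur.
full = x.groupby(keys, observed=False).agg(spec)

# observed=True keeps only the pairs that are present in the data.
seen = x.groupby(keys, observed=True).agg(spec)

# Size of the cartesian product that observed=False has to materialise.
n_product = 1
for k in keys:
    n_product *= len(x[k].cat.categories)

print(len(full), len(seen), n_product)
```

With six high-cardinality keys, as in q10, that product can be far larger than the number of rows, so the reindex can exhaust memory even when the aggregation itself fits. If the assumption about categorical keys holds, passing `observed=True` to `groupby` is one possible workaround.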