vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License
8.26k stars, 590 forks

[BUG-REPORT] Having issues using mode() #1530

Open mahsheed opened 3 years ago

mahsheed commented 3 years ago

Description: I was not able to get the mode() feature to work, and I could not find any examples of it being used. Does anyone know what the issue might be?

One approach I tried was the following:

import pandas as pd
import vaex
df = pd.DataFrame({"id": [0, 1, 2, 3], "num" : [1, 1, 3, 4], "str": ["a", "a", "b", None]})
df = vaex.from_pandas(df)
df.mode(expression="num", binby=['id'])

Software information

Additional information: Here is the error message I am getting:

ERROR:MainThread:vaex.execution:error in task, flush task queue
Traceback (most recent call last):
  File "/home/ec2-user/workspace/facets-venv/lib64/python3.7/site-packages/vaex/execution.py", line 175, in execute_async
    spec = encoding.encode('task', task)
  File "/home/ec2-user/workspace/facets-venv/lib64/python3.7/site-packages/vaex/encoding.py", line 425, in encode
    encoded = self.registry[typename].encode(self, value)
  File "/home/ec2-user/workspace/facets-venv/lib64/python3.7/site-packages/vaex/tasks.py", line 18, in encode
    return task.encode(encoding)
AttributeError: 'TaskHistogram' object has no attribute 'encode'
23-Aug-21 21:18:16 - vaex.execution - ERROR - error in task, flush task queue
Traceback (most recent call last):
  File "/home/ec2-user/workspace/facets-venv/lib64/python3.7/site-packages/vaex/execution.py", line 175, in execute_async
    spec = encoding.encode('task', task)
  File "/home/ec2-user/workspace/facets-venv/lib64/python3.7/site-packages/vaex/encoding.py", line 425, in encode
    encoded = self.registry[typename].encode(self, value)
  File "/home/ec2-user/workspace/facets-venv/lib64/python3.7/site-packages/vaex/tasks.py", line 18, in encode
    return task.encode(encoding)
AttributeError: 'TaskHistogram' object has no attribute 'encode'

kmcentush commented 3 years ago

Does using df.groupby('id').agg({'num': 'mode'}) achieve your desired result? The TaskHistogram is in some legacy code I'm not familiar with, but the groupby/agg method may do the trick.

mahsheed commented 3 years ago

Hi @kmcentush,

Running that works with 'num': 'mean', but it does not work with 'num': 'mode'.

df.groupby('id').agg({'num': 'mode'}) produces the following error:


---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-410-cef13c7bbe73> in <module>
      4 df = vaex.from_pandas(df)
      5 #df.mode(expression="num", binby=['id'])
----> 6 df.groupby('id').agg({'num': 'mode'})

~/workspace/facets-venv/lib64/python3.7/site-packages/vaex/groupby.py in agg(self, actions)
    434         # TODO: this basically forms a cartesian product, we can do better, use a
    435         # 'multistage' hashmap
--> 436         arrays = super(GroupBy, self)._agg(actions)
    437         # we don't want non-existing pairs (e.g. Amsterdam in France does not exist)
    438         counts = self.counts

~/workspace/facets-venv/lib64/python3.7/site-packages/vaex/groupby.py in _agg(self, actions)
    338                 else:
    339                     if isinstance(aggregate, six.string_types):
--> 340                         aggregate = vaex.agg.aggregates[aggregate]
    341                     if callable(aggregate):
    342                         if name is None:

KeyError: 'mode'

kmcentush commented 3 years ago

Hi @mahsheed. I'll dig into this more tomorrow. Definitely seems like a bug based on your stack traces!

kmcentush commented 3 years ago

Looks like agg doesn't support mode yet. I'm digging into the df.mode() call, and it looks like it's the only legacy task in Vaex that doesn't have the proper helper methods used by all of the other tasks.
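For anyone wondering why the failure shows up as a KeyError: the groupby traceback above shows a string aggregator name being resolved through a dictionary lookup (vaex.agg.aggregates[aggregate]). A rough illustration of that pattern follows. The `aggregates` dict and `resolve` helper here are hypothetical stand-ins written for this comment, not Vaex's actual table or code.

```python
from statistics import mean

# Hypothetical name -> aggregator registry (an assumption for illustration,
# not the real contents of vaex.agg.aggregates).
aggregates = {
    "mean": mean,
    "count": len,
    "sum": sum,
}


def resolve(aggregate):
    """Turn a string name into a callable, similar in spirit to the lookup in the traceback."""
    if isinstance(aggregate, str):
        # An unregistered name such as "mode" raises KeyError here.
        return aggregates[aggregate]
    return aggregate


print(resolve("mean")([1, 1, 3, 4]))

try:
    resolve("mode")
except KeyError as exc:
    print("unsupported aggregator name:", exc)
```

So 'mean' resolves because it is registered under that name, while 'mode' fails at the lookup step, before any data is touched.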

@maartenbreddels @JovanVeljanoski, is the ideal fix to make an updated TaskHistogram that is supported by the delayed executor? Or is a better solution to build something out for agg and then have the dataframe just call that and group by the binby arg?
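To make the second option more concrete, here is a minimal sketch of what a grouped mode could compute: count every (group, value) pair, then keep the most frequent value per group. It is plain Python rather than Vaex code. The function name `mode_by_group`, the skipping of missing values, and the tie-breaking rule are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def mode_by_group(keys, values):
    """Return {key: most frequent non-missing value} via a two-stage count."""
    counts = defaultdict(Counter)
    # Stage 1: count occurrences of every (key, value) pair,
    # skipping missing values (an assumption of this sketch).
    for key, value in zip(keys, values):
        if value is not None:
            counts[key][value] += 1
    # Stage 2: keep the value with the highest count in each group.
    # Ties go to the value seen first, an arbitrary choice here.
    return {key: counter.most_common(1)[0][0] for key, counter in counts.items()}


# Illustrative data (not taken from the issue).
ids = [0, 0, 0, 1, 1, 2]
nums = [1, 1, 3, 4, 4, None]
print(mode_by_group(ids, nums))
```

A group whose values are all missing does not appear in the result. A real aggregator would need to decide what to return in that case, and how to break ties.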