vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License
8.3k stars 590 forks

[BUG-REPORT] df.count error with selections and no limits #2151

Open arunpersaud opened 2 years ago

arunpersaud commented 2 years ago

Description

df.count raises an error when called with a list of selections and no limits, but works when limits are given.

Software information

Additional information

Here is some example code:

import vaex as vx
df = vx.example()

# this works
hist_x = df.count("*", binby="x", shape=1024, selection=None)

# this gives an IndexError (see below)
hist_x = df.count("*", binby="x", shape=1024, selection=[None])

# this works again
hist_x = df.count("*", binby="x", shape=1024, selection=[None], limits=(1, 10))

The error I'm getting is:

>>> hist_x = df.count("*", binby="x", shape=1024, selection=[None])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/homebrew/lib/python3.9/site-packages/vaex/dataframe.py", line 965, in count
    return self._compute_agg('count', expression, binby, limits, shape, selection, delay, edges, progress, array_type=array_type)
  File "/opt/homebrew/lib/python3.9/site-packages/vaex/dataframe.py", line 939, in _compute_agg
    return self._delay(delay, progressbar.exit_on(var))
  File "/opt/homebrew/lib/python3.9/site-packages/vaex/dataframe.py", line 1779, in _delay
    return task.get()
  File "/opt/homebrew/lib/python3.9/site-packages/aplus/__init__.py", line 170, in get
    raise self._reason
  File "/opt/homebrew/lib/python3.9/site-packages/vaex/promise.py", line 121, in callAndReject
    ret.fulfill(failure(r))
  File "/opt/homebrew/lib/python3.9/site-packages/vaex/progress.py", line 91, in error
    raise arg
  File "/opt/homebrew/lib/python3.9/site-packages/vaex/promise.py", line 121, in callAndReject
    ret.fulfill(failure(r))
  File "/opt/homebrew/lib/python3.9/site-packages/vaex/delayed.py", line 38, in _wrapped
    raise exc
  File "/opt/homebrew/lib/python3.9/site-packages/vaex/promise.py", line 121, in callAndReject
    ret.fulfill(failure(r))
  File "/opt/homebrew/lib/python3.9/site-packages/vaex/delayed.py", line 38, in _wrapped
    raise exc
  File "/opt/homebrew/lib/python3.9/site-packages/vaex/promise.py", line 121, in callAndReject
    ret.fulfill(failure(r))
  File "/opt/homebrew/lib/python3.9/site-packages/vaex/delayed.py", line 38, in _wrapped
    raise exc
  File "/opt/homebrew/lib/python3.9/site-packages/vaex/promise.py", line 106, in callAndFulfill
    ret.fulfill(success(v))
  File "/opt/homebrew/lib/python3.9/site-packages/vaex/delayed.py", line 82, in call
    return f(*args_real, **kwargs_real)
  File "/opt/homebrew/lib/python3.9/site-packages/vaex/dataframe.py", line 5575, in create_binner
    return self._binner_scalar(expression, limits, shape)
  File "/opt/homebrew/lib/python3.9/site-packages/vaex/dataframe.py", line 5581, in _binner_scalar
    return BinnerScalar(expression, limits[0], limits[1], shape, dtype)
IndexError: index 1 is out of bounds for axis 0 with size 1
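
The last frame of the traceback hints at the cause: `_binner_scalar` reads `limits[0]` and `limits[1]`, yet the message says axis 0 has size 1. One possible reading, which is an assumption and not confirmed against the vaex source, is that with a list of selections the automatically computed limits carry one `[min, max]` row per selection, so `limits` is shaped like `(1, 2)` here rather than `(2,)`. A minimal pure-Python sketch of that failure mode follows; nested lists stand in for the NumPy array, `unpack_limits` is a hypothetical helper, and the numeric values are purely illustrative:

```python
def unpack_limits(limits):
    """Mimic the `limits[0], limits[1]` access seen in the traceback.

    Hypothetical helper for illustration; not part of vaex.
    """
    return limits[0], limits[1]


# Explicit limits: a flat (min, max) pair unpacks as expected.
flat_limits = (1, 10)
print(unpack_limits(flat_limits))

# Assumed layout of auto-computed limits for selection=[None]:
# one [min, max] row per selection, so the outer axis has length 1.
per_selection_limits = [[-100.0, 100.0]]  # illustrative values only
try:
    unpack_limits(per_selection_limits)
except IndexError as exc:
    # A list reports "list index out of range"; a NumPy array reports
    # the "out of bounds for axis 0 with size 1" message shown above.
    print("IndexError:", exc)
```

If this reading is right, it would also explain why passing `limits=(1, 10)` avoids the error: the flat pair is indexed exactly as the binner expects.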

JovanVeljanoski commented 2 years ago

Well, I can admit that this is technically a bug, but this is also you abusing the system.

In principle selection=[None] should be the same as selection=None, I suppose.

Edit: although I see that count crashes for any list of selections without limits. OK, let's see if we can fix it. Thank you for the report!
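
Until a fix lands, one possible workaround is to compute the limits up front and pass them explicitly, since the calls with `limits=` are reported to work. The sketch below assumes `df` is a vaex DataFrame and that `df.minmax(expression)` returns a `(min, max)` pair for the whole column; the helper name `count_with_selections` is hypothetical:

```python
def count_with_selections(df, binby, selections, shape=1024):
    """Count rows per selection on a 1-D grid, passing limits explicitly.

    `df` is expected to be a vaex DataFrame. This helper is a hypothetical
    workaround sketch, not part of the vaex API.
    """
    # One global (min, max) pair for the binned expression, computed
    # without any selection so every selection shares the same grid.
    lo, hi = df.minmax(binby)
    return df.count(
        "*",
        binby=binby,
        shape=shape,
        selection=selections,
        limits=[float(lo), float(hi)],
    )
```

With the example dataset this would be called as `count_with_selections(df, "x", [df.x > 0, df.x < 0])`. Using the global range keeps the bins identical across selections, which makes the resulting histograms directly comparable.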

arunpersaud commented 2 years ago

I just used selection=[None] as an example; a more realistic one would perhaps be:

# works
df.count("*", binby="x", shape=1024, selection=[df.x>0, df.x<0], limits=[-10,10])

# doesn't work
df.count("*", binby="x", shape=1024, selection=[df.x>0, df.x<0])

The error for the second one that I'm getting is:

>>> df.count("*", binby="x", shape=1024, selection=[df.x>0, df.x<0])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/arun/.local/lib/python3.8/site-packages/vaex/dataframe.py", line 962, in count
    return self._compute_agg('count', expression, binby, limits, shape, selection, delay, edges, progress, array_type=array_type)
  File "/home/arun/.local/lib/python3.8/site-packages/vaex/dataframe.py", line 936, in _compute_agg
    return self._delay(delay, progressbar.exit_on(var))
  File "/home/arun/.local/lib/python3.8/site-packages/vaex/dataframe.py", line 1775, in _delay
    self.execute()
  File "/home/arun/.local/lib/python3.8/site-packages/vaex/dataframe.py", line 417, in execute
    self.executor.execute()
  File "/home/arun/.local/lib/python3.8/site-packages/vaex/execution.py", line 308, in execute
    for _ in self.execute_generator():
  File "/home/arun/.local/lib/python3.8/site-packages/vaex/execution.py", line 345, in execute_generator
    tasks = _merge(tasks)
  File "/home/arun/.local/lib/python3.8/site-packages/vaex/execution.py", line 137, in _merge
    tasks_merged.extend(_merge_tasks_for_df(tasks_df, df))
  File "/home/arun/.local/lib/python3.8/site-packages/vaex/execution.py", line 151, in _merge_tasks_for_df
    tasks_agg_per_grid[task.binners].append(task)
  File "/home/arun/.local/lib/python3.8/site-packages/vaex/dataframe.py", line 7199, in __hash__
    return hash((self.__class__.__name__, self.expression, self.minimum, self.maximum, self.count, self.dtype))
TypeError: unhashable type: 'numpy.ndarray'
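
This second traceback fails in a different place but may share the same root cause. The binner's `__hash__` builds a tuple from `minimum` and `maximum`, and hashing a tuple hashes each element. If, as assumed here and not verified in the vaex source, the limits computed for two selections arrive as arrays instead of scalars, the tuple contains an unhashable object. A small sketch with a hypothetical `BinnerSketch` class, using lists as a stand-in for `numpy.ndarray` (both are mutable and unhashable):

```python
class BinnerSketch:
    """Hypothetical stand-in for the binner whose __hash__ appears in the traceback."""

    def __init__(self, expression, minimum, maximum, count):
        self.expression = expression
        self.minimum = minimum
        self.maximum = maximum
        self.count = count

    def __hash__(self):
        # Same pattern as in the traceback: hash a tuple of the fields.
        return hash((self.__class__.__name__, self.expression,
                     self.minimum, self.maximum, self.count))


# Scalar limits: every tuple element is hashable, so hashing succeeds.
scalar_binner = BinnerSketch("x", -10.0, 10.0, 1024)
print(isinstance(hash(scalar_binner), int))

# Per-selection limits (assumed): minimum and maximum are sequences.
array_binner = BinnerSketch("x", [-10.0, 0.0], [0.0, 10.0], 1024)
try:
    hash(array_binner)
except TypeError as exc:
    print("TypeError:", exc)
```

The executor uses binners as dictionary keys when merging tasks (`tasks_agg_per_grid[task.binners]`), which is why the hash is computed at all and why the error surfaces there instead of in `_binner_scalar`.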