scikit-hep / awkward-0.x

Manipulate arrays of complex data structures as easily as Numpy.
BSD 3-Clause "New" or "Revised" License

mean, std fail on ChunkedArrays #249

Closed lhenkelm closed 4 years ago

lhenkelm commented 4 years ago

The error appears to be specific to mean and std. min, max, and count seem to work; I did not try other reducers. I am not sure where the difference originates, but I have only seen it with ChunkedArrays. Here is a small example:

>>> import awkward
>>> import numpy as np
>>> chunks = [
...     np.arange(5),
...     np.arange(5),
...     np.arange(5)
... ]
>>> a = awkward.ChunkedArray(chunks, [5, 5, 5])
>>> a.mean()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../venv/lib64/python3.6/site-packages/awkward/array/base.py", line 223, in mean
    return self.numpy.true_divide(self.sum(), self.count())
  File ".../venv/lib64/python3.6/site-packages/awkward/array/base.py", line 196, in count
    return self._reduce(None, 0, None)
  File ".../venv/lib64/python3.6/site-packages/awkward/array/chunked.py", line 698, in _reduce
    this = self._util_reduce(chunk[:self._chunksizes[chunkid]], ufunc, identity, dtype)
  File ".../venv/lib64/python3.6/site-packages/awkward/array/base.py", line 554, in _util_reduce
    return ufunc.reduce(array, axis=None)
AttributeError: 'NoneType' object has no attribute 'reduce'
>>> b = awkward.fromiter(np.arange(15))
>>> b.mean()
7.0

This also affects NumPy functions called on the Awkward objects:

>>> np.mean(b)
7.0
>>> np.mean(a)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<__array_function__ internals>", line 6, in mean
  File ".../venv/lib64/python3.6/site-packages/numpy/core/fromnumeric.py", line 3370, in mean
    return mean(axis=axis, dtype=dtype, out=out, **kwargs)
TypeError: mean() got an unexpected keyword argument 'axis'

For now I am working around this by converting to NumPy arrays and calling mean and std on those, so it's not urgent.
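
A minimal sketch of that workaround, assuming the chunks are plain NumPy arrays as in the example above and small enough to concatenate in memory (the names flat, mean, and std are only illustrative):

```python
import numpy as np

# Same chunks as in the example above.
chunks = [
    np.arange(5),
    np.arange(5),
    np.arange(5),
]

# Bypass the ChunkedArray reducers: build one flat NumPy array
# from the chunks and let NumPy compute the statistics.
flat = np.concatenate(chunks)

mean = flat.mean()  # arithmetic mean over all 15 values
std = flat.std()    # population standard deviation (ddof=0, NumPy's default)

print(mean, std)
```

This copies all the data into a single array, so it only makes sense when everything fits in memory.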

jpivarski commented 4 years ago

We should be phasing out the use of Awkward 0, especially for ChunkedArrays, which were very hard to get right in Awkward 0 (lots of NumPy-based edge cases).

Just verifying that all of this is fine in Awkward 1 (see ak.partitioned for this array-creation routine):

>>> import awkward1 as ak
>>> import numpy as np
>>> a = ak.partitioned(lambda n: ak.Array(np.arange(5)), 3)
>>> a
<Array [0, 1, 2, 3, 4, 0, ... 4, 0, 1, 2, 3, 4] type='15 * int64'>
>>> np.mean(a)
2.0

And if the original array came from Awkward 0, use ak.from_awkward0:

>>> import awkward
>>> chunks = [
...     np.arange(5),
...     np.arange(5),
...     np.arange(5)
... ]
>>> a = awkward.ChunkedArray(chunks, [5, 5, 5])
>>> a
<ChunkedArray [0 1 2 ... 2 3 4] at 0x7f3622f65f10>
>>> a1 = ak.from_awkward0(a)
>>> a1
<Array [0, 1, 2, 3, 4, 0, ... 4, 0, 1, 2, 3, 4] type='15 * int64'>
>>> np.mean(a1)
2.0

Note: there's a new restriction that chunking is only allowed at the top level of a data structure (and hence it has been renamed to "partitioning"). This is partly to reduce implementation complexity, but also because the value of having an array in partitions is that different jobs can work on different pieces, or the pieces can be loaded one at a time (i.e. either for compute efficiency or for staying within memory limits), and both of these only need chunking at the top level. With the default keeplayout=False in ak.from_awkward0, deeply nested ChunkedArrays will be concatenated; otherwise, they would raise an error.

Note on the note: we'll be changing the name of keeplayout to keep_layout (scikit-hep/awkward-1.0#310). The current name was an oversight when adapting to cleaner names.
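
To illustrate the point above about pieces being loaded one at a time, here is a minimal pure-NumPy sketch of computing mean and std by visiting one top-level chunk at a time and combining per-chunk counts, sums, and sums of squares. This is not how Awkward implements its reducers; chunked_mean_std is a hypothetical helper, and the sum-of-squares formula can lose precision for large or widely offset values.

```python
import numpy as np

def chunked_mean_std(chunks):
    """Hypothetical helper: mean and population std over an iterable of 1-D chunks."""
    n = 0           # total number of values seen so far
    total = 0.0     # running sum of values
    total_sq = 0.0  # running sum of squared values
    for chunk in chunks:
        # Only one chunk needs to be in memory at a time.
        chunk = np.asarray(chunk, dtype=np.float64)
        n += chunk.size
        total += chunk.sum()
        total_sq += np.square(chunk).sum()
    if n == 0:
        raise ValueError("cannot reduce over zero values")
    mean = total / n
    # Var[x] = E[x^2] - E[x]^2; clamp tiny negative round-off to zero.
    var = max(total_sq / n - mean**2, 0.0)
    return mean, float(np.sqrt(var))

chunks = [np.arange(5), np.arange(5), np.arange(5)]
mean, std = chunked_mean_std(chunks)
print(mean, std)
```

Because each chunk contributes only three numbers, the same accumulation could run in separate jobs and be merged afterwards.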