scikit-hep / awkward

Manipulate JSON-like data with NumPy-like idioms.
https://awkward-array.org
BSD 3-Clause "New" or "Revised" License

ak.zip seems to work recursively, but it doesn't really #2363

Open jpivarski opened 1 year ago

jpivarski commented 1 year ago

In dask-contrib/dask-awkward/issues/213, @masonproffitt asked for

>>> a = ak.Array([1])
>>> ak.zip({'a': a, 'b': {'c': a}})
<Array [{a: 1, b: {...}}] type='1 * {a: int64, b: {c: var * int64}}'>

to work in dask-awkward as it does in Awkward. But ak.zip doesn't really do a nested zip; it just calls ak.to_layout on each of the values of the dict.

https://github.com/scikit-hep/awkward/blob/6a24ed0d436bcd158f634d9bd9f6d664fff6bd2b/src/awkward/operations/ak_zip.py#L174-L190

For a nested dict (a general container that is neither an Awkward Array nor an ndarray), that means it falls back to ak.from_iter, which (1) is slow, (2) discards the NumPy dtypes, and (3) doesn't zip: the type you get back is a struct of arrays rather than an array of structs.

>>> array = ak.zip({
...     "a": {"b": np.arange(10, dtype=np.int8), "c": np.arange(10, dtype=np.int16)},
...     "d": {"e": np.arange(10, dtype=np.int32), "f": np.arange(10, dtype=np.float32)},
... })
>>> array.show(type=True)
type: {
    a: {
        b: var * int64,
        c: var * int64
    },
    d: {
        e: var * int64,
        f: var * float64
    }
}
{a: {b: [0, 1, 2, 3, 4, ..., 6, 7, 8, 9], c: [0, 1, ..., 9]},
 d: {e: [0, 1, 2, 3, 4, ..., 6, 7, 8, 9], f: [0, 1, ..., 9]}}

whereas

>>> array2 = ak.zip({"b": np.arange(10, dtype=np.int8), "c": np.arange(10, dtype=np.int16)})
>>> array2.show(type=True)
type: 10 * {
    b: int8,
    c: int16
}
[{b: 0, c: 0},
 {b: 1, c: 1},
 {b: 2, c: 2},
 {b: 3, c: 3},
 {b: 4, c: 4},
 {b: 5, c: 5},
 {b: 6, c: 6},
 {b: 7, c: 7},
 {b: 8, c: 8},
 {b: 9, c: 9}]

You can see all of the integer types turn into int64, and float32 into float64, because ak.from_iter treats the values as Python int and float, which loses the dtype. You can also see a different structure for the nested object: a single record whose fields are lists, rather than an array of ten records.
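
To make the structural point concrete, here is a minimal sketch in plain Python, with no Awkward involved. The field names and values are made up for illustration. A dict of lists is a struct of arrays; zipping it produces an array of structs, which is the shape that array2 has and that the nested fields of array lack.

```python
# Struct of arrays: a single record whose fields are whole lists.
struct_of_arrays = {"b": [0, 1, 2], "c": [10, 11, 12]}

# Array of structs: one record per element, which is what zipping produces.
array_of_structs = [
    dict(zip(struct_of_arrays, values))          # pair field names with one row of values
    for values in zip(*struct_of_arrays.values())  # walk the lists in lockstep
]

print(array_of_structs)
```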

It's not obvious to me what the correct behavior is. Treating any expected array-like uniformly with ak.to_layout is good for consistency, but @masonproffitt's interpretation is natural, too.

Originally posted by @jpivarski in https://github.com/dask-contrib/dask-awkward/issues/213#issuecomment-1497887851

agoose77 commented 1 year ago

I agree that this is a policy question.

I'd vote in favour of not recursively zipping, because ak.zip accepts useful parameters that might not apply to each call to ak.zip identically, i.e. the user may well want different depth_limit values. For simplicity and consistency, I'd prefer to require the user to call ak.zip multiple times.

masonproffitt commented 1 year ago

> I'd vote in favour of not recursively zipping, because ak.zip accepts useful parameters that might not apply to each call to ak.zip identically, i.e. the user may well want different depth_limit values. For simplicity and consistency, I'd prefer to require the user to call ak.zip multiple times.

I have no problem with this being the default behavior, but I'd love to see automatic recursive zipping as a feature (maybe as an optional argument to ak.zip?). This is a pretty common use case in handling func-adl-uproot queries.