scikit-hep / awkward

Manipulate JSON-like data with NumPy-like idioms.
https://awkward-array.org
BSD 3-Clause "New" or "Revised" License
848 stars 89 forks source link

Different behavior for awkward 2.x on ak.flatten with unknown type #2207

Closed raymondEhlers closed 1 year ago

raymondEhlers commented 1 year ago

Version of Awkward Array

2.0.6

Description and code to reproduce

In some of my code, I frequently check that there are entries left in the array before trying to proceed to the next step. My use case is calling out to some c++ code to do jet finding event by event, so if there are no jets found, it eventually returns a full array of events, all of which are empty.

In awkward 1.x, I used ak.flatten(array["data"].px, axis=None) to flatten out a record (px is one of the fields, not calculated by vector) to check for entries (the particular record didn't really matter - I just needed one). In awkward 2.x, if I do the same, I get an assertion error:

In [13]: ak.flatten(jets["data"].px, axis=None)
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In [13], line 1
----> 1 ak.flatten(jets["data"].px, axis=None)

File ~/software/dev/mammoth/.venv/lib/python3.10/site-packages/awkward/operations/ak_flatten.py:161, in flatten(array, axis, highlevel, behavior)
      9 """
     10 Args:
     11     array: Array-like data (anything #ak.to_layout recognizes).
   (...)
    155      999]
    156 """
    157 with ak._errors.OperationErrorContext(
    158     "ak.flatten",
    159     dict(array=array, axis=axis, highlevel=highlevel, behavior=behavior),
    160 ):
--> 161     return _impl(array, axis, highlevel, behavior)

File ~/software/dev/mammoth/.venv/lib/python3.10/site-packages/awkward/operations/ak_flatten.py:173, in _impl(array, axis, highlevel, behavior)
    168     out = ak._do.completely_flatten(layout, function_name="ak.flatten")
    169     assert isinstance(out, tuple) and all(
    170         isinstance(x, ak.contents.NumpyArray) for x in out
    171     )
--> 173     result = ak._do.mergemany(out)
    175     return ak._util.wrap(result, behavior, highlevel)
    177 elif axis == 0 or ak._util.maybe_posaxis(layout, axis, 1) == 0:

File ~/software/dev/mammoth/.venv/lib/python3.10/site-packages/awkward/_do.py:303, in mergemany(contents)
    302 def mergemany(contents: list[Content]) -> Content:
--> 303     assert len(contents) != 0
    304     return contents[0]._mergemany(contents[1:])

AssertionError:

Some more info on the array:

In [12]: jets.type.show()
81 * {
    data: var * Momentum4D[
        px: unknown,
        py: unknown,
        pz: unknown,
        E: unknown,
        area: unknown,
        rho_value: float64,
        constituents: var * Momentum4D[
            px: float64,
            py: float64,
            pz: float64,
            E: float64,
            index: int64
        ],
        unsubtracted_leading_track_pt: unknown
    ]
}

It seems to be about the "unknown" nature of the type, since if I do the same with pt, vector calculates that field, which then types it with float64, and the call works. Also note that if I flatten without axis=None, it works fine. A pickle with the jets array is attached regression_jets.pkl.zip (with .zip as the extension to allow it to be uploaded). I can in principle switch over to pt, but it would be nicer if this worked as in awkward 1.x . Thanks!

edit: I know for sure that the behavior changed, but now I'm a bit less confident it is truly a regression as opposed to the possibility that it's an intended change in behavior. If so, I suppose please let me know, and I'll have to dig through my codebase to change it

agoose77 commented 1 year ago

@jpivarski EmptyArray._remove_structure currently returns []. We could return [self], which would likely solve this bug, and probably be more useful too.

jpivarski commented 1 year ago

I don't remember what Content._remove_structure means. Does the v1 version of it (effectively, translating from C++) return [self]?

agoose77 commented 1 year ago

It's a rename of v2's Content._completely_flatten with an extension to support keeping list dimensions (but making the outer dimensions all length 1). We use this for axis=None reduction with keepdims=True

The suggestion to change its behaviour is not motivated by it being a bug inasmuch as "what would be more useful"?

jpivarski commented 1 year ago

The v1 completely_flatten was in Python, in _util.py.

And it had EmptyArrays (unknowntypes) go to an empty NumPy-like array (of bool_ for some reason[^1]):

https://github.com/scikit-hep/awkward/blob/80bbef0738a6b7928333d7c705ee1b359991de5b/src/awkward/_util.py#L564-L576

So the new behavior of having _remove_structure go to [] is definitely different and could be the cause of a bug that was not in v1. If _remove_structure was returning NumPy-like arrays, EmptyArray should return an empty NumPy-like array, but I think you changed it to return Content subclasses now, right? (To preserve option-types.)

[^1]: Maybe because bool_ combines with any other dtype as an identity? The concatenation of an array of bool_ and an array of T returns an array of T?

agoose77 commented 1 year ago

Thanks for digging!

EmptyArray should return an empty NumPy-like array, but I think you changed it to return Content subclasses now, right?

Yes, now this returns 1D contents, so it's legitimate for us to return EmptyArray. In general, if we can avoid interpreting the EmptyArray as some arbitrary NumpyArray type, we preserve information, so that's why I'm in favour of such a solution here.