scikit-hep / awkward

Manipulate JSON-like data with NumPy-like idioms.
https://awkward-array.org
BSD 3-Clause "New" or "Revised" License
845 stars 89 forks source link

Can't concatenate arrays of records with no fields #3258

Open jpivarski opened 1 month ago

jpivarski commented 1 month ago

Version of Awkward Array

HEAD

Description and code to reproduce

This line:

https://github.com/scikit-hep/awkward/blob/e946646c6315ff9bde0c8a2255ee6f7fc6a31539/src/awkward/_broadcasting.py#L471

doesn't handle the possibility that the RecordArray has no fields (len(contents) == 0).

Very, very few functions actually broadcast through fields; the ufuncs raise an error if they haven't been overloaded (and if they have been overloaded, the use the overload instead), ak.broadcast_arrays stops at records by a deliberate decision:

https://github.com/scikit-hep/awkward/blob/e946646c6315ff9bde0c8a2255ee6f7fc6a31539/src/awkward/operations/ak_broadcast_arrays.py#L240

In our test suite, the only examples that reach the record-broadcasting code (broadcast_any_record, past the part that errors-out when options["allow_records"] is false) are ak.transform and ak.concatenate.

I tried to trigger this error and got a different one:

ak.concatenate([ak.Array([{}, {}, {}]), ak.Array([{}, {}])], axis=0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jpivarski/irishep/awkward/src/awkward/_dispatch.py", line 64, in dispatch
    next(gen_or_result)
  File "/home/jpivarski/irishep/awkward/src/awkward/operations/ak_concatenate.py", line 64, in concatenate
    return _impl(arrays, axis, mergebool, highlevel, behavior, attrs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jpivarski/irishep/awkward/src/awkward/operations/ak_concatenate.py", line 160, in _impl
    contents = [ak._do.mergemany(b) for b in batches]
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jpivarski/irishep/awkward/src/awkward/operations/ak_concatenate.py", line 160, in <listcomp>
    contents = [ak._do.mergemany(b) for b in batches]
                ^^^^^^^^^^^^^^^^^^^
  File "/home/jpivarski/irishep/awkward/src/awkward/_do.py", line 218, in mergemany
    return contents[0]._mergemany(contents[1:])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jpivarski/irishep/awkward/src/awkward/contents/recordarray.py", line 760, in _mergemany
    next = RecordArray(
           ^^^^^^^^^^^^
  File "/home/jpivarski/irishep/awkward/src/awkward/contents/recordarray.py", line 162, in __init__
    raise TypeError(
TypeError: RecordArray 'length' must be a non-negative integer or None, not awkward._util.UNSET

This error occurred while calling

    ak.concatenate(
        [<Array [{}, {}, {}] type='3 * {}'>, <Array [{}, {}] type='2 * {}'>]
        axis = 0
    )

It's not a high priority because it's such a weird case, but this really should result in

ak.Array([{}, {}, {}, {}, {}], type='5 * {}'>
agoose77 commented 1 month ago

The line linked above should never fail; it finds the first record array content in the broadcasted contents?

jpivarski commented 1 month ago

Ah, that's right! I thought that contents was the contents of one of the RecordArrays:

>>> no_fields_record = ak.Array([{}, {}, {}, {}, {}])
>>> no_fields_record.layout
<RecordArray is_tuple='false' len='5'>
</RecordArray>
>>> no_fields_record.layout.contents
[]
>>> next(c for c in no_fields_record.layout.contents)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration

But this contents is the subset of inputs that are Content instances (which enters this function as a closure):

https://github.com/scikit-hep/awkward/blob/e946646c6315ff9bde0c8a2255ee6f7fc6a31539/src/awkward/_broadcasting.py#L429

Okay, this line should not fail. But in trying to make this line fail, something else went wrong in recordarray.py line 760 _mergemany.

So you're right, it was not the error I was looking for, and through some coincidence, I found a different error.

No-field records are not sufficiently tested, perhaps.