scikit-hep / awkward-0.x

Manipulate arrays of complex data structures as easily as Numpy.
BSD 3-Clause "New" or "Revised" License
215 stars 39 forks source link

Incorrect HDF5 serialization of JaggedArray with starts and stops #210

Closed jrueb closed 4 years ago

jrueb commented 4 years ago

See the following code

a = awkward.JaggedArray(starts=[2], stops=[4], content=[1., 2., 3., 4.])
with h5py.File("test.hdf5", "w") as f:
    out = awkward.hdf5(f)
    out["a"] = a
    print("{}, starts={} stops={} content={}".format(a, a.starts, a.stops, a.content))
    print("{}, starts={} stops={} content={}".format(out["a"], out["a"].starts, out["a"].stops, out["a"].content))
with h5py.File("test.hdf5", "r") as f:
    out = awkward.hdf5(f)
    print("{}, starts={} stops={} content={}".format(out["a"], out["a"].starts, out["a"].stops, out["a"].content))

I would expect the three output lines to be identical. However the last two, the ones that print out["a"], are different from the first. It seems the JaggedArray defined by starts and stops is incorrectly serialized, thus out gives a different JaggedArray.

jpivarski commented 4 years ago

It's not supposed to return exactly the same structure, but it is supposed to return an equivalent awkward array—something that would return the same thing from a.tolist():

>>> a.tolist()
[[3.0, 4.0]]
>>> out["a"].tolist()
[[3.0, 4.0]]

>>> a.starts, a.stops, a.content
(array([2]), array([4]), array([1., 2., 3., 4.]))
>>> out["a"].starts, out["a"].stops, out["a"].content
(array([0]), array([2]), array([3., 4.]))

However, this was broken, and I put a fix into PR #211. (We can pack better by saving counts, rather than starts and stops, but if we do so, we can't save the whole content. If canuseoffset is True, then it would be content[starts[0]:stops[-1]], but that would require a special case for empty arrays in addition to the canuseoffset check. The fix I put in is simpler: replacing content with flatten() will always work.)

jpivarski commented 4 years ago

Turns out, I have to do the more complex test because of an infinite recursion case.

jrueb commented 4 years ago

It's not supposed to return exactly the same structure, but it is supposed to return an equivalent awkward array—something that would return the same thing from a.tolist():

Yes, I understand. What I should have written is that the __str__() output should be identical.

Anyway, thanks for the quick fix!

jpivarski commented 4 years ago

For small, shallow arrays, __str__() would be a good test, but __str__() suppresses any lists or sublists with more than 6 elements (replacing them with ...). In the unit tests, tolist() is the standard for correctness, which gives me flexibility to change the internal structure as needed.

By the way, this is the motivation for hiding structure classes like JaggedArray inside of a single awkward.Array class in Awkward 1.0. I found during the tutorials that the fact that multiple combinations of starts, stops, and content correspond to a single logical array was confusing to a lot of people. It's still a good way to organize things, but should be hidden.

Thanks for the bug-report!