scikit-hep / uproot5

ROOT I/O in pure Python and NumPy.
https://uproot.readthedocs.io
BSD 3-Clause "New" or "Revised" License
239 stars 76 forks source link

fix: apply `TBranch.array`'s veto of `AsGrouped` from `ranges_or_baskets` to `HasBranches.arrays` and `HasBranches.iterate` #1269

Closed jpivarski closed 3 months ago

jpivarski commented 3 months ago

Very rarely, TBranches with subbranches contain data, usually (unnecessary/ignorable) counts of the subbranches' data per entry. The interpretation for any TBranch with subbranches is uproot.AsGrouped, which ignores any of the TBranch's own data and makes it a RecordArray of its subbranches instead.

This is correctly accounted for in the arrays (dict of outputs) by assigning the corresponding value to None (so that this event loop won't wait for any data to be read).

In the usual case, the TBranch with subbranches does not have any data, so no ranges are added to ranges_or_baskets. (branch.entries_to_ranges_or_baskets returns an empty list.) However, if the TBranch with subbranches has data, then it will be added to ranges_or_baskets for reading. Whether it actually gets read or not depends on the order that the physical disk or network returns TBasket data (since this event loop stops watching after it receives the number of arrays it is expecting, and arrays has a None for uproot.AsGrouped arrays, so they don't count against the expected number).

1267 encountered hit this issue because the ROOT file does have TBranches with subbranches that have TBaskets (the unusual case). The error message was probably intended to be internal, or it was used to be user-facing but was leftover from an era before the policy was to ignore any data in uproot.AsGrouped, rather than report it as an error to the user. The code that intends to ignore the data can be found in TBranch.array:

https://github.com/scikit-hep/uproot5/blob/80e780347c2dc40c7004a41c22dfc743ba33f008/src/uproot/behaviors/TBranch.py#L1807-L1816

(i.e. skip any TBranches with uproot.AsGrouped interpretation, regardless of whether they might contain data), but the equivalent hadn't been applied to HasBranches.arrays or HasBranches.iterate:

https://github.com/scikit-hep/uproot5/blob/80e780347c2dc40c7004a41c22dfc743ba33f008/src/uproot/behaviors/TBranch.py#L810-L820

https://github.com/scikit-hep/uproot5/blob/80e780347c2dc40c7004a41c22dfc743ba33f008/src/uproot/behaviors/TBranch.py#L1036-L1056

(Note that the if statement skips cached data but not uproot.AsGrouped.)

This PR makes all three array-fetching functions consistent in their exclusion of uproot.AsGrouped interpretations from ranges_or_baskets. (Other array-fetching functions call these, rather than preparing their own ranges_or_baskets.)

jpivarski commented 3 months ago

(More reason for "us" (ATLAS people) to get a small PHYSLITE file in scikit-hep-testdata.)

(I found out just how slow my hotel wifi was by downloading that file...)