scikit-hep / uproot5

ROOT I/O in pure Python and NumPy.
https://uproot.readthedocs.io
BSD 3-Clause "New" or "Revised" License
239 stars 76 forks source link

Inconsistent Reading of Grouped Branches #1267

Closed SamKelson closed 3 months ago

SamKelson commented 3 months ago

Describe the bug: When attempting to access branches of a TTree in a ROOT file using uproot, an error occurs if grouped branches are included in a set with other branches. Specifically, the arrays method throws an error about grouped branches when attempting to access them alongside other branches. However, the error thrown is inconsistent as there are certain combinations of branches with the grouped branch that do not throw an error.

To Reproduce:

  1. Download public file from cernbox
  2. Use uproot to open the file and access the CollectionTree TTree:
    file = uproot.open("DAOD_PHYSLITE.37233417._000052.pool.root.1")['CollectionTree']
  3. Use the arrays method of TTree to access the following keys provided via set `s:
    
    tracking_keys = set({'GSFTrackParticlesAuxDyn.z0', 'GSFTrackParticlesAuxDyn.numberOfNextToInnermostPixelLayerHits', 'GSFTrackParticlesAuxDyn.definingParametersCovMatrixDiag', 'GSFTrackParticlesAuxDyn.expectNextToInnermostPixelLayerHit', 'GSFTrackParticlesAuxDyn.numberOfSCTHits', 'GSFTrackParticlesAuxDyn.numberOfPixelHits', 'GSFTrackParticlesAuxDyn.phi', 'GSFTrackParticlesAuxDyn.theta', 'GSFTrackParticlesAuxDyn.expectInnermostPixelLayerHit', 'GSFTrackParticlesAuxDyn.qOverP', 'GSFTrackParticlesAuxDyn.numberOfInnermostPixelLayerHits', 'GSFTrackParticlesAuxDyn.d0', 'GSFTrackParticlesAuxDyn.originalTrackParticle', 'GSFTrackParticlesAuxDyn.vz', 'GSFTrackParticlesAuxDyn.definingParametersCovMatrixOffDiag', 'GSFTrackParticlesAuxDyn.chiSquared'})
    electron_keys = set({'AnalysisElectronsAuxDyn.firstEgMotherTruthParticleLink', 'AnalysisElectronsAuxDyn.topoetcone20ptCorrection', 'AnalysisElectronsAuxDyn.ambiguityType', 'AnalysisElectronsAuxDyn.truthType', 'AnalysisElectronsAuxDyn.firstEgMotherTruthOrigin', 'AnalysisElectronsAuxDyn.topoetcone20_CloseByCorr', 'AnalysisElectronsAuxDyn.ptvarcone30_Nonprompt_All_MaxWeightTTVALooseCone_pt1000_CloseByCorr', 'AnalysisElectronsAuxDyn.DFCommonElectronsLHMediumIsEMValue', 'AnalysisElectronsAuxDyn.DFCommonElectronsLHTight', 'AnalysisElectronsAuxDyn.charge', 'AnalysisElectronsAuxDyn.trackParticleLinks', 'AnalysisElectronsAuxDyn.truthOrigin', 'AnalysisElectronsAuxDyn.ptvarcone30_Nonprompt_All_MaxWeightTTVALooseCone_pt500', 'AnalysisElectronsAuxDyn.ambiguityLink/AnalysisElectronsAuxDyn.ambiguityLink.m_persKey', 'AnalysisElectronsAuxDyn.firstEgMotherPdgId', 'AnalysisElectronsAuxDyn.ptcone20_Nonprompt_All_MaxWeightTTVALooseCone_pt1000_CloseByCorr', 'AnalysisElectronsAuxDyn.eta', 'AnalysisElectronsAuxDyn.topoetcone20', 'AnalysisElectronsAuxDyn.DFCommonElectronsLHMedium', 'AnalysisElectronsAuxDyn.DFCommonElectronsLHVeryLooseIsEMValue', 'AnalysisElectronsAuxDyn.phi', 'AnalysisElectronsAuxDyn.firstEgMotherTruthType', 'AnalysisElectronsAuxDyn.f1', 'AnalysisElectronsAuxDyn.DFCommonElectronsLHLooseBL', 'AnalysisElectronsAuxDyn.neflowisol20', 'AnalysisElectronsAuxDyn.DFCommonElectronsLHVeryLoose', 'AnalysisElectronsAuxDyn.DFCommonElectronsLHLooseBLIsEMValue', 'AnalysisElectronsAuxDyn.ptcone20_Nonprompt_All_MaxWeightTTVALooseCone_pt500', 'AnalysisElectronsAuxDyn.ambiguityLink', 'AnalysisElectronsAuxDyn.caloClusterLinks', 'AnalysisElectronsAuxDyn.truthParticleLink', 'AnalysisElectronsAuxDyn.ptcone20_Nonprompt_All_MaxWeightTTVALooseCone_pt1000', 'AnalysisPhotonsAuxDyn.ambiguityLink', 'AnalysisElectronsAuxDyn.DFCommonElectronsLHLooseIsEMValue', 'AnalysisElectronsAuxDyn.truthPdgId', 'AnalysisElectronsAuxDyn.DFCommonElectronsECIDSResult', 'AnalysisElectronsAuxDyn.pt', 'AnalysisElectronsAuxDyn.DFCommonElectronsLHTightIsEMValue', 'AnalysisElectronsAuxDyn.DFCommonElectronsLHLoose', 'AnalysisElectronsAuxDyn.OQ', 'AnalysisElectronsAuxDyn.TruthLink', 'AnalysisMuonsAuxDyn.pt', 'AnalysisElectronsAuxDyn.m', 'AnalysisElectronsAuxDyn.DFCommonElectronsECIDS', 'AnalysisElectronsAuxDyn.author', 'AnalysisElectronsAuxDyn.ptvarcone30_Nonprompt_All_MaxWeightTTVALooseCone_pt1000'})
    combined_keys = set({'GSFTrackParticlesAuxDyn.vz', 'AnalysisElectronsAuxDyn.OQ', 'AnalysisElectronsAuxDyn.DFCommonElectronsLHLooseBLIsEMValue', 'AnalysisElectronsAuxDyn.m', 'GSFTrackParticlesAuxDyn.definingParametersCovMatrixDiag', 'AnalysisElectronsAuxDyn.pt', 'AnalysisElectronsAuxDyn.topoetcone20', 'AnalysisElectronsAuxDyn.firstEgMotherPdgId', 'AnalysisElectronsAuxDyn.topoetcone20_CloseByCorr', 'AnalysisElectronsAuxDyn.neflowisol20', 'AnalysisElectronsAuxDyn.truthParticleLink', 'AnalysisElectronsAuxDyn.DFCommonElectronsLHTight', 'AnalysisElectronsAuxDyn.trackParticleLinks', 'AnalysisElectronsAuxDyn.charge', 'GSFTrackParticlesAuxDyn.z0', 'AnalysisElectronsAuxDyn.DFCommonElectronsLHVeryLoose', 'AnalysisElectronsAuxDyn.DFCommonElectronsLHVeryLooseIsEMValue', 'AnalysisElectronsAuxDyn.topoetcone20ptCorrection', 'GSFTrackParticlesAuxDyn.theta', 'AnalysisElectronsAuxDyn.ambiguityLink/AnalysisElectronsAuxDyn.ambiguityLink.m_persKey', 'GSFTrackParticlesAuxDyn.definingParametersCovMatrixOffDiag', 'AnalysisElectronsAuxDyn.DFCommonElectronsLHMedium', 'AnalysisElectronsAuxDyn.ambiguityType', 'AnalysisElectronsAuxDyn.phi', 'AnalysisElectronsAuxDyn.ptvarcone30_Nonprompt_All_MaxWeightTTVALooseCone_pt500', 'GSFTrackParticlesAuxDyn.numberOfSCTHits', 'AnalysisElectronsAuxDyn.DFCommonElectronsLHMediumIsEMValue', 'AnalysisPhotonsAuxDyn.ambiguityLink', 'AnalysisElectronsAuxDyn.TruthLink', 'AnalysisElectronsAuxDyn.DFCommonElectronsLHTightIsEMValue', 'AnalysisElectronsAuxDyn.DFCommonElectronsLHLooseBL', 'AnalysisElectronsAuxDyn.firstEgMotherTruthOrigin', 'GSFTrackParticlesAuxDyn.phi', 'AnalysisElectronsAuxDyn.DFCommonElectronsLHLooseIsEMValue', 'AnalysisElectronsAuxDyn.firstEgMotherTruthType', 'AnalysisElectronsAuxDyn.DFCommonElectronsECIDSResult', 'GSFTrackParticlesAuxDyn.originalTrackParticle', 'GSFTrackParticlesAuxDyn.expectNextToInnermostPixelLayerHit', 'AnalysisElectronsAuxDyn.ptvarcone30_Nonprompt_All_MaxWeightTTVALooseCone_pt1000', 'GSFTrackParticlesAuxDyn.numberOfPixelHits', 'AnalysisElectronsAuxDyn.ptcone20_Nonprompt_All_MaxWeightTTVALooseCone_pt1000', 'AnalysisElectronsAuxDyn.truthOrigin', 'GSFTrackParticlesAuxDyn.chiSquared', 'GSFTrackParticlesAuxDyn.expectInnermostPixelLayerHit', 'AnalysisElectronsAuxDyn.DFCommonElectronsECIDS', 'AnalysisElectronsAuxDyn.caloClusterLinks', 'AnalysisMuonsAuxDyn.pt', 'AnalysisElectronsAuxDyn.ptcone20_Nonprompt_All_MaxWeightTTVALooseCone_pt1000_CloseByCorr', 'AnalysisElectronsAuxDyn.ambiguityLink', 'AnalysisElectronsAuxDyn.DFCommonElectronsLHLoose', 'GSFTrackParticlesAuxDyn.numberOfNextToInnermostPixelLayerHits', 'AnalysisElectronsAuxDyn.eta', 'AnalysisElectronsAuxDyn.f1', 'GSFTrackParticlesAuxDyn.numberOfInnermostPixelLayerHits', 'AnalysisElectronsAuxDyn.truthType', 'AnalysisElectronsAuxDyn.ptvarcone30_Nonprompt_All_MaxWeightTTVALooseCone_pt1000_CloseByCorr', 'GSFTrackParticlesAuxDyn.qOverP', 'AnalysisElectronsAuxDyn.firstEgMotherTruthParticleLink', 'GSFTrackParticlesAuxDyn.d0', 'AnalysisElectronsAuxDyn.author', 'AnalysisElectronsAuxDyn.ptcone20_Nonprompt_All_MaxWeightTTVALooseCone_pt500', 'AnalysisElectronsAuxDyn.truthPdgId'})

print(file.arrays(tracking_keys)) print(file.arrays(electron_keys)) print(file.arrays(combined_keys))

4. See the following error with some variety as to which branch it fails on, one example is: `grouping branches like 'AnalysisElectronsAuxDyn.TruthLink' should not be read directly; instead read the subbranches` on the final `arrays` call.

**Additional Notes:**
The error is sometimes thrown on different `arrays` calls and is inconsistent which actual call throws the error; adding to the weirdness of this bug.

**Expected Behavior:**
uproot should consitently open branches including grouped branches even when listed with other branches. For instance, opening the above `AnalysisElectronsAuxDyn.TruthLink` branch works fine by itself but not when in a set with other branches.

This works: 

print(file.arrays(['AnalysisElectronsAuxDyn.TruthLink']))


This should also work where `combined_keys` contains other valid keys including the one mentioned above:

print(file.arrays(combined_keys))



**Context:**
This issue was originally noticed while updating [Coffea's](https://github.com/CoffeaTeam/coffea) PHYSLITE schema where uproot's `dask` method is used to open TTree branches using the `arrays` method.

**System info:**
 - OS: Alma 9 Linux
 - Uproot version: 5.3.10

**Adding other people who helped find this issue:**
- @matthewfeickert 
- @nsmith- 
- @kratsg 
matthewfeickert commented 3 months ago

@SamKelson This is an ATLAS open data file, right? Is this one that you downloaded from the CERN open data portal and so have the metadata for available easily? Or was this just a file that you copied from what's been replicated to the UChicago AF (we can still determine the file from the metadata and link to the original open data file on the CERN data portal)?

SamKelson commented 3 months ago

This file was linked to me by @ekourlit so any questions regarding the file itself will have to be directed to him.

jpivarski commented 3 months ago

The issue can be reproduced consistently by selecting one TBranch with uproot.AsGrouped interpretation and attempting to read that with TTree.arrays or TTree.iterate:

tree = uproot.open("uproot-issue-1267.root")["CollectionTree"]
tree.arrays(["AnalysisPhotonsAuxDyn.ambiguityLink"])
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/Users/jpivarski/irishep/uproot5/src/uproot/behaviors/TBranch.py", line 823, in arrays
    _ranges_or_baskets_to_arrays(
  File "/Users/jpivarski/irishep/uproot5/src/uproot/behaviors/TBranch.py", line 3165, in _ranges_or_baskets_to_arrays
    uproot.source.futures.delayed_raise(*obj)
  File "/Users/jpivarski/irishep/uproot5/src/uproot/source/futures.py", line 38, in delayed_raise
    raise exception_value.with_traceback(traceback)
  File "/Users/jpivarski/irishep/uproot5/src/uproot/behaviors/TBranch.py", line 3114, in basket_to_array
    basket_arrays[basket.basket_num] = interpretation.basket_array(
                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jpivarski/irishep/uproot5/src/uproot/interpretation/grouped.py", line 119, in basket_array
    raise ValueError(
ValueError: grouping branches like 'AnalysisPhotonsAuxDyn.ambiguityLink' should not be read directly; instead read the subbranches:

    'AnalysisPhotonsAuxDyn.ambiguityLink.m_persKey', 'AnalysisPhotonsAuxDyn.ambiguityLink.m_persIndex'

in file uproot-issue-1267.root
in object /CollectionTree;1:AnalysisPhotonsAuxDyn.ambiguityLink

However, it had been working in TBranch.array because a correct handling of uproot.AsGrouped had been implemented there, but not applied to TTree.arrays and TTree.iterate.

PR #1269 should fix it.

(Your file is unusual in that TBranches with subbranches, which have uproot.AsGrouped interpretation, usually don't have any data. Usually, a TBranch either has subbranches or has data, but not both. It looks like that data in your case consists of counts of the numbers of items per event, which is redundant because the subbranches also have this information.

The reason that the error message depended on the order in which you attempted to read data was because (1) we are not supposed to read data from these uproot.AsGrouped branches, and it was an Uproot bug because we were, (2) there's an event loop that waits for data to be returned by a physical source, such as disk or the network, and these uproot.AsGrouped branches aren't counted in the number of arrays to read because we're not supposed to read them, and (3) the physical source can supply the TBaskets in any order. In some cases, the uproot.AsGrouped TBaskets were read earlier and hit the error message upon interpretation, and in other cases, they were read later and we got out of the event loop before reaching the error.

Even though this looks like a user-facing error message, telling you what to do, it can only arise out of inconsistency, an inconsistency that was just fixed with the new PR. It was probably written that way before we decided to silently fix uproot.AsGrouped reading rather than complain to users about it.)

Let me know if the PR fixes your issue!

matthewfeickert commented 3 months ago

Let me know if the PR fixes your issue!

LGTM. Thanks @jpivarski!

$ docker run --rm -ti python:3.12 /bin/bash
root@f58ac9ea5722:/# python -m venv venv && . venv/bin/activate
(venv) root@f58ac9ea5722:/# python -m pip --quiet install uv
(venv) root@f58ac9ea5722:/# uv pip install 'git+https://github.com/scikit-hep/uproot5.git@jpivarski/dont-read-AsGrouped-TBranches'
 Updated https://github.com/scikit-hep/uproot5.git (fce7e94)
Resolved 7 packages in 2.49s
   Built uproot @ git+https://github.com/scikit-hep/uproot5.git@fce7e9455e44c0f125a366ca768bcc306f0d5fda
Prepared 6 packages in 5.50s
Installed 7 packages in 22ms
 + awkward==2.6.7
 + awkward-cpp==37
 + cramjam==2.8.3
 + fsspec==2024.6.1
 + numpy==2.0.1
 + packaging==24.1
 + uproot==5.3.11.dev5+gfce7e94 (from git+https://github.com/scikit-hep/uproot5.git@fce7e9455e44c0f125a366ca768bcc306f0d5fda)
(venv) root@f58ac9ea5722:/# curl -sLO https://cernbox.cern.ch/remote.php/dav/public-files/wJGWzAyirlWE6QV/DAOD_PHYSLITE.37233417._000052.pool.root.1
(venv) root@f58ac9ea5722:/# ls -lh DAOD_PHYSLITE.37233417._000052.pool.root.1 
-rw-r--r-- 1 root root 1.1G Aug 12 22:05 DAOD_PHYSLITE.37233417._000052.pool.root.1
(venv) root@f58ac9ea5722:/# python -i
Python 3.12.5 (main, Aug  7 2024, 19:13:43) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import uproot
>>> tree = uproot.open("DAOD_PHYSLITE.37233417._000052.pool.root.1")["CollectionTree"]
>>> tree.arrays(["AnalysisPhotonsAuxDyn.ambiguityLink"])
<Array [{...}, {...}, {...}, ..., {...}, {...}] type='90000 * {"AnalysisPho...'>
>>> tree
<TTree 'CollectionTree' (1343 branches) at 0x76f2647933b0>
>>>