iris-hep / idap-200gbps-atlas

benchmarking throughput with PHYSLITE

Missing branches in some files #36

Closed alexander-held closed 6 months ago

alexander-held commented 6 months ago

Looking at 2015 data, this file

root://192.170.240.147:1094//root://fax.mwt2.org:1094//pnfs/uchicago.edu/atlaslocalgroupdisk/rucio/data15_13TeV/90/62/DAOD_PHYSLITE.37001626._000001.pool.root.1

does not have electron branches (at least none we can read with uproot). It comes from data15_13TeV:data15_13TeV.periodAllYear.physics_Main.PhysCont.DAOD_PHYSLITE.grp15_v01_p6026. Other files in that container seem fine. I have not tested this systematically across all files in all containers. This causes errors we cannot catch as easily, so we need to understand what is happening and whether this file is somehow broken.

gordonwatts commented 6 months ago

I will run this dataset through SX when it comes back and look at a few electron quantities (like $p_T$, $\eta$, and $\phi$). If there are other branches I should look at, let me know.

alexander-held commented 6 months ago

Same goes for muons: I don't see muon pT either. This file has 18571 events, so it would be quite strange if none of them had light leptons without a systematic explanation.

alexander-held commented 6 months ago

@gordonwatts another thing that would be good to look at is reading AnalysisJetsAuxDyn.SumPtTrkPt500 in some specific files. Details in https://its.cern.ch/jira/projects/ATLASDPD/issues/ATLASDPD-2075 (ATLAS-internal link I think).

alexander-held commented 6 months ago

Another issue:

mc20_13TeV:mc20_13TeV.363359.Sherpa_221_NNPDF30NNLO_WpqqWmlv.deriv.DAOD_PHYSLITE.e5583_s3681_r13167_p5855

seems to be missing AnalysisElectronsAuxDyn.DFCommonElectronsECIDSResult. This may be due to the p-tag (a few files, like this one, did not exist in p6026 yet). We can veto those files or this branch if needed. Results in

KeyInFileError: not found: 'AnalysisElectronsAuxDyn.DFCommonElectronsECIDSResult'
in file root://192.170.240.147:1094//root://fax.mwt2.org:1094//pnfs/uchicago.edu/atlaslocalgroupdisk/rucio/mc20_13TeV/95/e1/DAOD_PHYSLITE.34860880._000002.pool.root.1
in object /CollectionTree;1

which I assume we can catch as uproot.KeyInFileError. cc @ivukotic
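A minimal sketch of the catch-and-veto idea, assuming `tree` behaves like an uproot `TTree` (indexing with a missing branch name raises `uproot.KeyInFileError`). `KeyInFileError` derives from `KeyError`, so the sketch catches `KeyError` and runs without uproot installed. `read_available` and the stand-in tree are hypothetical names for illustration, and the array values are made up.

```python
def read_available(tree, branch_names):
    """Read the requested branches, skipping any that are not in the file.

    `tree[name]` must return an object with `.array()`, or raise KeyError
    (which covers uproot.KeyInFileError) when the branch does not exist.
    """
    arrays, missing = {}, []
    for name in branch_names:
        try:
            arrays[name] = tree[name].array()
        except KeyError:
            # Branch is absent in this file: record it instead of failing.
            missing.append(name)
    return arrays, missing


class FakeBranch:
    """Stand-in for an uproot branch so the example runs without a ROOT file."""

    def __init__(self, values):
        self._values = values

    def array(self):
        return self._values


# A plain dict raises KeyError for missing keys, like the real tree would.
fake_tree = {"AnalysisJetsAuxDyn.SumPtTrkPt500": FakeBranch([[1.0, 2.0], []])}

arrays, missing = read_available(
    fake_tree,
    ["AnalysisJetsAuxDyn.SumPtTrkPt500", "AnalysisJetsAuxDyn.Timing"],
)
print("missing branches:", missing)
```

The caller can then decide whether to veto the whole file or only the missing branch, depending on what `missing` contains.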

gordonwatts commented 6 months ago

@alexander-held - could you make sure data15_13TeV-data15_13TeV.periodAllYear.physics_Main.PhysCont.DAOD_PHYSLITE.grp15_v01_p6026 exists?

[gwatts@AMDOfficeCore ~]$ rucio list-dids data15_13TeV:data15_13TeV-data15_13TeV.periodAllYear.physics_Main.PhysCont.DAOD_PHYSLITE.grp15_v01_p6026
+--------------+--------------+
| SCOPE:NAME   | [DID TYPE]   |
|--------------+--------------|
+--------------+--------------+

alexander-held commented 6 months ago

This was a typo in the original issue, it should be

data15_13TeV:data15_13TeV.periodAllYear.physics_Main.PhysCont.DAOD_PHYSLITE.grp15_v01_p6026

(note the colon instead of dash). I fixed it now.

alexander-held commented 6 months ago

Another strange new missing branch I ran into for data15:

root://192.170.240.145//root://fax.mwt2.org:1094//pnfs/uchicago.edu/atlaslocalgroupdisk/rucio/data15_13TeV/7e/11/DAOD_PHYSLITE.37001656._000010.pool.root.1 failed in 1.25 s
Traceback (most recent call last):
  File "/tmp/ipykernel_23927/1374035382.py", line 79, in uproot_open_materialize
  File "/venv/lib/python3.9/site-packages/uproot/behaviors/TBranch.py", line 1627, in __getitem__
    raise uproot.KeyInFileError(
uproot.exceptions.KeyInFileError: not found: 'AnalysisJetsAuxDyn.Timing'

    Available keys: 'AnalysisJetsAux.', 'AnalysisTauJetsAux.', 'AnalysisJets', 'AnalysisLargeRJetsAux.', 'AnalysisElectronsAux.', 'AnalysisMuonsAux.', 'AnalysisPhotonsAux.', 'AnalysisTauJets', 'AnalysisMuons', 'AnalysisLargeRJets'...

in file root://192.170.240.145//root://fax.mwt2.org:1094//pnfs/uchicago.edu/atlaslocalgroupdisk/rucio/data15_13TeV/7e/11/DAOD_PHYSLITE.37001656._000010.pool.root.1
in object /CollectionTree;1

Full list of files where I saw this:

root://192.170.240.145//root://fax.mwt2.org:1094//pnfs/uchicago.edu/atlaslocalgroupdisk/rucio/data15_13TeV/7e/11/DAOD_PHYSLITE.37001656._000010.pool.root.1 failed in 1.25 s

root://192.170.240.141//root://fax.mwt2.org:1094//pnfs/uchicago.edu/atlaslocalgroupdisk/rucio/data15_13TeV/05/2a/DAOD_PHYSLITE.37001656._000011.pool.root.1 failed in 1.57 s

root://192.170.240.146//root://fax.mwt2.org:1094//pnfs/uchicago.edu/atlaslocalgroupdisk/rucio/data15_13TeV/84/e0/DAOD_PHYSLITE.37001656._000012.pool.root.1 failed in 1.00 s

root://192.170.240.143//root://fax.mwt2.org:1094//pnfs/uchicago.edu/atlaslocalgroupdisk/rucio/data15_13TeV/f5/48/DAOD_PHYSLITE.37001656._000013.pool.root.1 failed in 1.60 s

root://192.170.240.144//root://fax.mwt2.org:1094//pnfs/uchicago.edu/atlaslocalgroupdisk/rucio/data15_13TeV/1a/ce/DAOD_PHYSLITE.37001656._000014.pool.root.1 failed in 0.81 s

root://192.170.240.146//root://fax.mwt2.org:1094//pnfs/uchicago.edu/atlaslocalgroupdisk/rucio/data15_13TeV/6d/75/DAOD_PHYSLITE.37001656._000015.pool.root.1 failed in 0.67 s

root://192.170.240.142//root://fax.mwt2.org:1094//pnfs/uchicago.edu/atlaslocalgroupdisk/rucio/data15_13TeV/85/2c/DAOD_PHYSLITE.37001656._000016.pool.root.1 failed in 1.46 s

root://192.170.240.142//root://fax.mwt2.org:1094//pnfs/uchicago.edu/atlaslocalgroupdisk/rucio/data15_13TeV/2f/e9/DAOD_PHYSLITE.37001656._000017.pool.root.1 failed in 1.56 s

root://192.170.240.146//root://fax.mwt2.org:1094//pnfs/uchicago.edu/atlaslocalgroupdisk/rucio/data15_13TeV/52/48/DAOD_PHYSLITE.37001656._000018.pool.root.1 failed in 1.08 s

root://192.170.240.145//root://fax.mwt2.org:1094//pnfs/uchicago.edu/atlaslocalgroupdisk/rucio/data15_13TeV/3c/10/DAOD_PHYSLITE.37001656._000019.pool.root.1 failed in 0.92 s

root://192.170.240.144//root://fax.mwt2.org:1094//pnfs/uchicago.edu/atlaslocalgroupdisk/rucio/data15_13TeV/4d/1a/DAOD_PHYSLITE.37001656._000020.pool.root.1 failed in 1.22 s

root://192.170.240.142//root://fax.mwt2.org:1094//pnfs/uchicago.edu/atlaslocalgroupdisk/rucio/data15_13TeV/1e/63/DAOD_PHYSLITE.37001656._000021.pool.root.1 failed in 1.23 s

root://192.170.240.142//root://fax.mwt2.org:1094//pnfs/uchicago.edu/atlaslocalgroupdisk/rucio/data15_13TeV/b3/49/DAOD_PHYSLITE.37001656._000022.pool.root.1 failed in 1.48 s

root://192.170.240.144//root://fax.mwt2.org:1094//pnfs/uchicago.edu/atlaslocalgroupdisk/rucio/data15_13TeV/fc/e0/DAOD_PHYSLITE.37001656._000023.pool.root.1 failed in 1.18 s

root://192.170.240.142//root://fax.mwt2.org:1094//pnfs/uchicago.edu/atlaslocalgroupdisk/rucio/data15_13TeV/2d/e4/DAOD_PHYSLITE.37001656._000024.pool.root.1 failed in 1.18 s

root://192.170.240.148//root://fax.mwt2.org:1094//pnfs/uchicago.edu/atlasdatadisk/rucio/data15_13TeV/91/fd/DAOD_PHYSLITE.37001686._000001.pool.root.1 failed in 1.50 s

root://192.170.240.148//root://fax.mwt2.org:1094//pnfs/uchicago.edu/atlaslocalgroupdisk/rucio/data15_13TeV/2e/5b/DAOD_PHYSLITE.37001876._000001.pool.root.1 failed in 0.83 s

root://192.170.240.141//root://fax.mwt2.org:1094//pnfs/uchicago.edu/atlaslocalgroupdisk/rucio/data15_13TeV/2b/67/DAOD_PHYSLITE.37001876._000002.pool.root.1 failed in 2.73 s
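
To see whether the failures cluster, the list above can be grouped by the number that follows `DAOD_PHYSLITE.` in each file name (I assume this is the production task ID). This is only a sketch: `failures_per_task` is a hypothetical helper, and the sample lines are copied from the list above.

```python
import re
from collections import Counter

# Matches "DAOD_PHYSLITE.<number>._<file index>.pool.root" in a file URL.
FILE_RE = re.compile(r"DAOD_PHYSLITE\.(\d+)\._(\d+)\.pool\.root")


def failures_per_task(log_lines):
    """Count failed files per number following 'DAOD_PHYSLITE.'."""
    counts = Counter()
    for line in log_lines:
        match = FILE_RE.search(line)
        if match and "failed" in line:
            counts[match.group(1)] += 1
    return counts


sample = [
    "root://192.170.240.145//root://fax.mwt2.org:1094//pnfs/uchicago.edu/atlaslocalgroupdisk/rucio/data15_13TeV/7e/11/DAOD_PHYSLITE.37001656._000010.pool.root.1 failed in 1.25 s",
    "root://192.170.240.141//root://fax.mwt2.org:1094//pnfs/uchicago.edu/atlaslocalgroupdisk/rucio/data15_13TeV/05/2a/DAOD_PHYSLITE.37001656._000011.pool.root.1 failed in 1.57 s",
    "root://192.170.240.148//root://fax.mwt2.org:1094//pnfs/uchicago.edu/atlasdatadisk/rucio/data15_13TeV/91/fd/DAOD_PHYSLITE.37001686._000001.pool.root.1 failed in 1.50 s",
]

counts = failures_per_task(sample)
for task, n_failed in sorted(counts.items()):
    print(task, n_failed)
```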

gordonwatts commented 6 months ago

@alexander-held - this worked fine using EventLoop. So this must be a feature of xAOD. I went after electron $p_T, \eta, \phi, m$ - and I didn't do specific files, but just did the full 10,000 files in the dataset.

This is painful: another "issue", I suppose, where things work for EventLoop but not here. It might be worth asking inside the I/O group. Let me know if you want to do that or if I should.

And we can close this once this bit of the loop is closed.

gordonwatts commented 6 months ago

Shifted this to week 6 because this is no longer a code issue but more of a "what is going on here" issue (and marked it as documentation too). Once we have an actual understanding, add something to #13.

gordonwatts commented 6 months ago

ServiceX is not affected by this.

gordonwatts commented 6 months ago

From some emails with ATLAS I/O experts:

I downloaded and took a quick look at this file. It seems to contain a bit more than 18k events from Run 266904 and LBs in [3-96], which according to RunQuery is from when we didn't have stable beams. Therefore, it is perhaps not shocking that it doesn't contain very interesting data.

Without going into too much detail and writing a wall of text, let me say that - albeit empty - electron containers are there:

$ root -l data15_13TeV/DAOD_PHYSLITE.37001626._000001.pool.root.1 
root [0] 
Attaching file data15_13TeV/DAOD_PHYSLITE.37001626._000001.pool.root.1 as _file0...
(TFile *) 0x4b9f300
root [1] CollectionTree->Print("*Electron*")
******************************************************************************
*Tree    :CollectionTree: CollectionTree                                         *
*Entries :    18571 : Total =       386706863 bytes  File  Size =   25546596 *
*        :          : Tree compression factor =  15.19                       *
******************************************************************************
*Br    0 :AnalysisElectronsAux. : xAOD::AuxContainerBase                     *
*Entries :    18571 : Total  Size=    1008247 bytes  File Size  =      52530 *
*Baskets :       40 : Basket Size=      47104 bytes  Compression=  19.17     *
*............................................................................*
*Br    1 :AnalysisSiHitElectronsAux. : xAOD::AuxContainerBase                *
*Entries :    18571 : Total  Size=    1008467 bytes  File Size  =      52916 *
*Baskets :       40 : Basket Size=      47104 bytes  Compression=  19.04     *
*............................................................................*
*Br    2 :AnalysisElectrons : DataVector<xAOD::Electron_v1>                  *
*Entries :    18571 : Total  Size=     377999 bytes  File Size  =      42929 *
*Baskets :       51 : Basket Size=      20480 bytes  Compression=   8.77     *
*............................................................................*
*Br    3 :AnalysisSiHitElectrons : DataVector<xAOD::Electron_v1>             *
*Entries :    18571 : Total  Size=     378274 bytes  File Size  =      43110 *
*Baskets :       51 : Basket Size=      20480 bytes  Compression=   8.74     *
*............................................................................*

Note that in PHYS(LITE) we make the conscious choice to write certain xAOD containers through the base class instead of the derived class (note above that the AnalysisElectronsAux is not of type xAOD::ElectronAuxContainer but of xAOD::AuxContainerBase). This allows us to treat the variables that are static for the derived class as dynamic when we write, so they get their own branches. In this case, there is no object to write, so you only end up w/ the empty base containers and nothing else. The framework/IO system is smart enough to understand this. Therefore, it's not really an issue when you read the file.

Best,
Serhan

and

My general expectation would be that if you have a file with no events, all kinds of things can be missing, maybe even the entire CollectionTree. Should you have a file with no electrons (presumably a short file), I would expect that the static electron branches are there, but none of the dynamic branches. So I'd expect e.g. the electron pt to be there, but the various selection flags not to be. Though maybe with PHYSLITE all variables are dynamic and we don't see anything.

Essentially the issue is that for all the dynamic variables we are looking at the event store to know what's there.  So if we have no electrons we never create any dynamic electron variables, and they correspondingly can't be found in the event store.  However, we still put the empty containers into the event store, so I would expect the static variables to be there.  You could even have situations in which a given decoration never gets written (e.g. all objects fail cuts) and it is then simply missing, even if other variables are there.  In general we try to avoid that though (i.e. fill decorations even for objects that fail selection).

The reason we are not running up against issues more often is in part that we won't try to read branches unless we have valid objects, and in part that whenever we run into issues of this sort we put in place a fix that lets us survive it.  E.g. there is some code in EventLoop that treats a file with no tree as if it had a tree with no events.  IIRC there is also some special code on the output side that if on the 10th event we realize that there is now a decoration that didn't exist before it backfills the corresponding branch for previous events…

Personally I am not the biggest fan of this whole design, given that by the end of initialization we ought to know what all variables are, even if we haven't processed any events/objects yet.  Certainly by the end of the first event it should all be known.  However, we don't track the decorations we are going to create inside Athena (except for some niche cases), so the only option there is to scan what's in the event store.  For something like the CP algorithms a lot of that information would be there, but it's not connected to the xAOD output code.

Cheers,
Nils
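
Following the two explanations above, one way to tell "container written but empty" apart from "container not there at all" is to look only at the branch names in CollectionTree. A minimal sketch, assuming the naming convention seen in this thread (`<container>`, `<container>Aux.`, `<container>AuxDyn.<variable>`); `container_state` is a hypothetical helper, and the key list is a hand-picked subset of names quoted in this issue, not the content of a real file. With uproot, the names could come from `tree.keys()`.

```python
def container_state(keys, container):
    """Classify a container from the branch names in CollectionTree.

    Returns "has_dynamic" if at least one <container>AuxDyn.* branch exists,
    "empty" if only the base branches (<container> or <container>Aux.) exist,
    and "absent" if none of them are present.
    """
    keys = set(keys)
    if any(key.startswith(container + "AuxDyn.") for key in keys):
        return "has_dynamic"
    if container in keys or container + "Aux." in keys:
        return "empty"
    return "absent"


# Illustrative key list built from branch names mentioned in this thread.
keys = [
    "AnalysisElectronsAux.",
    "AnalysisElectrons",
    "AnalysisJetsAux.",
    "AnalysisJets",
    "AnalysisJetsAuxDyn.SumPtTrkPt500",
]

for name in ["AnalysisElectrons", "AnalysisJets", "AnalysisPhotons"]:
    print(name, container_state(keys, name))
```

Note that "has_dynamic" only means some dynamic variable was written. As described above, an individual decoration can still be missing, so a per-branch check is needed as well.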