scikit-hep / uproot3

ROOT I/O in pure Python and NumPy.
BSD 3-Clause "New" or "Revised" License

Reading CMS open data #496

Closed: pierinim closed this issue 4 years ago

pierinim commented 4 years ago

I am trying to use uproot to read CMS open data. Everything works OK, except when I try to access the auxiliary information:

import uproot as ur

f = ur.open("root://eospublic.cern.ch//eos/opendata/cms/Run2012B/DoubleMuParked/AOD/22Jan2013-v1/20000/0A9D2B29-9067-E211-842B-0025905280BE.root")
evt = f.get("Events")
d = evt.get("EventAuxiliary")

print(d.keys())
[]

I am using Python 2.7.15+ from within CMSSW_10_6_2.

Any idea of what I might be doing wrong?

jpivarski commented 4 years ago

Isn't "EventAuxiliary" a TTree in the file, rather than a branch in the "Events" TTree? In that case, shouldn't you be getting it from f["EventAuxiliary"]?

pierinim commented 4 years ago

Not sure about more recent versions, but back then, it was under Events. At least, this is what I see

f = ur.open("root://eospublic.cern.ch//eos/opendata/cms/Run2012B/DoubleMuParked/AOD/22Jan2013-v1/20000/0A9D2B29-9067-E211-842B-0025905280BE.root")
print(f.keys())

returns

['MetaData;1', 'ParameterSets;1', 'Parentage;1', 'Events;1', 'LuminosityBlocks;1', 'Runs;1']

EventAuxiliary is under Events. This is what .show() returns

EventAuxiliary TStreamerInfo astable(asdtype("[('run', '>u4'), ('luminosityBlock', '>u4'), ('event', '>u4'), ('timeLow', '>u4'), ('timeHigh_', '>u4'), ('luminosityBlock2', '>u4'), ('isRealData', '?'), ('experimentType', '>i4'), ('bunchCrossing', '>i4'), ('orbitNumber', '>i4'), ('storeNumber', '>i4')]"))

I don't understand how to get from here to reading the info I need. I guess this has to do with interpreting the various items in the right way, but I am clearly failing at that.

pierinim commented 4 years ago

I was basically trying to follow what you suggested on "Reading custom classes #124" but it is failing and I guess this is because I don't really know what to put here

'isRealData_', '?'),

tamasgal commented 4 years ago

The ? means that isRealData_ is a boolean field; you don't have to put anything there.
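To picture what those type codes mean, here is a minimal sketch that packs and unpacks one record laid out like the astable(...) dtype quoted earlier in the thread. It uses Python's standard struct module rather than Uproot or NumPy, the sample values are made up, and the real on-disk object may contain more than these fields, so treat it only as an illustration of the layout.

```python
import struct

# Field names and type codes copied from the astable(...) dtype quoted in this
# thread: '>u4' is a big-endian unsigned 32-bit int ("I" in struct), '>i4' a
# big-endian signed 32-bit int ("i"), and '?' a one-byte boolean.
FIELDS = [
    ("run", "I"), ("luminosityBlock", "I"), ("event", "I"),
    ("timeLow", "I"), ("timeHigh_", "I"), ("luminosityBlock2", "I"),
    ("isRealData", "?"),
    ("experimentType", "i"), ("bunchCrossing", "i"),
    ("orbitNumber", "i"), ("storeNumber", "i"),
]

# ">" selects big-endian byte order with no padding between fields.
RECORD = struct.Struct(">" + "".join(code for _, code in FIELDS))


def decode(raw):
    """Turn one fixed-width record into a dict keyed by field name."""
    values = RECORD.unpack(raw)
    return dict(zip((name for name, _ in FIELDS), values))


# Made-up sample values, for illustration only.
sample = RECORD.pack(194050, 12, 3456789, 0, 0, 12, True, 1, 208, 98765, 2630)
record = decode(sample)

print(RECORD.size)           # bytes per record
print(record["isRealData"])  # the '?' field comes back as a Python bool
```

The boolean needs no special handling: it is simply a one-byte field that decodes to True or False.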

Have you tried to simply read an array (or lazyarray) of that branch?

Based on your code snippet above, you can try this:

>>> f = ur.open("root://eospublic.cern.ch//eos/opendata/cms/Run2012B/DoubleMuParked/AOD/22Jan2013-v1/20000/0A9D2B29-9067-E211-842B-0025905280BE.root")
>>> evt = f.get("Events")
>>> d = evt.get("EventAuxiliary")
>>> d.lazyarray()

jpivarski commented 4 years ago

A lazy array doesn't tell you whether the read is going to succeed until you try to get an element from it.
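The general idea of deferred reading can be sketched in a few lines. LazyChunk and failing_loader below are hypothetical names, not Uproot's classes, and the sketch raises an exception on failure purely for simplicity; it only shows why constructing a lazy object says nothing about whether the data behind it is readable.

```python
class LazyChunk(object):
    """Hypothetical stand-in for a lazily loaded chunk; not Uproot's class."""

    def __init__(self, loader):
        self._loader = loader   # function that performs the real read
        self._data = None       # nothing is read at construction time

    def __getitem__(self, index):
        if self._data is None:
            self._data = self._loader()   # the first access triggers the read
        return self._data[index]


def failing_loader():
    # Assumed failure mode, standing in for a branch that cannot be interpreted.
    raise ValueError("could not interpret branch")


lazy = LazyChunk(failing_loader)   # succeeds: no data has been touched yet

try:
    lazy[0]                        # the read, and the failure, happen here
    outcome = "read succeeded"
except ValueError as err:
    outcome = "read failed: {}".format(err)

print(outcome)
```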

tamasgal commented 4 years ago

I just tried it; at least it shows that edm::EventAuxiliary failed to be read:

>>> f['Events/EventAuxiliary'].lazyarray()
<ChunkedArray [<Undefined (failed to read 'edm_3a3a_EventAuxiliary' version 10) at 0x0002262413d0> <Undefined (failed to read 'edm_3a3a_EventAuxiliary' version 10) at 0x000226241350> <Undefined (failed to read 'edm_3a3a_EventAuxiliary' version 10) at 0x000226241410> ... <Undefined (failed to read 'edm_3a3a_EventAuxiliary' version 10) at 0x000226241410> <Undefined (failed to read 'edm_3a3a_EventAuxiliary' version 10) at 0x000226241350> <Undefined (failed to read 'edm_3a3a_EventAuxiliary' version 10) at 0x000226241410>] at 0x00021829c450>

But I see a completely different streamer info, so I am not sure what @pierinim was looking at:

>>> f['Events/EventAuxiliary'].show()
EventAuxiliary             TStreamerInfo              asgenobj(edm_3a3a_EventAuxiliary)

jpivarski commented 4 years ago

Okay, I got a chance to look into it. There are a few things I can point out, though ultimately I don't have a solution for reading it.

  1. evt = f.get("Events") makes evt a TTree and d = evt.get("EventAuxiliary") makes d a TBranch. (I was mistaken about the structure of CMS EDM files; "EventAuxiliary" is indeed a TBranch under the "Events" TTree.) But then d.keys() == [] is fine; it just means there are no TBranches nested within this TBranch. In the file you're looking at, the EventAuxiliary is not split (see this recent tutorial). That means that all of the fields of this object are serialized together in this TBranch, instead of being distributed among nested TBranches. If any one of those fields is not recognized by Uproot, the whole thing will be unreadable. If it's possible to re-make the file with splitting turned on, doing so would have numerous benefits, including, possibly, the ability for Uproot to read it.
  2. If your Uproot is showing the EventAuxiliary interpretation (i.e. d.show() or d.interpretation) as astable(...) instead of asgenobj(edm_3a3a_EventAuxiliary) then you might have an old version of Uproot that is making a mistaken interpretation. (The current uproot.__version__ is 3.11.7.) I don't know how up-to-date the CMSSW Uproot versions are kept, but in the past, they've been way out of date.
  3. Looking at it with the latest Uproot, I see that there's a string in EventAuxiliary (a field named processGUID) that prevents reading it as a table. It's a pity because astable(...) is much faster than asgenobj(...), but astable(...) is only possible for data without variable-width things in it, like a string. (Incidentally, splitting the object also gains back speed, since accessing the numeric sub-TBranches can be vectorized like astable(...) and unlike asgenobj(...).)
  4. In addition to that, something inside of the EventAuxiliary is preventing it from being interpreted. Here's how I check that:
>>> # read as little as possible (entrystop=1) and get the first entry
>>> d.array(entrystop=1)[0]
<Undefined (failed to read 'edm_3a3a_EventAuxiliary' version 10) at 0x7a7d31cec7d0>
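
Point 3 of the list above (fixed-width data can be read as a table, data containing a string cannot) can be illustrated without Uproot. The sketch below uses Python's struct module and two made-up byte layouts; the length-prefixed string is a simplification and is not meant to match ROOT's actual string encoding.

```python
import struct

# Fixed-width records: two big-endian int32 fields, 8 bytes per record.
FIXED = struct.Struct(">ii")
fixed_buf = b"".join(FIXED.pack(i, i * 10) for i in range(4))


def read_fixed(buf, i):
    """Record i starts at i * size, so any entry can be read directly."""
    return FIXED.unpack_from(buf, i * FIXED.size)


def pack_var(number, text):
    """Made-up variable-width record: int32, 1-byte length, then the string."""
    data = text.encode("ascii")
    return struct.pack(">iB", number, len(data)) + data


var_buf = b"".join(pack_var(i, "x" * i) for i in range(4))


def read_var(buf, i):
    """Offsets depend on every earlier string, so records are walked in order."""
    pos = 0
    for _ in range(i):
        (length,) = struct.unpack_from(">B", buf, pos + 4)
        pos += 4 + 1 + length          # skip int32, length byte, and string
    number, length = struct.unpack_from(">iB", buf, pos)
    text = buf[pos + 5:pos + 5 + length].decode("ascii")
    return number, text


print(read_fixed(fixed_buf, 3))
print(read_var(var_buf, 3))
```

With a constant record size, the whole buffer can be viewed as a table in one step; with a string in each record, every entry has to be decoded one after another, which is the speed difference described above.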

The current version of Uproot doesn't give much information about why an object couldn't be read, so it would be hard for me to give you an answer. Uproot 4 is being built with that accountability built in, because finding out why it failed is the first step in making it succeed. But that's still about a month in the future, and it may take longer to get it into CMSSW.

At the moment, your best chance for reading this is to get the file written in split mode, because all of these fields would definitely be accessible if it were split: "run", "luminosityBlock", "event" (int), "timeLow", "timeHigh", "luminosityBlock", "isRealData", "experimentType", "bunchCrossing", "orbitNumber", "storeNumber". Possibly others as well, but all of these are simple numeric types—if a numeric type is in a TBranch by itself (because splitting is turned on), then it can definitely be read.

pierinim commented 4 years ago

Thanks for the help. Unfortunately, there is no chance to remake the files. These are the legacy data from CMS Run I and they are frozen. I might have to change strategy and use a more complicated workflow (uproot for the rest, fwlite for this, then recombining datasets, then selecting...)
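
A rough sketch of the recombining step, under the assumption that both tools are run over the same file and that each FWLite row records the entry number it came from, so the two sources can be aligned by entry. All names and values here are hypothetical, and no Uproot or FWLite calls are shown.

```python
# Hypothetical per-entry column, as it might be read with Uproot.
# The list index is the entry number in the Events tree.
n_muons = [2, 1, 3]

# Hypothetical rows from a separate FWLite pass over the same file:
# (entry number, run, luminosityBlock, event). Order is deliberately shuffled.
aux_rows = [
    (2, 194050, 12, 3003),
    (0, 194050, 12, 3001),
    (1, 194050, 12, 3002),
]


def recombine(column, aux):
    """Align the auxiliary rows with the column by entry number."""
    aux_by_entry = {row[0]: row[1:] for row in aux}
    if set(aux_by_entry) != set(range(len(column))):
        raise ValueError("the two sources do not cover the same entries")
    return [aux_by_entry[i] + (column[i],) for i in range(len(column))]


combined = recombine(n_muons, aux_rows)

# Example selection applied after recombining: keep events with >= 2 muons.
selected = [row for row in combined if row[3] >= 2]
print(selected)
```

Checking that both sources cover exactly the same entries before merging guards against silently pairing values from different events.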