Closed chrisburr closed 7 years ago
I'll take a closer look, but I think this is not a bug (assuming you have slightly less than one item per event). root_numpy and therefore root_pandas present list types as an object array of arrays, which is convenient for Python for loops, but not for acceleration. uproot 1.x presents them in their convenient-for-acceleration form, a single array with no event boundaries. You get event boundaries by additionally extracting their counters (where are the counters for these branches?). I'm still deciding what uproot 2.0 should do by default: uproot 2.0 introduces the ability to swap out different interpretations, so that you can get arrays of arrays if that's what you really want, but the array-with-counter if you have some way to use it.
I think this file contains exactly one item per leaf per event. It looks like uproot
stops loading events before reaching the end:
import uproot
import root_pandas
f = uproot.open('PhaseSpaceSimulation.root')
tree = f['PhaseSpaceTree']
df = root_pandas.read_root('PhaseSpaceSimulation.root')
var = 'H1_PX'
data_from_uproot = tree.array(var)
data_from_root = df[var]
# This loop is fine
for i in range(len(data_from_uproot)):
assert data_from_uproot[i] == data_from_root.iloc[i], (var, i, data_from_uproot[i])
# This loop runs errors due to uproot returning fewer entries
for i in range(len(data_from_root)):
assert data_from_root.iloc[i] == data_from_uproot[i], (var, i, data_from_root.loc[i])
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-40-64300bf8645c> in <module>()
14
15 for i in range(len(data_from_root)):
---> 16 assert data_from_root.iloc[i] == data_from_uproot[i], (var, i, data_from_root.loc[i])
17
IndexError: index 47868 is out of bounds for axis 0 with size 47868
print(var, i, data_from_root.iloc[i])
H1_PX 47868 -20555.82
I've downloaded the file and looked at it with both uproot 1.5.3 and 2.0-alpha (the second, being a complete rewrite, is an independent look at the file). You're right that these are simple branches: one item per event. However, they're missing their last basket. Getting into the low-level data:
TTree.fEntries
, TBranch.fEntries
, and TBranch.fEntryNumber
agree that there should be 50000 entries.TBranch.fWriteBasket
says that there are 12 baskets.TBranch.fBasketSeek
has 13 entries (not unusual for this to be longer than fWriteBasket
; we usually have to truncate), and all of the seek values are within the file, but the 13th one is 0
(definitely wrong).When you call arrays
in uproot, you get the contents from the 12 baskets because fWriteBasket
says that's how many there are, and this is 47868 entries.
When I call root_numpy.root2array
or root_pandas.read_root
(which both call into my ROOT 6.08/04), I get malloc segmentation faults, which is about what you'd expect if you tried to seek to byte 0 of the file and interpret it as a basket. I don't know why your version of ROOT is not throwing errors, but check the last 2132 values in the DataFrame to see if it's not all zeros or random numbers or something. A likely scenario is that root_pandas allocates space in a DataFrame for the 50000 entries it expects, then fills it with what it has, and your version of ROOT is not failing with a segmentation fault (possibly raising an error that root_pandas ignores).
Maybe the file-writing procedure was missing the tree->Write(); file->Close()
step?
In #21, my version of ROOT could recover the missing data, somehow. In #14, my version of ROOT crashed (including root_numpy and root_pandas).
ROOT 6.08/04, built for Ubuntu.
Loading this file has worked for me with various builds of root (unfortunately I don't know when/how it was created). I can show it being loaded correctly using the official root docker container as follows.
If I create a file named Dockerfile
containing:
FROM rootproject/root-ubuntu16:latest
RUN apt-get update
RUN apt-get install -y python-pip python-numpy python-matplotlib
RUN pip install uproot root_numpy ipython==5
CMD /bin/bash
Then run:
xrdcp root://eospublic.cern.ch//eos/opendata/lhcb/AntimatterMatters2017/data/PhaseSpaceSimulation.root
docker build -t uproot-test .
docker run --rm -it -v $PWD:/uproot-test -w /uproot-test uproot-test ipython
Then run this at the ipython prompt:
import root_numpy
import ROOT
import uproot
f = ROOT.TFile.Open('PhaseSpaceSimulation.root')
tree = f.Get('PhaseSpaceTree')
f2 = ROOT.TFile.Open('PhaseSpaceSimulation-2.root', 'RECREATE')
t2 = tree.CloneTree()
f2.Write()
f2.Close()
print(root_numpy.root2array('PhaseSpaceSimulation.root', 'PhaseSpaceTree').shape)
print(set(v.shape for k, v in uproot.open('PhaseSpaceSimulation.root').get('PhaseSpaceTree').arrays().items()))
print(root_numpy.root2array('PhaseSpaceSimulation-2.root', 'PhaseSpaceTree').shape)
print(set(v.shape for k, v in uproot.open('PhaseSpaceSimulation-2.root').get('PhaseSpaceTree').arrays().items()))
The result is:
(50000,)
set([(47856,), (47868,)])
(50000,)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-2-124acff2982f> in <module>()
15
16 print(root_numpy.root2array('PhaseSpaceSimulation-2.root', 'PhaseSpaceTree').shape)
---> 17 print(set(v.shape for k, v in uproot.open('PhaseSpaceSimulation-2.root').get('PhaseSpaceTree').arrays().items()))
/usr/local/lib/python2.7/dist-packages/uproot/rootio.pyc in get(self, name, cycle)
274 if key.fName == name:
275 if cycle is None or key.fCycle == cycle:
--> 276 return key.get()
277 raise KeyError("not found: {0}".format(repr(name)))
278
/usr/local/lib/python2.7/dist-packages/uproot/rootio.pyc in get(self, dismiss)
744
745 elif classname in self._context.classes:
--> 746 return self._context.classes[classname].read(self._source, self._cursor.copied(), self._context)
747
748 else:
/usr/local/lib/python2.7/dist-packages/uproot/rootio.pyc in read(cls, source, cursor, context)
683 context = context.copy()
684 out = cls.__new__(cls)
--> 685 out = cls._readinto(out, source, cursor, context)
686 out._postprocess(source, cursor, context)
687 return out
/usr/local/lib/python2.7/dist-packages/uproot/rootio.pyc in _readinto(cls, self, source, cursor, context)
/usr/local/lib/python2.7/dist-packages/uproot/rootio.pyc in read(cls, source, cursor, context)
683 context = context.copy()
684 out = cls.__new__(cls)
--> 685 out = cls._readinto(out, source, cursor, context)
686 out._postprocess(source, cursor, context)
687 return out
/usr/local/lib/python2.7/dist-packages/uproot/rootio.pyc in _readinto(cls, self, source, cursor, context)
ValueError: attempting to read ROOT_3a3a_TIOFeatures object version 0 with a class generated by streamer version 1
So far as I can tell the rows missing from uproot contain valid looking data and I'm not sure what is going wrong when I try to load the cloned tree with uproot.
I still haven't managed to read this file in ROOT, but uproot 2.4.1 reads all 50000 entries, if you call recover()
.
>>> import uproot
>>> t = uproot.open("PhaseSpaceSimulation.root")["PhaseSpaceTree"]
>>> t.recover()
>>> for var, data in t.arrays().items():
... print var, len(data)
...
H3_IPChi2 50000
H3_PZ 50000
H3_PY 50000
H3_PX 50000
H2_PX 50000
H2_PY 50000
H2_PZ 50000
H3_ProbPi 50000
H2_IPChi2 50000
B_FlightDistance 50000
H1_Charge 50000
H3_isMuon 50000
H3_ProbK 50000
H1_ProbPi 50000
H1_ProbK 50000
H2_ProbPi 50000
B_VertexChi2 50000
H1_IPChi2 50000
H3_Charge 50000
H1_PY 50000
H1_PX 50000
H1_PZ 50000
H2_isMuon 50000
H2_Charge 50000
H1_isMuon 50000
H2_ProbK 50000
As stated elsewhere, this explicit recover()
is temporary. It causes a very different kind of basket (in terms of headers, location, etc.) to be loaded with the rest of the data and I want to be sure that I understand it before applying it to every file. For instance, I want to be sure that it never double-counts, or if so, under what circumstances.
I've tested myself and the values read with uproot are identical to those from ROOT once recover()
is called.
Thank you for all of your work with this, being able to load root files without ROOT is wonderful!
I'm comfortable enough with the recover()
function that it is now called automatically on affected branches: https://github.com/scikit-hep/uproot/releases/tag/2.6.1. This function had never been publicized, but now that it's redundant, it has been renamed to _recover()
(and you'll probably never call it directly).
When opening a ttree with 50000 entries uproot returns numpy arrays containing either 47856 or 47868 entries depending on the column. Plotting the data with TBrowser and loading the data
root_pandas
both work as expected.The file I've seen this with can be downloaded using:
Using uproot
Using ROOT pandas