scikit-hep / uproot3

ROOT I/O in pure Python and NumPy.
BSD 3-Clause "New" or "Revised" License
314 stars 67 forks source link

Incorrect number of entries returned by uproot #14

Closed chrisburr closed 7 years ago

chrisburr commented 7 years ago

When opening a ttree with 50000 entries uproot returns numpy arrays containing either 47856 or 47868 entries depending on the column. Plotting the data with TBrowser and loading the data root_pandas both work as expected.

The file I've seen this with can be downloaded using:

xrdcp root://eospublic.cern.ch//eos/opendata/lhcb/AntimatterMatters2017/data/PhaseSpaceSimulation.root .

Using uproot

In [0]: import uproot
In [1]: f = uproot.open('PhaseSpaceSimulation.root')
In [2]: tree = f['PhaseSpaceTree']
In [3]: {var: len(data) for var, data in tree.arrays().items()}
{b'B_FlightDistance': 47856,
 b'B_VertexChi2': 47856,
 b'H1_Charge': 47868,
 b'H1_IPChi2': 47868,
 b'H1_PX': 47868,
 b'H1_PY': 47868,
 b'H1_PZ': 47868,
 b'H1_ProbK': 47868,
 b'H1_ProbPi': 47868,
 b'H1_isMuon': 47868,
 b'H2_Charge': 47868,
 b'H2_IPChi2': 47868,
 b'H2_PX': 47868,
 b'H2_PY': 47868,
 b'H2_PZ': 47868,
 b'H2_ProbK': 47868,
 b'H2_ProbPi': 47868,
 b'H2_isMuon': 47868,
 b'H3_Charge': 47868,
 b'H3_IPChi2': 47868,
 b'H3_PX': 47868,
 b'H3_PY': 47868,
 b'H3_PZ': 47868,
 b'H3_ProbK': 47868,
 b'H3_ProbPi': 47868,
 b'H3_isMuon': 47868}

Using ROOT pandas

In [0]: import root_pandas
In [1]: df = root_pandas.read_root('PhaseSpaceSimulation.root')
In [2]: df.apply(len)
[All columns contain 50000 entries as expected]
In [3]: df.isnull().sum()
[No columns contain nan values]
jpivarski commented 7 years ago

I'll take a closer look, but I think this is not a bug (assuming you have slightly less than one item per event). root_numpy and therefore root_pandas present list types as an object array of arrays, which is convenient for Python for loops, but not for acceleration. uproot 1.x presents them in their convenient-for-acceleration form, a single array with no event boundaries. You get event boundaries by additionally extracting their counters (where are the counters for these branches?). I'm still deciding what uproot 2.0 should do by default: uproot 2.0 introduces the ability to swap out different interpretations, so that you can get arrays of arrays if that's what you really want, but the array-with-counter if you have some way to use it.

chrisburr commented 7 years ago

I think this file contains exactly one item per leaf per event. It looks like uproot stops loading events before reaching the end:

import uproot
import root_pandas

f = uproot.open('PhaseSpaceSimulation.root')
tree = f['PhaseSpaceTree']

df = root_pandas.read_root('PhaseSpaceSimulation.root')

var = 'H1_PX'
data_from_uproot = tree.array(var)
data_from_root = df[var]

# This loop is fine
for i in range(len(data_from_uproot)):
    assert data_from_uproot[i] == data_from_root.iloc[i], (var, i, data_from_uproot[i])

# This loop runs errors due to uproot returning fewer entries
for i in range(len(data_from_root)):
    assert data_from_root.iloc[i] == data_from_uproot[i], (var, i, data_from_root.loc[i])
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-40-64300bf8645c> in <module>()
     14
     15 for i in range(len(data_from_root)):
---> 16     assert data_from_root.iloc[i] == data_from_uproot[i], (var, i, data_from_root.loc[i])
     17

IndexError: index 47868 is out of bounds for axis 0 with size 47868
print(var, i, data_from_root.iloc[i])
H1_PX 47868 -20555.82
jpivarski commented 7 years ago

I've downloaded the file and looked at it with both uproot 1.5.3 and 2.0-alpha (the second, being a complete rewrite, is an independent look at the file). You're right that these are simple branches: one item per event. However, they're missing their last basket. Getting into the low-level data:

When you call arrays in uproot, you get the contents from the 12 baskets because fWriteBasket says that's how many there are, and this is 47868 entries.

When I call root_numpy.root2array or root_pandas.read_root (which both call into my ROOT 6.08/04), I get malloc segmentation faults, which is about what you'd expect if you tried to seek to byte 0 of the file and interpret it as a basket. I don't know why your version of ROOT is not throwing errors, but check the last 2132 values in the DataFrame to see if it's not all zeros or random numbers or something. A likely scenario is that root_pandas allocates space in a DataFrame for the 50000 entries it expects, then fills it with what it has, and your version of ROOT is not failing with a segmentation fault (possibly raising an error that root_pandas ignores).

Maybe the file-writing procedure was missing the tree->Write(); file->Close() step?

jpivarski commented 6 years ago

In #21, my version of ROOT could recover the missing data, somehow. In #14, my version of ROOT crashed (including root_numpy and root_pandas).

ROOT 6.08/04, built for Ubuntu.

chrisburr commented 6 years ago

Loading this file has worked for me with various builds of root (unfortunately I don't know when/how it was created). I can show it being loaded correctly using the official root docker container as follows.

If I create a file named Dockerfile containing:

FROM rootproject/root-ubuntu16:latest

RUN apt-get update
RUN apt-get install -y python-pip python-numpy python-matplotlib
RUN pip install uproot root_numpy ipython==5

CMD /bin/bash

Then run:

xrdcp root://eospublic.cern.ch//eos/opendata/lhcb/AntimatterMatters2017/data/PhaseSpaceSimulation.root
docker build -t uproot-test .
docker run --rm -it -v $PWD:/uproot-test -w /uproot-test uproot-test ipython

Then run this at the ipython prompt:

import root_numpy
import ROOT
import uproot

f = ROOT.TFile.Open('PhaseSpaceSimulation.root')
tree = f.Get('PhaseSpaceTree')
f2 = ROOT.TFile.Open('PhaseSpaceSimulation-2.root', 'RECREATE')
t2 = tree.CloneTree()
f2.Write()
f2.Close()

print(root_numpy.root2array('PhaseSpaceSimulation.root', 'PhaseSpaceTree').shape)
print(set(v.shape for k, v in uproot.open('PhaseSpaceSimulation.root').get('PhaseSpaceTree').arrays().items()))

print(root_numpy.root2array('PhaseSpaceSimulation-2.root', 'PhaseSpaceTree').shape)
print(set(v.shape for k, v in uproot.open('PhaseSpaceSimulation-2.root').get('PhaseSpaceTree').arrays().items()))

The result is:

(50000,)
set([(47856,), (47868,)])
(50000,)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-2-124acff2982f> in <module>()
     15
     16 print(root_numpy.root2array('PhaseSpaceSimulation-2.root', 'PhaseSpaceTree').shape)
---> 17 print(set(v.shape for k, v in uproot.open('PhaseSpaceSimulation-2.root').get('PhaseSpaceTree').arrays().items()))

/usr/local/lib/python2.7/dist-packages/uproot/rootio.pyc in get(self, name, cycle)
    274                 if key.fName == name:
    275                     if cycle is None or key.fCycle == cycle:
--> 276                         return key.get()
    277             raise KeyError("not found: {0}".format(repr(name)))
    278

/usr/local/lib/python2.7/dist-packages/uproot/rootio.pyc in get(self, dismiss)
    744
    745             elif classname in self._context.classes:
--> 746                 return self._context.classes[classname].read(self._source, self._cursor.copied(), self._context)
    747
    748             else:

/usr/local/lib/python2.7/dist-packages/uproot/rootio.pyc in read(cls, source, cursor, context)
    683             context = context.copy()
    684         out = cls.__new__(cls)
--> 685         out = cls._readinto(out, source, cursor, context)
    686         out._postprocess(source, cursor, context)
    687         return out

/usr/local/lib/python2.7/dist-packages/uproot/rootio.pyc in _readinto(cls, self, source, cursor, context)

/usr/local/lib/python2.7/dist-packages/uproot/rootio.pyc in read(cls, source, cursor, context)
    683             context = context.copy()
    684         out = cls.__new__(cls)
--> 685         out = cls._readinto(out, source, cursor, context)
    686         out._postprocess(source, cursor, context)
    687         return out

/usr/local/lib/python2.7/dist-packages/uproot/rootio.pyc in _readinto(cls, self, source, cursor, context)

ValueError: attempting to read ROOT_3a3a_TIOFeatures object version 0 with a class generated by streamer version 1

So far as I can tell the rows missing from uproot contain valid looking data and I'm not sure what is going wrong when I try to load the cloned tree with uproot.

jpivarski commented 6 years ago

I still haven't managed to read this file in ROOT, but uproot 2.4.1 reads all 50000 entries, if you call recover().

>>> import uproot
>>> t = uproot.open("PhaseSpaceSimulation.root")["PhaseSpaceTree"]
>>> t.recover()
>>> for var, data in t.arrays().items():
...   print var, len(data)
...
H3_IPChi2 50000
H3_PZ 50000
H3_PY 50000
H3_PX 50000
H2_PX 50000
H2_PY 50000
H2_PZ 50000
H3_ProbPi 50000
H2_IPChi2 50000
B_FlightDistance 50000
H1_Charge 50000
H3_isMuon 50000
H3_ProbK 50000
H1_ProbPi 50000
H1_ProbK 50000
H2_ProbPi 50000
B_VertexChi2 50000
H1_IPChi2 50000
H3_Charge 50000
H1_PY 50000
H1_PX 50000
H1_PZ 50000
H2_isMuon 50000
H2_Charge 50000
H1_isMuon 50000
H2_ProbK 50000

As stated elsewhere, this explicit recover() is temporary. It causes a very different kind of basket (in terms of headers, location, etc.) to be loaded with the rest of the data and I want to be sure that I understand it before applying it to every file. For instance, I want to be sure that it never double-counts, or if so, under what circumstances.

chrisburr commented 6 years ago

I've tested myself and the values read with uproot are identical to those from ROOT once recover() is called.

Thank you for all of your work with this, being able to load root files without ROOT is wonderful!

jpivarski commented 6 years ago

I'm comfortable enough with the recover() function that it is now called automatically on affected branches: https://github.com/scikit-hep/uproot/releases/tag/2.6.1. This function had never been publicized, but now that it's redundant, it has been renamed to _recover() (and you'll probably never call it directly).