scikit-hep / uproot3

ROOT I/O in pure Python and NumPy.
BSD 3-Clause "New" or "Revised" License

tree.pandas.df() with branches==None AttributeError #102

Closed balarsen closed 6 years ago

balarsen commented 6 years ago

This seems like it crept in recently, as it used to work.

>>> import uproot
>>> tree = uproot.open("Zmumu.root")["events"]
>>> tree.pandas.df(["pt1", "eta1", "phi1", "pt2", "eta2", "phi2"])

... works

>>> import uproot
>>> tree = uproot.open("Zmumu.root")["events"]
>>> tree.pandas.df()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-6-6bcef7f7c748> in <module>()
      1 import uproot
      2 tree = uproot.open("Zmumu.root")["events"]
----> 3 tree.pandas.df()
      4

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/uproot/_connect/to_pandas.py in df(self, branches, entrystart, entrystop, flatten, cache, basketcache, keycache, executor, blocking)
     41     def df(self, branches=None, entrystart=None, entrystop=None, flatten=True, cache=None, basketcache=None, keycache=None, executor=None, blocking=True):
     42         import pandas
---> 43         return self._tree.arrays(branches=branches, outputtype=pandas.DataFrame, entrystart=entrystart, entrystop=entrystop, flatten=flatten, cache=cache, basketcache=basketcache, keycache=keycache, executor=executor, blocking=blocking)

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/uproot/tree.py in arrays(self, branches, outputtype, entrystart, entrystop, flatten, cache, basketcache, keycache, executor, blocking)
    498         # if blocking, return the result of that function; otherwise, the function itself
    499         if blocking:
--> 500             return wait()
    501         else:
    502             return wait

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/uproot/tree.py in wait()
    451                             array = future()
    452
--> 453                             entries = numpy.empty(len(array.content), dtype=numpy.int64)
    454                             subentries = numpy.empty(len(array.content), dtype=numpy.int64)
    455                             starts, stops = array.starts, array.stops

AttributeError: 'Strings' object has no attribute 'content'

balarsen commented 6 years ago

47:balarsen@rbsp4 tests $ python --version
Python 3.6.5
18:balarsen@rbsp4 Downloads $ pip list
Package                    Version
-------------------------- ----------------------------
alabaster                  0.7.11
appnope                    0.1.0
arrow                      0.12.1
astropy                    3.0.3
atomicwrites               1.1.5
attrs                      18.1.0
Babel                      2.6.0
backcall                   0.1.0
BayesInst                  0.3.dev22+ga7d5c0d.d20180706
bleach                     2.1.3
certifi                    2018.4.16
chardet                    3.0.4
click                      6.7
coverage                   4.5.1
cycler                     0.10.0
Cython                     0.28.4
decorator                  4.3.0
docutils                   0.14
docx                       0.2.4
entrypoints                0.2.3
et-xmlfile                 1.0.1
gitdb2                     2.0.4
GitPython                  2.1.11
h5py                       2.8.0
html5lib                   1.0.1
hypothesis                 3.66.1
idna                       2.7
imagesize                  1.0.0
ipykernel                  4.8.2
ipython                    6.4.0
ipython-genutils           0.2.0
ipywidgets                 7.2.1
jdcal                      1.4
jedi                       0.12.1
Jinja2                     2.10
joblib                     0.12.0
jsonschema                 2.6.0
jupyter                    1.0.0
jupyter-client             5.2.3
jupyter-console            5.2.0
jupyter-core               4.4.0
kiwisolver                 1.0.1
LANLpygeometry             0.1
lxml                       4.2.3
MarkupSafe                 1.0
matplotlib                 2.2.2
mistune                    0.8.3
more-itertools             4.2.0
nbconvert                  5.3.1
nbformat                   4.4.0
nose                       1.3.7
notebook                   5.6.0
numexpr                    2.6.5
numpy                      1.14.5
openpyxl                   2.5.4
packaging                  17.1
pandas                     0.23.3
pandocfilters              1.4.2
parameterized              0.6.1
parso                      0.3.1
path.py                    11.0.1
patsy                      0.5.0
pbr                        4.1.0
pexpect                    4.6.0
pickleshare                0.7.4
Pillow                     5.2.0
pip                        10.0.1
pluggy                     0.6.0
prometheus-client          0.3.0
prompt-toolkit             1.0.15
ptyprocess                 0.6.0
py                         1.5.4
Pygments                   2.2.0
pymc3                      3.4.1
pyparsing                  2.2.0
pytest                     3.6.3
pytest-cov                 2.5.1
python-dateutil            2.7.3
pytz                       2018.5
pyzmq                      17.1.0
qtconsole                  4.3.1
requests                   2.19.1
ruamel.appconfig           0.5.4
ruamel.std.argparse        0.8.1
ruamel.std.pathlib         0.6.3
scikit-learn               0.19.1
scikit-optimize            0.5.2
scipy                      1.1.0
seaborn                    0.8.1
Send2Trash                 1.5.0
setuptools                 40.0.0
setuptools-scm             2.1.0
setuptools-scm-git-archive 1.0
simplegeneric              0.8.1
six                        1.11.0
sklearn                    0.0
smmap2                     2.0.4
snakefood                  1.4
snowballstemmer            1.2.1
spacepy                    0.1.6
Sphinx                     1.7.5
sphinx-git                 10.1.1
sphinxcontrib-websupport   1.1.0
stevedore                  1.28.0
STUDIO                     0.0.0
tables                     3.4.4
terminado                  0.8.1
testpath                   0.3.1
Theano                     1.0.2
tornado                    5.1
tqdm                       4.23.4
traitlets                  4.3.2
uproot                     2.9.4
urllib3                    1.23
version-information        1.0.3
virtualenv                 16.0.0
virtualenv-clone           0.3.0
virtualenvwrapper          4.8.2
wcwidth                    0.1.7
webencodings               0.5.1
widgetsnbextension         3.2.1
xarray                     0.10.7
xlwt                       1.3.0
xmltodict                  0.11.0

balarsen commented 6 years ago

I have no idea how to go about figuring out the issue, but adding the following test to tests/test_issues.py at least captures the error:

    def test_issue102(self):
        t = uproot.open("tests/samples/Zmumu.root")["events"]
        assert len(t.pandas.df(["pt1", "eta1", "phi1", "pt2", "eta2", "phi2"])) == 2304
        assert len(t.pandas.df()) == 2304

jpivarski commented 6 years ago

Thanks for catching this. I'll fix it as soon as possible. It's not a mysterious bug, but it illustrates that I need to systematize some of the special-case handling for different branch types. You're getting this error because some of your branches have string type and the DataFrame-handling code doesn't handle that case.

You can avoid this error (for now) by reading the data as arrays and converting them into a DataFrame:

df = pandas.DataFrame(columns=tree.allkeys(), data=tree.arrays())

What you would lose by doing this is the ability to flatten jagged data into a DataFrame with a MultiIndex. (It's equivalent to flatten=False.)
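
To illustrate what flattening into a MultiIndex means, here is a minimal sketch using a stand-in jagged array. The names content and counts and all of the values are made up for illustration, and this is not uproot's actual implementation:

```python
import numpy as np
import pandas as pd

# Stand-in for one jagged branch: 3 entries holding 2, 0, and 3 values.
content = np.array([1.1, 2.2, 3.3, 4.4, 5.5])
counts = np.array([2, 0, 3])

# Offsets that delimit each entry's slice of `content`.
stops = np.cumsum(counts)
starts = stops - counts

# One (entry, subentry) pair for every value in `content`.
entries = np.repeat(np.arange(len(counts)), counts)
subentries = np.arange(len(content)) - np.repeat(starts, counts)

index = pd.MultiIndex.from_arrays([entries, subentries],
                                  names=["entry", "subentry"])
flat = pd.DataFrame({"value": content}, index=index)
print(flat)
```

Each row of flat is one value from the jagged array, and the empty entry (entry 1) simply contributes no rows.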

I'll fix it as soon as possible (but it could be a week, since I'm on vacation).

balarsen commented 6 years ago

Perfect, no problem. Thanks for the hard work! Enjoy the vacation.

balarsen commented 6 years ago

An interesting observation on some 100M plain (not jagged, etc.) ROOT files:

tree = uproot.open("./zep_hemisphere_1_95.root")['EventInfo']
%timeit df = pandas.DataFrame(columns=tree.allkeys(), data=tree.arrays())
# 1min 16s ± 552 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

def makedf(tree):
    # read each branch separately, then build the DataFrame from the dict
    df = {}
    for k in tree.keys():
        df[k] = tree[k].array()
    return pandas.DataFrame(df)
%timeit makedf(tree)
# 44.1 s ± 539 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This is a pretty notable difference in speed. It seems like the first should be faster, but maybe it has other stuff going on behind the scenes that slows it down.

jpivarski commented 6 years ago

That's weird. The tree.arrays method is literally each branch.array in sequence when there is no executor (parallel processing). It internally creates functions and uses them once to accommodate parallel processing, which might be a slight hit that scales with the number of branches (not the size of their contents).
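
As a rough sketch of that pattern (not uproot's actual code): read_branch and arrays below are hypothetical stand-ins, and only the idea of one zero-argument function per branch, called once inside wait, is taken from the description above and the traceback.

```python
from concurrent.futures import ThreadPoolExecutor

def read_branch(name):
    # Hypothetical stand-in for reading one branch from a file.
    return [len(name)] * 3

def arrays(names, executor=None, blocking=True):
    # Build one zero-argument function per branch; each is called exactly once.
    if executor is None:
        # Sequential case: nothing runs until wait() calls the function.
        futures = [(n, lambda n=n: read_branch(n)) for n in names]
    else:
        # Parallel case: submit() starts the work now, .result waits for it.
        futures = [(n, executor.submit(read_branch, n).result) for n in names]

    def wait():
        return {n: future() for n, future in futures}

    # If blocking, return the result of wait; otherwise, wait itself.
    return wait() if blocking else wait

print(arrays(["pt1", "eta1"]))
```

The per-branch function objects are the overhead in question: their number grows with the number of branches, not with the amount of data in each branch.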

The other confounding variable here is constructing the DataFrame, which has performance characteristics that are mysterious to me. My prescription of setting columns and data is because it gives Pandas all the information at once (presumably, it can make use of that information to optimize) and columns sets the order. columns is the only Pandas difference between your two examples.

If it turns out to be array versus arrays, I'll look into it with more examples. I won't change it on the basis of one example because it might not be general, especially if it means complicating the code (separate parallel and sequential cases) for the sake of a speedup.
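
One way to separate the two effects is to time the array reading and the DataFrame construction independently. The sketch below does this with synthetic data: read_arrays is a made-up stand-in for tree.arrays(), and the branch count and sizes are arbitrary assumptions.

```python
import time

import numpy as np
import pandas as pd

def timed(func):
    # Call func once and return (result, elapsed seconds).
    start = time.perf_counter()
    result = func()
    return result, time.perf_counter() - start

rng = np.random.default_rng(0)

def read_arrays():
    # Stand-in for tree.arrays(): 20 flat branches of 100,000 values each.
    return {"branch%d" % i: rng.normal(size=100_000) for i in range(20)}

arrays, read_seconds = timed(read_arrays)
frame, build_seconds = timed(lambda: pd.DataFrame(arrays))
print("read: %.3f s, DataFrame: %.3f s" % (read_seconds, build_seconds))
```

With a real file, replacing read_arrays with the tree.arrays() call (or with the loop over branch.array()) would show whether the difference comes from reading or from Pandas.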

balarsen commented 6 years ago

And the plot thickens, since I don't understand how the heck DataFrames are constructed.

In my case pandas.DataFrame(tree.arrays()) and pandas.DataFrame(columns=tree.allkeys(), data=tree.arrays()) give the same result.

%timeit df = pandas.DataFrame(tree.arrays())
# 30.9 s ± 597 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

jpivarski commented 6 years ago

(Seeing as I'm traveling with my family, I can't try these things myself, but I can keep giving you suggestions of things to try. I don't know, however, if your physics case really needs more performance on reading these DataFrames. As a library developer, I'm on the lookout for speedups, but if your focus is on physics, the difference between a minute and half a minute isn't that significant.)

I suggested using columns because the order of the columns might be important. The dict returned by tree.arrays() might yield the same order, but not necessarily. It might also work to use an OrderedDict:

import collections
df = pandas.DataFrame(tree.arrays(outputtype=collections.OrderedDict))

When uproot fills an OrderedDict, it does so in the TTree's natural branch order.

But then again, maybe the column order doesn't matter to you. :)
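
A small sketch of how the column order follows the input, using a hand-built OrderedDict as a stand-in for the output of tree.arrays(outputtype=collections.OrderedDict). The branch names are borrowed from the Zmumu example and the values are made up:

```python
import collections

import pandas as pd

# Stand-in for tree.arrays(outputtype=collections.OrderedDict).
arrays = collections.OrderedDict([
    ("pt1", [1.0, 2.0]),
    ("eta1", [0.1, 0.2]),
    ("phi1", [3.0, 3.1]),
])

# Columns follow the OrderedDict's insertion order.
df = pd.DataFrame(arrays)
print(list(df.columns))

# Passing columns explicitly selects and orders the columns by key.
reordered = pd.DataFrame(arrays, columns=["phi1", "pt1", "eta1"])
print(list(reordered.columns))
```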

balarsen commented 6 years ago

Thanks. I really only point these things out because they're interesting, as a way to more fully understand what is at the bottom of the whole system. It seems like the actual bug will be easy to fix when you return, and the rest is something to file away deep in the brain as an "oh, I remember that" for when it comes back as an enhancement or someone's application requires more speed than is currently there. As you point out, that is not me right now, other than being nerd driven.

My particular love of this package is driven by moving away from ROOT as early in my processing as possible, which lets me use tools I am more comfortable with. I'm not a HEP guy but a space physics guy using Geant for instrument responses.

jpivarski commented 6 years ago

I got up before everyone else and fixed the original bug that started this thread. I couldn't find the performance difference, but I don't have your file. Considering the changes that are in store for this bit of code, however, it might not be worth tuning it until after the awkward-arrays are in.