Closed balarsen closed 6 years ago
47:balarsen@rbsp4 tests $ python --version
Python 3.6.5
18:balarsen@rbsp4 Downloads $ pip list
Package Version
-------------------------- ----------------------------
alabaster 0.7.11
appnope 0.1.0
arrow 0.12.1
astropy 3.0.3
atomicwrites 1.1.5
attrs 18.1.0
Babel 2.6.0
backcall 0.1.0
BayesInst 0.3.dev22+ga7d5c0d.d20180706
bleach 2.1.3
certifi 2018.4.16
chardet 3.0.4
click 6.7
coverage 4.5.1
cycler 0.10.0
Cython 0.28.4
decorator 4.3.0
docutils 0.14
docx 0.2.4
entrypoints 0.2.3
et-xmlfile 1.0.1
gitdb2 2.0.4
GitPython 2.1.11
h5py 2.8.0
html5lib 1.0.1
hypothesis 3.66.1
idna 2.7
imagesize 1.0.0
ipykernel 4.8.2
ipython 6.4.0
ipython-genutils 0.2.0
ipywidgets 7.2.1
jdcal 1.4
jedi 0.12.1
Jinja2 2.10
joblib 0.12.0
jsonschema 2.6.0
jupyter 1.0.0
jupyter-client 5.2.3
jupyter-console 5.2.0
jupyter-core 4.4.0
kiwisolver 1.0.1
LANLpygeometry 0.1
lxml 4.2.3
MarkupSafe 1.0
matplotlib 2.2.2
mistune 0.8.3
more-itertools 4.2.0
nbconvert 5.3.1
nbformat 4.4.0
nose 1.3.7
notebook 5.6.0
numexpr 2.6.5
numpy 1.14.5
openpyxl 2.5.4
packaging 17.1
pandas 0.23.3
pandocfilters 1.4.2
parameterized 0.6.1
parso 0.3.1
path.py 11.0.1
patsy 0.5.0
pbr 4.1.0
pexpect 4.6.0
pickleshare 0.7.4
Pillow 5.2.0
pip 10.0.1
pluggy 0.6.0
prometheus-client 0.3.0
prompt-toolkit 1.0.15
ptyprocess 0.6.0
py 1.5.4
Pygments 2.2.0
pymc3 3.4.1
pyparsing 2.2.0
pytest 3.6.3
pytest-cov 2.5.1
python-dateutil 2.7.3
pytz 2018.5
pyzmq 17.1.0
qtconsole 4.3.1
requests 2.19.1
ruamel.appconfig 0.5.4
ruamel.std.argparse 0.8.1
ruamel.std.pathlib 0.6.3
scikit-learn 0.19.1
scikit-optimize 0.5.2
scipy 1.1.0
seaborn 0.8.1
Send2Trash 1.5.0
setuptools 40.0.0
setuptools-scm 2.1.0
setuptools-scm-git-archive 1.0
simplegeneric 0.8.1
six 1.11.0
sklearn 0.0
smmap2 2.0.4
snakefood 1.4
snowballstemmer 1.2.1
spacepy 0.1.6
Sphinx 1.7.5
sphinx-git 10.1.1
sphinxcontrib-websupport 1.1.0
stevedore 1.28.0
STUDIO 0.0.0
tables 3.4.4
terminado 0.8.1
testpath 0.3.1
Theano 1.0.2
tornado 5.1
tqdm 4.23.4
traitlets 4.3.2
uproot 2.9.4
urllib3 1.23
version-information 1.0.3
virtualenv 16.0.0
virtualenv-clone 0.3.0
virtualenvwrapper 4.8.2
wcwidth 0.1.7
webencodings 0.5.1
widgetsnbextension 3.2.1
xarray 0.10.7
xlwt 1.3.0
xmltodict 0.11.0
I have no idea how to go about figuring out the issue, but adding
    def test_issue102(self):
        t = uproot.open("tests/samples/Zmumu.root")["events"]
        assert len(t.pandas.df(["pt1", "eta1", "phi1", "pt2", "eta2", "phi2"])) == 2304
        assert len(t.pandas.df()) == 2304
to tests/test_issues.py at least captures the error.
Thanks for catching this. I'll fix it as soon as possible. It's not a mysterious bug, but it illustrates that I need to systematize some of the special-case handling for different branch types. You're getting this error because some of your branches have string type and the DataFrame-handling code doesn't handle that case.
You can avoid this error (for now) by reading the data as arrays and converting them into a DataFrame:
df = pandas.DataFrame(columns=tree.allkeys(), data=tree.arrays())
The thing you would be missing by doing this is the ability to flatten jagged data into a DataFrame with a MultiIndex. (It's equivalent to flatten=False.)
I'll fix it as soon as possible (but it could be a week, since I'm on vacation).
Perfect, no problem. Thanks for the hard work! Enjoy the vacation.
An interesting observation on some 100M plain (not jagged, etc.) ROOT files:
tree = uproot.open("./zep_hemisphere_1_95.root")['EventInfo']
%timeit df = pandas.DataFrame(columns=tree.allkeys(), data=tree.arrays())
# 1min 16s ± 552 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
def makedf(tree):
    # Collect one array per branch, keyed by the branch name.
    df = {}
    for k in tree.keys():
        df[k] = tree[k].array()
    return pandas.DataFrame(df)
%timeit makedf(tree)
# 44.1 s ± 539 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
This is a pretty notable difference in speed. It seems like the first should be faster, but maybe it has other stuff going on behind the scenes that slows it down.
That's weird. The tree.arrays method is literally each branch.array in sequence when there is no executor (parallel processing). It internally creates functions and uses them once to accommodate parallel processing, which might be a slight hit that scales with the number of branches (not the size of their contents).
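To make that description concrete, here is a minimal sketch of the pattern: one small function is created per branch, then the functions are either called in sequence or handed to an executor. This is not uproot's actual implementation; `make_reader`, `read_all`, and the in-memory `branches` dict are hypothetical stand-ins, and the "read" is just a copy of made-up data.

```python
from concurrent.futures import ThreadPoolExecutor


def make_reader(name, branches):
    """Build a zero-argument function that 'reads' one branch (hypothetical)."""
    def read():
        # Stand-in for the real work of decoding a branch from the file.
        return name, list(branches[name])
    return read


def read_all(branches, executor=None):
    """Read every branch, sequentially or through an executor."""
    # One closure per branch: a small cost that grows with the number
    # of branches, not with the amount of data in each branch.
    readers = [make_reader(name, branches) for name in branches]
    if executor is None:
        # Sequential path: call each reader in turn.
        results = [read() for read in readers]
    else:
        # Parallel path: the same readers, dispatched by the executor.
        results = list(executor.map(lambda read: read(), readers))
    return dict(results)


branches = {"pt1": (1.0, 2.0), "eta1": (0.1, 0.2)}  # made-up data
sequential = read_all(branches)
with ThreadPoolExecutor(max_workers=2) as pool:
    parallel = read_all(branches, executor=pool)
```

Writing the sequential case as "the parallel case without an executor" keeps a single code path, at the price of creating functions that are each used only once.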
The other confounding variable here is constructing the DataFrame, which has performance characteristics that are mysterious to me. My prescription of setting columns and data is because it gives Pandas all the information at once (presumably, it can make use of that information to optimize) and columns sets the order. columns is the only Pandas difference between your two examples.
If it turns out to be array versus arrays, I'll look into it with more examples. I won't change it on the basis of one example because it might not be general, especially if it means complicating the code (separate parallel and sequential cases) for the sake of a speedup.
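One way to separate the two suspects (reading the arrays versus constructing the DataFrame) is to time each stage on its own. The following is only a sketch: `time_stages` is a hypothetical helper, and the fake read and build functions are made-up stand-ins so that it runs without a ROOT file.

```python
import time


def time_stages(read, build):
    """Time a read step and a build step separately (hypothetical helper)."""
    start = time.perf_counter()
    data = read()            # with a real file, e.g. lambda: tree.arrays()
    middle = time.perf_counter()
    result = build(data)     # with a real file, e.g. pandas.DataFrame
    end = time.perf_counter()
    return result, middle - start, end - middle


def fake_read():
    # Made-up stand-in for reading two branches.
    return {"x": list(range(1000)), "y": list(range(1000))}


def fake_build(data):
    # Made-up stand-in for building a table: just the sorted column names.
    return sorted(data)


result, read_seconds, build_seconds = time_stages(fake_read, fake_build)
print(f"read: {read_seconds:.6f} s, build: {build_seconds:.6f} s")
```

If the read time is about the same for both approaches and only the build time differs, the difference lies in how Pandas is given the data rather than in array versus arrays.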
And the plot thickens, since I don't understand how the heck DataFrames are constructed.
In my case pandas.DataFrame(tree.arrays()) and pandas.DataFrame(columns=tree.allkeys(), data=tree.arrays()) give the same result.
%timeit df = pandas.DataFrame(tree.arrays())
# 30.9 s ± 597 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
(Seeing as I'm traveling with my family, I can't try these things myself, but I can keep giving you suggestions of things to try. I don't know, however, if your physics case really needs more performance on reading these DataFrames. As a library developer, I'm on the lookout for speedups, but if your focus is on physics, the difference between a minute and half a minute isn't that significant.)
I suggested using columns because the order of the columns might be important. The dict returned by tree.arrays() might yield the same order, but not necessarily. It might also work to use an OrderedDict:
import collections
df = pandas.DataFrame(tree.arrays(outputtype=collections.OrderedDict))
When uproot fills an OrderedDict, it does so in the TTree's natural branch order.
But then again, maybe the column order doesn't matter to you. :)
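As a small illustration of why the container type matters for ordering, here is a sketch with a hypothetical `fill` helper that mimics filling a container in branch order. It is not uproot code, and the branch names are made up. On Python 3.6 (the version in this thread), a plain dict keeping insertion order is a CPython implementation detail; it is only guaranteed by the language from Python 3.7, whereas an OrderedDict guarantees it everywhere.

```python
import collections


def fill(branch_names, outputtype=dict):
    """Fill a container in the given branch order (hypothetical helper)."""
    out = outputtype()
    for name in branch_names:
        out[name] = [0.0]  # placeholder for the branch's array
    return out


natural_order = ["pt1", "eta1", "phi1"]  # made-up branch order
ordered = fill(natural_order, outputtype=collections.OrderedDict)
print(list(ordered))  # ['pt1', 'eta1', 'phi1']
```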
Thanks. I really only point these out as interesting things, as a way to more fully understand what is at the bottom of the whole system. It seems like the actual bug will be easy to fix when you return, and the rest is something to file away deep in the brain as an "oh, I remember that" for when it comes back as an enhancement or someone's application requires more speed than is currently there. As you point out, that is not me currently, other than being nerd-driven.
My particular love of this package is driven by moving away from ROOT as early in my processing as possible, which lets me use tools I am more comfortable with. I'm not a HEP guy but a space physics guy using Geant for instrument responses.
I got up before everyone else and fixed the original bug that started this thread. I couldn't find the performance difference, but I don't have your file. Considering the changes that are in store for this bit of code, however, it might not be worth tuning it until after the awkward-arrays are in.
This seems like it crept in recently, as it used to work.
... works