scikit-hep / root_pandas

A Python module for conveniently loading/saving ROOT files as pandas DataFrames
MIT License

root_pandas randomly shuffles index of columns #81

Closed zhangzc11 closed 5 years ago

zhangzc11 commented 5 years ago

I recently realized that when constructing a DataFrame with root_pandas.read_root, the order of the columns gets randomly shuffled. Try the following:

wget http://scikit-hep.org/uproot/examples/HZZ.root

Here is the test.py code:

#!/usr/bin/env python
import uproot
import root_pandas as rp

variables = ['MET_px', 'MET_py', 'EventWeight']

# Read the same three branches through root_pandas...
df = rp.read_root('HZZ.root', 'events', columns=variables)

# ...and through uproot, for comparison.
events = uproot.open("HZZ.root")["events"]
df2 = events.pandas.df(variables, flatten=False)

# Print the first row of each DataFrame.
print(df.values[0])
print(df2.values[0])

If you run test.py multiple times, you will see that the printed result from the root_pandas DataFrame (df) changes, while the DataFrame from uproot (df2) is always the same (and follows the order of the TBranch name list).

zhicaiz@zhicaiz ~$ python test.py
[2.5636332  5.912771   0.00927101]
[5.912771   2.5636332  0.00927101]
zhicaiz@zhicaiz ~$ python test.py
[0.00927101 2.5636332  5.912771  ]
[5.912771   2.5636332  0.00927101]

root_pandas version I used: v0.6.1
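
A possible workaround on affected versions, assuming the column labels stay attached to the right data and only their order is shuffled, is to reselect the columns by name after reading. The sketch below uses a hand-built DataFrame as a stand-in for the result of rp.read_root, so it runs without HZZ.root; the values are the first row shown above.

```python
import pandas as pd

variables = ['MET_px', 'MET_py', 'EventWeight']

# Stand-in for the frame returned by rp.read_root, with the columns
# in an arbitrary order (assumption: labels still match their data).
df = pd.DataFrame({
    'EventWeight': [0.00927101],
    'MET_py': [2.5636332],
    'MET_px': [5.912771],
})

# Selecting with a list of names returns the columns in the requested order.
df_ordered = df[variables]
print(list(df_ordered.columns))
```

With this, df_ordered.values[0] no longer depends on the order in which the columns came back.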

chrisburr commented 5 years ago

Thanks for the clear reproducer. This will be fixed in 0.7.0, which is currently working its way through the pipeline.

The order is still ambiguous if wildcards or expansions are used. For example, asking for ['MET_*', 'EventWeight', 'MET_px'] will now result in ['EventWeight', 'MET_px', 'MET_py'] (explicitly named columns first, wildcard matches afterwards). Is this a problem for your use case?
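
To make the ordering rule concrete, here is a rough sketch of how "named columns first, wildcard matches afterwards" could be resolved. It is not the root_pandas implementation: resolve_columns is a hypothetical helper, it uses Python's fnmatch for the wildcards, it ignores brace-style expansions, and it assumes wildcard matches follow the order of the available branch list.

```python
from fnmatch import fnmatchcase


def is_pattern(name):
    """Return True if the name contains a shell-style wildcard character."""
    return any(ch in name for ch in '*?[')


def resolve_columns(requested, available):
    """Hypothetical helper (not part of root_pandas): order the requested columns."""
    ordered = []
    # Explicitly named columns come first, in the order they were requested.
    for name in requested:
        if not is_pattern(name) and name in available and name not in ordered:
            ordered.append(name)
    # Wildcard matches follow, in the order the branches appear in `available`.
    for pattern in requested:
        if is_pattern(pattern):
            for name in available:
                if fnmatchcase(name, pattern) and name not in ordered:
                    ordered.append(name)
    return ordered


available = ['MET_px', 'MET_py', 'EventWeight']
result = resolve_columns(['MET_*', 'EventWeight', 'MET_px'], available)
print(result)  # ['EventWeight', 'MET_px', 'MET_py']
```

If a fixed order matters regardless of how the columns were requested, reselecting by an explicit list of names after reading avoids depending on this rule at all.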