scikit-hep / uproot5

ROOT I/O in pure Python and NumPy.
https://uproot.readthedocs.io
BSD 3-Clause "New" or "Revised" License

not able to create pandas dataframe #1329

Closed: sv3048 closed this issue 3 weeks ago

sv3048 commented 3 weeks ago

Hi developers!

I am trying to create a pandas DataFrame, but I get a tuple each time.

AttributeError                            Traceback (most recent call last)
/tmp/ipykernel_557/1180174254.py in <module>
      2 import pandas as pd
      3
----> 4 dfall.columns

AttributeError: 'tuple' object has no attribute 'columns'

Here is the code I run to make the pandas DataFrame:

import uproot

filename = "/eos/cms/store/group/phys_higgs/cmshww/amassiro/HWWNano/Summer20UL18_106x_nAODv9_Full2018v9/MCl1loose2018v9__MCCorr2018v9NoJERInHorn__MCCombJJLNu2018/nanoLatino_GluGluToWWToQQ_Sig_private__part9.root"
file = uproot.open(filename)

# show what is inside the root file loaded from uproot
print(file.classnames())
print(file.keys())

tree = file["Events"]  # select the TTree inside the root file
tree.show()  # show all the branches inside the TTree
dfall = tree.arrays(library="pd")  # convert uproot TTree into pandas dataframe
#dfall.columns
print("type of dfall", type(dfall))
print("============================================")
print("File loaded with ", len(dfall), " events ")

Thanks, Sadhana

jpivarski commented 3 weeks ago

Uproot 4.x tries to "explode" ragged data: an array with a variable number of particles per event is turned into a DataFrame with individual numbers in the cells and MultiIndex rows that indicate the nesting, similar to ak.to_dataframe.

But this isn't always possible. If you are trying to read, for instance, both muons and electrons, the numbers of particles in these two collections are not in general (or even usually) equal to each other, so there's no single MultiIndex that they can both expand to. In that case, Uproot 4.x produces a tuple of DataFrames, one for each particle type.
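To make the mismatch concrete, here is a minimal pure-Python sketch of the row index that an "exploded" DataFrame would need. The function name `exploded_index` and the per-event counts are made up for illustration and are not part of Uproot. The point is only that two collections with different multiplicities produce different (entry, subentry) indexes, so they cannot share one DataFrame.

```python
def exploded_index(counts):
    """Return the (entry, subentry) pairs that an exploded DataFrame
    would use as its MultiIndex, given the number of particles per event."""
    return [(entry, sub) for entry, n in enumerate(counts) for sub in range(n)]


# Hypothetical multiplicities for three events (not taken from the real file).
n_muons = [2, 0, 1]
n_electrons = [1, 1, 0]

muon_index = exploded_index(n_muons)
electron_index = exploded_index(n_electrons)

print(muon_index)      # [(0, 0), (0, 1), (2, 0)]
print(electron_index)  # [(0, 0), (1, 0)]

# The indexes differ in both length and content, so no single MultiIndex
# can serve as the rows for both collections.
print(muon_index == electron_index)  # False
```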

Uproot 5.x, however, uses Akimbo to put lists of numbers into each cell (instead of individual numbers) with a normal index. That works because dataframe libraries are starting to store such data in Arrow format rather than as Python lists, so keeping lists in cells is not a big performance loss.

So you have two options:

1. Without installing any new packages, select (with expressions or filter_name) a single type of particle from the ROOT file. You can make additional calls to get the other particle types into other DataFrames.
2. Upgrade to the latest version of Uproot.
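As a rough sketch of the first option, the helper below makes one `tree.arrays` call per particle collection, so each call only sees branches of one collection. It assumes NanoAOD-style branch names with a shared prefix (such as `Muon_pt` or `Electron_pt`), which I have not checked against this particular file, and it assumes the keyword is spelled `filter_name`, as in the Uproot versions I know. The name `read_collections` is hypothetical, not an Uproot function.

```python
def read_collections(tree, prefixes, library="pd"):
    """Read each collection into its own DataFrame.

    `tree` is expected to behave like an uproot TTree, i.e. to provide
    `arrays(filter_name=..., library=...)`. `prefixes` are collection name
    prefixes such as "Muon" (an assumed naming convention).
    """
    frames = {}
    for prefix in prefixes:
        # A glob pattern like "Muon_*" selects only that collection's branches.
        frames[prefix] = tree.arrays(filter_name=f"{prefix}_*", library=library)
    return frames
```

With the `tree` from the original post, usage might look like `frames = read_collections(tree, ["Muon", "Electron"])`, followed by `frames["Muon"].columns`. If a call still returns a tuple, that prefix probably mixes branches with different multiplicities and needs a narrower pattern.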