Include but don't flatten vectors from root branches

scikit-hep / root_pandas

A Python module for conveniently loading/saving ROOT files as pandas DataFrames

MIT License


Include but don't flatten vectors from root branches #46

Closed lunik1 closed 6 years ago

lunik1 commented 6 years ago

Is it possible to read a branch whose entries are vectors, but store them as array-valued entries in the dataframe rather than duplicating rows and adding the __array_index column?

chrisburr commented 6 years ago

Unfortunately pandas doesn't support this[1]. There is the option to instead implement it as a multi-index which might be cleaner, but it still results in duplication.

[1] Technically there is a Panel class for building a 3D/4D dataframe-like object however this has been deprecated in favour of xarray.
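For illustration, a minimal sketch of what the multi-index layout mentioned above could look like, built by hand with plain pandas. The data and the level names `entry` and `subentry` are made up for this example and are not root_pandas output.

```python
import numpy as np
import pandas as pd

# Toy jagged data standing in for a vector branch (hypothetical values).
vectors = [np.array([1, 2, 3]), np.array([0]), np.array([4, 4, 4, 5])]
scalars = [5, 6, 7]
lengths = [len(vec) for vec in vectors]

# One row per vector element, indexed by (entry, subentry).
index = pd.MultiIndex.from_tuples(
    [(i, j) for i, n in enumerate(lengths) for j in range(n)],
    names=["entry", "subentry"],
)
df = pd.DataFrame(
    {
        # The scalar value is still duplicated once per vector element.
        "col1": np.repeat(scalars, lengths),
        "col2": np.concatenate(vectors),
    },
    index=index,
)

# All elements belonging to entry 2.
entry_2 = df["col2"].loc[2].to_numpy()
print(df)
print(entry_2)
```

The scalar column is still repeated for every element, which is the duplication referred to above; the multi-index only makes it easier to select or group the rows of one entry.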

lunik1 commented 6 years ago

Hi,

As a quick example I was envisioning something like

import numpy as np
import pandas as pd

d = {"col1": [5, 6, 7],
     "col2": [np.array([1, 2, 3]), np.array([0]), np.array([4, 4, 4, 5])]}

df = pd.DataFrame(data=d)

which gives a dataframe

   col1          col2
0     5     [1, 2, 3]
1     6           [0]
2     7  [4, 4, 4, 5]

When reading in from ROOT, col2 would contain the cohesive arrays from TTrees which would ordinarily be split into different entries by the flattening process.
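A rough sketch of how such an array-valued column can be used once it exists, assuming a pandas version that provides `DataFrame.explode` (0.25 or later). The column names `n` and `col2_sum` are invented for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col1": [5, 6, 7],
    "col2": [np.array([1, 2, 3]), np.array([0]), np.array([4, 4, 4, 5])],
})

# Per-entry quantities computed from each stored array.
df["n"] = df["col2"].apply(len)
df["col2_sum"] = df["col2"].apply(lambda v: v.sum())

# Recover the flattened one-row-per-element layout when it is needed.
flat = df[["col1", "col2"]].explode("col2")

print(df)
print(flat)
```

So the array-per-entry form loses nothing compared with the flattened one: the flattened rows can be regenerated on demand, while each entry stays on a single row by default.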

jonas-eschle commented 6 years ago

@chrisburr I think pandas does fully support this (well, not as a real "multi-dimensional DataFrame", but I don't think that is what is wanted here). Actually, even root_numpy does :)

@lunik1 it is as easy as:

import pandas as pd
import root_numpy

ar1 = root_numpy.root2array(filenames="rootfile.root", treename='tree', branches=['branch_with_array'])
df = pd.DataFrame(ar1)

will give you a perfect DataFrame, even the columns are named in the correct way.

(I personally find handling of arrays to be very unfortunate in root_pandas, the main reason I still stick with root_numpy...)

I would also be very interested in root_pandas not forbidding this, since it currently limits itself actively (as far as I remember, it checks for arrays and refuses to convert them).

So any plans in this direction?

chrisburr commented 6 years ago

Seems I'm outdated/incorrect about the functionality of both pandas and root_numpy.

This could be much nicer, I'll look into it.

chrisburr commented 6 years ago

@lunik1 @mayou36

Thanks for those snippets; root_pandas now supports this as of v0.3.1.

It turns out pandas doesn't support multidimensional data. You can, however, store arbitrary Python objects inside dataframes, such as numpy arrays. This works out of the box for jagged arrays, but it takes a little more work if the array is square, i.e. every entry has the same length (root_pandas now does this automatically). For example:

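import numpy as np
import pandas as pd

scalar = np.random.random(5)
array = np.random.random((5, 3))
# dtype=object is needed on recent numpy versions to build a jagged array
jagged_array = np.array([np.random.random(i) for i in range(5)], dtype=object)

# This works
df = pd.DataFrame({'a': scalar, 'b': jagged_array})

# This raises an error because 'b' is 2-dimensional
# ("Exception: Data must be 1-dimensional"; the exact wording depends on the pandas version)
df = pd.DataFrame({'a': scalar, 'b': array})

# This also works: store each row as one element of a 1-D object array
new_array = np.zeros(len(array), dtype='O')
for i, row in enumerate(array):
    new_array[i] = row
df = pd.DataFrame({'a': scalar, 'b': new_array})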
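The workaround above can be wrapped in a small helper. This is only a sketch of the idea: `as_object_column` is a hypothetical name, not a function from root_pandas, and it is not necessarily how root_pandas does the conversion internally.

```python
import numpy as np
import pandas as pd


def as_object_column(values):
    """Pack each row of a 2-D (or jagged) input into a 1-D object array."""
    out = np.empty(len(values), dtype=object)
    for i, row in enumerate(values):
        # Assigning by integer index stores the whole row as one element.
        out[i] = np.asarray(row)
    return out


rng = np.random.default_rng(0)
scalar = rng.random(5)
array = rng.random((5, 3))

df = pd.DataFrame({"a": scalar, "b": as_object_column(array)})

# The original 2-D array can be rebuilt by stacking the stored rows.
restored = np.stack(df["b"].to_numpy())
print(df)
```

Filling a preallocated object array element by element avoids numpy turning equal-length rows back into a 2-D array, which is what pandas rejects.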