Closed lunik1 closed 6 years ago
Unfortunately pandas doesn't support this[1]. There is the option to instead implement it as a multi-index which might be cleaner, but it still results in duplication.
[1] Technically there is a Panel class for building a 3D/4D dataframe-like object; however, this has been deprecated in favour of xarray.
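To make the multi-index option concrete, here is a minimal sketch using the same example values that appear elsewhere in this thread. The level names `entry` and `subentry` are assumptions chosen for illustration, not something root_pandas produces.

```python
import numpy as np
import pandas as pd

# Jagged per-entry arrays (illustrative values).
col1 = [5, 6, 7]
col2 = [np.array([1, 2, 3]), np.array([0]), np.array([4, 4, 4, 5])]

# One row per array element, indexed by (entry, subentry).
index = pd.MultiIndex.from_tuples(
    [(entry, sub) for entry, arr in enumerate(col2) for sub in range(len(arr))],
    names=["entry", "subentry"],
)
multi_df = pd.DataFrame(
    {
        # Scalars must be repeated once per element: this is the duplication.
        "col1": np.repeat(col1, [len(arr) for arr in col2]),
        "col2": np.concatenate(col2),
    },
    index=index,
)

# All elements belonging to entry 2, selected via the outer index level.
entry_2 = multi_df["col2"].loc[2].to_numpy()
```

The scalar column is still stored once per array element, so the duplication remains; the index only makes it easier to regroup elements by entry.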
Hi,
As a quick example I was envisioning something like
import numpy as np
import pandas as pd
d = {"col1": [5, 6, 7],
     "col2": [np.array([1, 2, 3]), np.array([0]), np.array([4, 4, 4, 5])]}
df = pd.DataFrame(data=d)
which gives a dataframe
   col1          col2
0     5     [1, 2, 3]
1     6           [0]
2     7  [4, 4, 4, 5]
When reading in from ROOT, col2 would contain the cohesive arrays from TTrees which would ordinarily be split into different entries by the flattening process.
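For comparison, the flattened layout can be reproduced from the array-valued dataframe in plain pandas. This is only a sketch: `DataFrame.explode` stands in for the flattening step and is not how root_pandas implements it, and the `entry` column name is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col1": [5, 6, 7],
    "col2": [np.array([1, 2, 3]), np.array([0]), np.array([4, 4, 4, 5])],
})

# Flatten: one row per array element, with col1 repeated for each element.
flat = df.explode("col2").rename_axis("entry").reset_index()

# Position of each element within its original array (0, 1, 2, ...).
flat["__array_index"] = flat.groupby("entry").cumcount()
```

Here `df` keeps three rows, one per entry, while `flat` has eight rows, one per array element.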
@chrisburr I think pandas does support this (well, not a real "multi-dimensional DataFrame", but I don't think that is what is wanted here), and actually even root_numpy does :)
@lunik1 it is as easy as:
ar1 = root_numpy.root2array(filenames="rootfile.root", treename='tree', branches=['branch_with_array'])
df = pd.DataFrame(ar1)
will give you a perfect DataFrame, even the columns are named in the correct way.
(I personally find handling of arrays to be very unfortunate in root_pandas, the main reason I still stick with root_numpy...)
I would also be very interested in root_pandas not forbidding this, as it currently restricts itself actively (as far as I remember, it checks for arrays and does not allow them for conversion).
So any plans in this direction?
Seems I'm outdated/incorrect about the functionality of both pandas and root_numpy.
This could be much nicer, I'll look into it.
@lunik1 @mayou36
Thanks for those snippets, root_pandas now supports this as of v0.3.1.
It turns out pandas doesn't support multidimensional data. You can, however, store arbitrary Python objects inside dataframes, such as numpy arrays. This works out of the box for jagged arrays, but it takes a little more work if the array is rectangular, i.e. every row has the same length (which root_pandas now does automatically). For example:
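import numpy as np
import pandas as pd
scalar = np.random.random(5)
array = np.random.random((5, 3))
# dtype=object is required for jagged input on recent NumPy versions
jagged_array = np.array([np.random.random(i) for i in range(5)], dtype=object)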
# This works
df = pd.DataFrame({'a': scalar, 'b': jagged_array})
# This raises an error such as "Data must be 1-dimensional" (exact type and message depend on the pandas version)
df = pd.DataFrame({'a': scalar, 'b': array})
# This also works
new_array = np.zeros(len(array), dtype='O')
for i, row in enumerate(array):
new_array[i] = row
df = pd.DataFrame({'a': scalar, 'b': new_array})
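The wrapping loop above can be pulled into a small helper. This is a sketch only: `rows_as_objects` is a hypothetical name, not a function from root_pandas, and deterministic values are used in place of random ones so the round trip is easy to check.

```python
import numpy as np
import pandas as pd

def rows_as_objects(array):
    """Wrap each row of an N-dimensional array as one element of a 1D object array."""
    wrapped = np.empty(len(array), dtype=object)
    for i, row in enumerate(array):
        wrapped[i] = row
    return wrapped

scalar = np.arange(5.0)
array = np.arange(15.0).reshape(5, 3)

df = pd.DataFrame({"a": scalar, "b": rows_as_objects(array)})

# Each cell of "b" holds a length-3 array; stacking the cells recovers the 2D array.
restored = np.stack(list(df["b"]))
```

Because the column has object dtype, vectorised numeric operations do not apply to it directly; stacking the cells back into a 2D array, as in the last line, is one way to get them back.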
Is it possible to read a branch whose entries are vectors, but store them as array-valued entries in the dataframe rather than duplicating rows and adding the __array_index column?