Closed jpivarski closed 3 weeks ago
Looks like support for steering Series.explode from an extension array was only recently added to pandas! https://github.com/pandas-dev/pandas/pull/53602 Even then, it looks like it's not extensively documented and was only added to Arrow types, not to the extension array interface in general 🤔
In the latest released version of pandas (2.0.3), the Series.explode method will just return a copy of the Series if is_object_dtype(s) returns False (which is the case when s.dtype == "awkward", since that is not an object dtype).
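To make the no-op behaviour concrete, here is a minimal sketch using only pandas. It assumes a plain int64 Series can stand in for any non-object dtype (such as awkward); the variable names are illustrative and awkward-pandas itself is not used.

```python
import pandas as pd
from pandas.api.types import is_object_dtype

# Object-dtype Series holding Python lists: explode() expands each list.
s_obj = pd.Series([[0.1, 0.2, 0.3], [0.4]])
print(is_object_dtype(s_obj))
exploded = s_obj.explode()
print(exploded)

# Non-object dtype (int64 here, standing in for "awkward"): explode()
# takes the fallback path and hands back an unchanged copy.
s_int = pd.Series([1, 2, 3])
unchanged = s_int.explode()
print(unchanged.equals(s_int))
```

The second call raises no error or warning, which is why an explode on an awkward-dtype column appears to "do nothing".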
We can add the necessary method (_explode) to our extension array, but Series.explode won't dispatch to it. A workaround would be to convert the awkward type to an Arrow type and then do the explode (as of right now, this would only work if you were working from the HEAD of the pandas repository).
Finally, we can also submit a patch upstream to pandas to see if we can get Series.explode to support any extension type.
An update here: I opened https://github.com/pandas-dev/pandas/pull/54834, which has the potential to go into pandas 2.2.0.
Hi @douglasdavis thanks for this. I've run up against the same problem today, with a ROOT dataset imported using uproot. I see:
df['x'] ... Name: x, Length: 100, dtype: awkward
Can you clarify the process to 'convert to an arrow type and then do the explode'? Only some of my columns are awkward arrays...
Here's a trivial example. It starts with a simple ROOT dataset (root -q mymacro.C):
void mymacro() {
    TFile *f = new TFile("myfile.root", "recreate");
    TTree *t = new TTree("mytree", "mytree");
    std::vector<float> v_pt({0.1, 0.2, 0.3});
    auto branch = t->Branch("pt", &v_pt);
    for (int i = 0; i < 3; i++)
    {
        v_pt.clear();
        v_pt.push_back(0.1);
        v_pt.push_back(0.2);
        v_pt.push_back(0.3);
        t->Fill();
    }
    t->Write();
    f->Close();
}
then try using uproot to read the thing back:
Is there any easy way to make this work? Thanks so much in advance!
I should say, I'm currently on pandas 2.0.0
Can you perhaps phrase this in terms of awkward and pandas alone, so that we can make a simple test function of expected functionality?
Is https://github.com/intake/awkward-pandas/pull/46 perhaps exactly what you need?
Was just going to add I see exactly the same behaviour with pandas 2.2.1. I'll have a crack at rephrasing this without root
@martindurant https://github.com/intake/awkward-pandas/pull/46 looks very relevant, but I'm not really sure how to apply it to my dataframe to enable the explosion
You would need to install from that branch, and it should "just work". At least, I think - from the screenshots, I'm not certain what the expected output would be.
Interesting. This works fine (with pandas 2.2.1) if I create the dataframe of awkward arrays myself:
But the difference is that pt now gets dtype object instead of awkward.
Given that the output is numbers, its type should probably be int/float. I suppose type awkward would be OK too and maybe the easiest to apply consistently. Can you please include your snippet in the PR, we'll make it a test and make sure we fix it.
Very sorry to ask a potentially dumb question, @martindurant, but how do I install from that branch? Would that be something like:
pip install git+https://github.com/douglasdavis/awkward-pandas/tree/dev-explode
?
OK I figured out how to do it (pip install git+https://github.com/douglasdavis/awkward-pandas@dev-explode), and here is the result!
Thanks a lot for the pointers, @martindurant !!!
I'm forwarding this Gitter question from @gozmit97 to make sure that it doesn't scroll away without getting answered. It sounds like it could be an issue.
I'm trying to expand the arrays in the columns of my dataframe, in the style of this Stack Overflow page, using explode. Using sample code like the following (converted to Awkward arrays to match my data):
I'm able to explode all my columns using
[df[col].explode(ignore_index=True) for col in df]
as wanted. However, running it on my own data seems to do nothing. Upon a bit of snooping around, the only noticeable difference I've been able to find between the series in the sample code above (say in column2) and in my data is that my data has dtype=Awkward and the sample code has dtype=object (see below for samples, the first being the sample code and the second being my image with changed data).
If you have any ideas as to what may be happening here, or suggestions on how I might otherwise turn each of my rows with arrays into one row for each element in that array, do let me know. pd.explode worked prior, but after changing how I saved and loaded my data it stopped working.
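For reference, the object-dtype case described in the question can be sketched as follows. The frame here is a hypothetical stand-in (the original sample code was not preserved), with column names chosen to match the "column2" mentioned above; it shows the behaviour that works and does not reproduce the awkward-dtype failure.

```python
import pandas as pd

# Hypothetical stand-in for the sample frame: object-dtype columns of lists.
df = pd.DataFrame({
    "column1": [[1, 2], [3]],
    "column2": [[10, 20], [30]],
})

# Explode each column separately, as in the question, then stitch the
# resulting Series back together side by side.
parts = [df[col].explode(ignore_index=True) for col in df]
flat = pd.concat(parts, axis=1)
print(flat)

# Since pandas 1.3, DataFrame.explode also accepts a list of columns,
# provided the lists in each row have matching lengths.
same = df.explode(["column1", "column2"], ignore_index=True)
print(same)
```

Checking df.dtypes before exploding is a quick way to tell whether a column will take this path (object) or silently come back unchanged (awkward).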