intake / akimbo

For when your data won't fit in your dataframe
https://akimbo.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Is `explode` working properly? #38

Open jpivarski opened 11 months ago

jpivarski commented 11 months ago

I'm forwarding this Gitter question from @gozmit97 to make sure that it doesn't scroll away without getting answered. It sounds like it could be an issue.


I'm trying to expand the arrays in the columns of my dataframe, in the style of this stackoverflow page, using explode. I'm using sample code like the following (converted to Awkward arrays to match my data):

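import awkward as ak
import numpy as np
import pandas as pd

# Two records, each holding three length-2 arrays, one per column
df = pd.DataFrame.from_records(
    data=ak.Array(np.random.random((2, 3, 2)).round(3)),
).add_prefix("column")

print(df)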

I'm able to explode all my columns using [df[col].explode(ignore_index=True) for col in df] as wanted. However, running it on my own data seems to do nothing. After a bit of snooping around, the only noticeable difference I've been able to find between the series from the sample code above (say, column2) and the series in my data is that my data has dtype=awkward and the sample code has dtype=object. (See below for samples: the first is from the sample code and the second is from my own data, with the values changed.)

0    [0.675, 0.485]
1    [0.317, 0.865]
Name: column2, dtype: object

0  [123.456, 789.000]
Name: A, dtype: awkward

If you have any ideas as to what may be happening here, or suggestions for another way to turn each of my rows with arrays into one row per array element, do let me know. explode worked previously, but it stopped working after I changed how I saved and loaded my data.
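
For reference, the object-dtype case that does work can be sketched with plain Python lists rather than Awkward arrays. The column names and values below are made up for illustration. Besides the per-column loop from the question, `DataFrame.explode` accepts a list of columns (pandas 1.3 and later), provided the lists in each row have matching lengths:

```python
import pandas as pd

# Hypothetical data: every cell holds a plain Python list, so dtype is object.
df = pd.DataFrame(
    {
        "column0": [[0.1, 0.2], [0.3, 0.4]],
        "column1": [[1.0, 2.0], [3.0, 4.0]],
    }
)

# Per-column explode, as in the question, stitched back into one frame.
per_column = pd.concat(
    [df[col].explode(ignore_index=True) for col in df], axis=1
)

# Exploding all columns in a single call.
all_at_once = df.explode(list(df.columns), ignore_index=True)

print(per_column)
print(all_at_once)
```

Both approaches give one row per list element here. Neither changes the behaviour for a column whose dtype is awkward, which is the case discussed in the replies.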

douglasdavis commented 11 months ago

Looks like support for steering Series.explode from an extension array was only recently added to pandas! https://github.com/pandas-dev/pandas/pull/53602 Even then it looks like it's not extensively documented and was only added to Arrow types and not the extension array interface in general 🤔

In the latest released version of pandas (2.0.3), the Series.explode method just returns a copy of the series if is_object_dtype(s) returns False, which is the case when s.dtype == "awkward".
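
The effect of that dtype check can be seen without awkward-pandas installed. This is only a sketch: the nullable `Int64` series is a stand-in for any non-object extension dtype and is not the awkward dtype itself.

```python
import pandas as pd

# Object-dtype series of lists: explode produces one row per element.
lists = pd.Series([[0.675, 0.485], [0.317, 0.865]], name="column2")
exploded = lists.explode(ignore_index=True)
print(lists.dtype, len(lists), "->", len(exploded))

# Stand-in for a non-object extension dtype: explode hands back
# an unchanged copy instead of raising an error.
not_object = pd.Series([1, 2, 3], dtype="Int64", name="A")
unchanged = not_object.explode()
print(unchanged.equals(not_object))
```

Because no error is raised in the second case, the call appears to "do nothing", which matches the behaviour reported above.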

We can add the necessary method (_explode) to our extension array, but Series.explode won't dispatch to it. A workaround would be to convert the awkward type to an arrow type and then do the explode. (as of right now this would only work if you were working from the HEAD of the pandas repository)

Finally, we can also submit a patch upstream to pandas to see if we can get Series.explode to support any extension type

douglasdavis commented 10 months ago

An update here: opened https://github.com/pandas-dev/pandas/pull/54834 which has potential to go into pandas 2.2.0

CrfzdPQM6 commented 3 months ago

Hi @douglasdavis thanks for this. I've run up against the same problem today, with a ROOT dataset imported using uproot. I see:

df['x'] ... Name: x, Length: 100, dtype: awkward

Can you clarify the process to 'convert to an arrow type and then do the explode'? Only some of my columns are awkward arrays...

CrfzdPQM6 commented 3 months ago

Here's a trivial example. It starts with a simple root dataset (root -q mymacro.C)

void mymacro() {
  TFile *f = new TFile("myfile.root","recreate");
  TTree *t = new TTree("mytree", "mytree");
  std::vector<float> v_pt({0.1,0.2,0.3});

  auto branch = t->Branch("pt", &v_pt);

  for (int i =0 ; i < 3; i++)
  {
    v_pt.clear();
    v_pt.push_back(0.1);
    v_pt.push_back(0.2);
    v_pt.push_back(0.3);
    t->Fill();
  }
  t->Write();
  f->Close();
}

then try using uproot to read the thing back: [screenshot]

CrfzdPQM6 commented 3 months ago

Is there any easy way to make this work? Thanks so much in advance!

CrfzdPQM6 commented 3 months ago

I should say, I'm currently on pandas 2.0.0

martindurant commented 3 months ago

Can you perhaps phrase this in terms of awkward and pandas alone, so that we can make a simple test function of expected functionality?

Is https://github.com/intake/awkward-pandas/pull/46 perhaps exactly what you need?

CrfzdPQM6 commented 3 months ago

Was just going to add I see exactly the same behaviour with pandas 2.2.1. I'll have a crack at rephrasing this without ROOT. [screenshot]

CrfzdPQM6 commented 3 months ago

@martindurant https://github.com/intake/awkward-pandas/pull/46 looks very relevant, but I'm not really sure how to apply it to my dataframe to enable the explosion

martindurant commented 3 months ago

You would need to install from that branch, and it should "just work". At least, I think - from the screenshots, I'm not certain what the expected output would be.

CrfzdPQM6 commented 3 months ago

Interesting. This works fine (with pandas 2.2.1) if I create the dataframe of awkward arrays myself: [screenshot]

CrfzdPQM6 commented 3 months ago

But the difference is that pt now gets a dtype object instead of awkward

martindurant commented 3 months ago

Given that the output is numbers, its type should probably be int/float. I suppose type awkward would be OK too, and maybe the easiest to apply consistently. Can you please include your snippet in the PR? We'll make it a test and make sure we fix it.

CrfzdPQM6 commented 3 months ago

Very sorry to ask a potentially dumb question, @martindurant, but how do I install from that branch? Would that be something like:

pip install git+https://github.com/douglasdavis/awkward-pandas/tree/dev-explode

?

CrfzdPQM6 commented 3 months ago

OK I figured out how to do it (pip install git+https://github.com/douglasdavis/awkward-pandas@dev-explode), and here is the result! [screenshot]

CrfzdPQM6 commented 3 months ago

Thanks a lot for the pointers, @martindurant !!!