scikit-hep / uproot5

ROOT I/O in pure Python and NumPy.
https://uproot.readthedocs.io
BSD 3-Clause "New" or "Revised" License

arrays(library="pd") returns a tuple #583

Closed dcervenkov closed 2 years ago

dcervenkov commented 2 years ago

I have a ROOT file that I want to read with uproot and export as a pandas DataFrame. The command I'm using is

df = uproot.open("dvntuple.root")["Hlt2Dstp2D0Pip_D02KmKpPimPip_Tuple/DecayTree"].arrays(library="pd", entry_stop=8)

However, df is a tuple rather than a pandas.core.frame.DataFrame. It's a tuple of two different DataFrames: df[0] has 8 rows and 893 columns, and df[1] has 9 rows and 804 columns.

Is this expected?

I couldn't create a smaller test file as the file created by running the following command doesn't cause this behavior.

rooteventselector -l10 dvntuple.root:Hlt2Dstp2D0Pip_D02KmKpPimPip_Tuple/DecayTree mwe.root

The original largish file (1.7 GB) is available here.

uproot 4.2.2

jpivarski commented 2 years ago

This is intentional; it's because the TBranches you're asking for are jagged with different-length lists in the arrays. (You're asking for all TBranches because you don't specify any.) "Differently jagged" arrays can't be put into a single DataFrame because we represent the jaggedness in DataFrames via a MultiIndex, and a DataFrame can only have one index.

There's a more complete explanation in Discussion https://github.com/scikit-hep/uproot4/discussions/114.
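As a minimal sketch of why one index can't serve both (with made-up jagged branches px and q whose list lengths disagree per entry), the two MultiIndexes conflict, so a naive combination would have to pad with NaNs:

```python
import pandas as pd

# Hypothetical jagged branches, flattened to (entry, subentry) rows the way
# uproot represents jaggedness in a DataFrame. px has list lengths [1, 2]
# per entry, q has [2, 1], so their MultiIndexes differ.
px = pd.Series(
    [1.1, 2.2, 3.3],
    index=pd.MultiIndex.from_tuples(
        [(0, 0), (1, 0), (1, 1)], names=["entry", "subentry"]
    ),
)
q = pd.Series(
    [1, -1, 1],
    index=pd.MultiIndex.from_tuples(
        [(0, 0), (0, 1), (1, 0)], names=["entry", "subentry"]
    ),
)

# Forcing them into one DataFrame introduces NaNs where the indexes disagree:
combined = pd.concat({"px": px, "q": q}, axis=1)
print(combined.isna().any().any())  # True
```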

dcervenkov commented 2 years ago

Excellent, thanks for the link to the more detailed discussion. However, after reading the discussion, I'm unsure what differences to expect between the individual DataFrames in the tuple, how many DataFrames to expect, and how many columns each of them should have.

jpivarski commented 2 years ago

You'll get the minimum possible number of DataFrames: all TBranches with alignable jaggedness will be in the same DataFrame, and they won't be in any of the other DataFrames (they'll be exclusively partitioned among DataFrames).

The non-jagged TBranches will be in all of the DataFrames, since their values need to get duplicated to make them fit into each of the DataFrames, and it's a different duplication in each.

If you want more control over the situation, specify an explicit list of TBranches to read. (You'll want that when you scale up to large datasets anyway, since you'll need to start controlling memory use.) The TBranches that can all go in one DataFrame will be the ones that all correspond to the same particle: you'll have a muon DataFrame, an electron DataFrame, a jet DataFrame, etc. If you're selecting TBranches that you think all have the same jaggedness, such as all the muon attributes, but you get more than one DataFrame back, then they don't really have the same jaggedness and you've discovered a bug in your data preparation. (This has happened before; I can't find the link, but somebody was surprised by how the DataFrames split up, and it turned out that a cut applied to one attribute was slightly different from the cut applied to all the others. So it's good information!)

If this sounds cumbersome, you might want to avoid using Pandas. Pandas's range of data structures is not a good fit for general HEP data; we support it in Uproot because it's good for some simple cases. Ask yourself, for instance, how you're going to do combinatorics with the data in DataFrames, or any of the other steps you're going to need. All of the steps of applying cuts and making plots are exactly as easy to do outside of Pandas as within it. What Pandas adds is the ability to do SQL-like joins, which we don't often do in HEP.

dcervenkov commented 2 years ago

Thanks for the detailed reply. I did not produce the ROOT file and I think I need to look into how it is made to understand what jaggedness it has. A way to see which branches are jagged would help; does this exist?

I don't understand the argument for why the non-jagged branches need to be in all DataFrames; I think that is the bit that confuses most users (judging by past issues). To me, it would make more sense if all the non-jagged entries were in one DataFrame and each jagged group in a separate DataFrame.

Yep, I'll definitely load just a subset of branches for proper analysis, I was just surprised by the tuple of datasets when exploring.

I'm not adamant about using Pandas; your Awkward Arrays in particular look very interesting. However, the documentation has a lot of TODOs at this point, which is what's stopping me.

jpivarski commented 2 years ago

A way to see which branches are jagged would help; does this exist?

A TBranch's typename gives the C++ name of the type, though it might be more useful to look at the interpretation: different C++ types, like float[] and std::vector<float>, both get the interpretation AsJagged(AsDtype("float32")).

For a quick look, you can get an ASCII table of all of them by calling TTree's show method.

I don't understand the argument for why the non-jagged branches need to be in all DataFrames; I think that is the bit that confuses most users (judging by past issues). To me, it would make more sense if all the non-jagged entries were in one DataFrame and each jagged group in a separate DataFrame.

The non-jagged data are in the DataFrames so that they can be related to the jagged data. It's reasonable, for instance, to want to know which event number or MET value to associate with a given muon or jet. If they were separate, you'd have to use a Pandas merge to duplicate them appropriately, as in this minimal example:

>>> import awkward as ak
>>> array = ak.Array([{"x": 1.1, "y": [1]}, {"x": 2.2, "y": [1, 2]}, {"x": 3.3, "y": [1, 2, 3]}])
>>> ak.to_pandas(array)
                  x  y
entry subentry        
0     0         1.1  1
1     0         2.2  1
      1         2.2  2
2     0         3.3  1
      1         3.3  2
      2         3.3  3

This is a matter of defaults. Uproot (and Awkward Array) combine as much as possible into a single DataFrame, but not more than is possible (hence the tuple in the original question). A different default would be to keep the non-jagged data separate from the jagged data, making users do the merge manually. With that default, this function would almost always return a tuple, because most TTrees have both jagged and non-jagged data. A more extreme default would be to return every TBranch separately (as Series) and make the user put everything together, but that would be nearly the same thing as library="np".
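For contrast, here's a sketch of what the alternative default would require (with hypothetical per-event MET and per-muon px data): a manual merge that duplicates the non-jagged values.

```python
import pandas as pd

# Hypothetical split into a non-jagged (per-event) and a jagged (per-muon) frame.
events = pd.DataFrame({"MET": [10.0, 20.0]}, index=pd.Index([0, 1], name="entry"))
muons = pd.DataFrame(
    {"px": [1.1, 2.2, 3.3]},
    index=pd.MultiIndex.from_tuples(
        [(0, 0), (1, 0), (1, 1)], names=["entry", "subentry"]
    ),
)

# Relating MET to each muon means duplicating MET across that event's muons:
merged = muons.reset_index().merge(events.reset_index(), on="entry")
print(merged["MET"].tolist())  # [10.0, 20.0, 20.0]
```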

To deviate from the default, select sets of TBranches explicitly. Here's an example that selects just the non-jagged (AsDtype) ones:

>>> import uproot, skhep_testdata
>>> tree = uproot.open(skhep_testdata.data_path("uproot-HZZ.root"))["events"]

>>> # check your filter in tree.keys to see what it gives you, without reading data
>>> tree.keys(filter_branch=lambda branch: isinstance(branch.interpretation, uproot.AsDtype))
['NJet', 'NMuon', 'NElectron', 'NPhoton', 'MET_px', 'MET_py', 'MChadronicBottom_px', 'MChadronicBottom_py',
 'MChadronicBottom_pz', 'MCleptonicBottom_px', 'MCleptonicBottom_py', 'MCleptonicBottom_pz',
 'MChadronicWDecayQuark_px', 'MChadronicWDecayQuark_py', 'MChadronicWDecayQuark_pz',
 'MChadronicWDecayQuarkBar_px', 'MChadronicWDecayQuarkBar_py', 'MChadronicWDecayQuarkBar_pz',
 'MClepton_px', 'MClepton_py', 'MClepton_pz', 'MCleptonPDGid', 'MCneutrino_px', 'MCneutrino_py',
 'MCneutrino_pz', 'NPrimaryVertices', 'triggerIsoMu24', 'EventWeight']

>>> # if it's good, apply it to tree.arrays to read the data
>>> tree.arrays(filter_branch=lambda branch: isinstance(branch.interpretation, uproot.AsDtype), library="pd")
      NJet  NMuon  NElectron  ...  NPrimaryVertices  triggerIsoMu24  EventWeight
0        0      2          0  ...                 6            True     0.009271
1        1      1          0  ...                18            True     0.000331
2        0      2          0  ...                16            True     0.005080
3        3      2          0  ...                 8            True     0.007081
4        2      2          2  ...                 2            True     0.008536
...    ...    ...        ...  ...               ...             ...          ...
2416     1      1          0  ...                 9            True     0.009260
2417     2      1          0  ...                18            True     0.000331
2418     1      1          0  ...                 9            True     0.004153
2419     2      1          0  ...                 6            True     0.008829
2420     0      1          0  ...                12            True     0.008755

[2421 rows x 28 columns]

I'm not adamant about using Pandas; your Awkward Arrays in particular look very interesting. However, the documentation has a lot of TODOs at this point, which is what's stopping me.

I need to just remove the pages that have TODOs in them. There's a lot of documentation and the project's stable—I was just overeager in how much I thought I would have time to write about. I got the same comment here; the TODOs are giving the wrong impression.

dcervenkov commented 2 years ago

Excellent, the show() method and branch filtering based on interpretation give me everything I need to do what I want. Thanks for the tips and helpful examples!

I understand your reasoning about the defaults, and while it's not what I'd choose, it's a fair choice. It'd be great if the workaround you posted here (branch filtering based on interpretation) could be mentioned in the documentation.

Regarding Awkward documentation - please don't remove the pages! The documentation overview looks great, and I'm aware that the project is relatively mature, but it is difficult to use it instead of, e.g., Pandas, if the documentation is not there. My first check was how to filter the data quickly, e.g., retain only events where col1 > 3 and col2 < 0. Then I saw that the page doesn't exist yet, which means I'm back to Pandas for exploratory analysis (its query() is most useful and easy to use).

jpivarski commented 2 years ago

My first check was how to filter the data quickly, e.g., retain only events where col1 > 3 and col2 < 0. Then I saw that the page doesn't exist yet, which means I'm back to Pandas for exploratory analysis (its query() is most useful and easy to use).

Because that information is here: https://awkward-array.readthedocs.io/en/latest/_auto/ak.Array.html#filtering

The existence of a page that says "TODO" gives the impression that this information hasn't been written down, rather than having been written down in another place. What I'm intending to do (cf. https://github.com/scikit-hep/awkward-1.0/issues/1366#issuecomment-1065256551) is simplify its organization so that readers are funneled into the places where the things they're looking for have been written.

dcervenkov commented 2 years ago

Aha! Well, I did not realize that the documentation is in two places. Perhaps adding links to the old pages on the unfinished new pages could be a quick stop-gap solution. I'll comment on the issue you linked.