scikit-hep / uproot5

ROOT I/O in pure Python and NumPy.
https://uproot.readthedocs.io
BSD 3-Clause "New" or "Revised" License

known_base_form for uproot.open #1137

Closed ekourlit closed 8 months ago

ekourlit commented 8 months ago

I recently realised that uproot.dask has a known_base_form argument. Would it make sense to add it to uproot.open as well, in order to accelerate the opening of similar files?

Relevant documentation: https://uproot.readthedocs.io/en/latest/uproot._dask.dask.html#uproot._dask.dask

In ATLAS we have a common data format, PHYSLITE, and analysers usually need to open O(1000) files that are identical in metadata and structure. If we provided the common form along with the files, we could potentially accelerate the I/O.

Tagging @jackharrison111, who is working on a project limited by I/O.

agoose77 commented 8 months ago

Hi @ekourlit, I'm not sure that known_base_form would be of much use for eager uproot.open usage. In Dask, it is useful because it lets us skip opening the file at graph-building time to figure out the required tree metadata. At runtime, however, it offers no direct performance boost to the actual reading of files. (We do use knowledge of the tree form, and of the operations that the user will perform upon it, to reduce the number of branches that are read, but that is a Dask-specific optimisation and does not require providing the form explicitly.)

Whether we can improve things here in the eager mode is not something I know much about. @jpivarski has discussed this before and I'm sure could drop a kernel of information.

What I will say is that it sounds like you'd benefit from the dask-awkward integration that makes it possible for us to read less, as an alternative approach to reading more quickly.

jpivarski commented 8 months ago

This won't do anything for eager reading, as @agoose77 pointed out. In eager mode, it's equivalent to taking the output of Uproot and feeding it through ak.zip, like this:

>>> import skhep_testdata
>>> import uproot
>>> import awkward as ak
>>> import vector
>>> vector.register_awkward()
>>>
>>> tree = uproot.open(skhep_testdata.data_path("uproot-HZZ.root"))["events"]
>>> arrays = tree.arrays(filter_name=["Electron_*", "Muon_*"])
>>> arrays.type.show()
2421 * {
    Muon_Px: var * float32,
    Muon_Py: var * float32,
    Muon_Pz: var * float32,
    Muon_E: var * float32,
    Muon_Charge: var * int32,
    Muon_Iso: var * float32,
    Electron_Px: var * float32,
    Electron_Py: var * float32,
    Electron_Pz: var * float32,
    Electron_E: var * float32,
    Electron_Charge: var * int32,
    Electron_Iso: var * float32
}
>>>
>>> restructured = ak.zip({
...     "muon": ak.zip({
...         "px": arrays.Muon_Px,
...         "py": arrays.Muon_Py,
...         "pz": arrays.Muon_Pz,
...         "E": arrays.Muon_E,
...         "charge": arrays.Muon_Charge,
...         "iso": arrays.Muon_Iso,
...     }, with_name="Momentum4D"),
...     "electron": ak.zip({
...         "px": arrays.Electron_Px,
...         "py": arrays.Electron_Py,
...         "pz": arrays.Electron_Pz,
...         "E": arrays.Electron_E,
...         "charge": arrays.Electron_Charge,
...         "iso": arrays.Electron_Iso,
...     }, with_name="Momentum4D"),
... },
...     depth_limit=1,
... )
>>>
>>> restructured.muon.pt
<Array [[54.2, 37.7], [24.4], ..., [63.6], [42.9]] type='2421 * var * float32'>
>>> restructured.muon.eta
<Array [[-0.15, -0.295], [0.754], ..., [1.06]] type='2421 * var * float32'>
>>> restructured.muon.phi
<Array [[-2.92, 0.0184], [-1.6], ..., [-0.98]] type='2421 * var * float32'>

The only difference is that known_base_form would do this restructuring by providing a Form, rather than doing it explicitly with Awkward functions.

This is beneficial in the delayed case because Dask identifies which input branches are actually used in the calculation (only Muon_Px, Muon_Py, and Muon_Pz in the above example) and does not read the rest. However, if we did the above as a Dask operation with dak.zip, all of the muon and electron fields would be identified as "used in the calculation" and the Dask workers would end up reading all of those branches. With ~1000 branches, that makes an orders-of-magnitude difference, so known_base_form was added as a backdoor to restructure the branches with the desired nesting, outside the context of tracking which branches are used in the calculation.
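To make the "which branches were used" idea concrete, here is a toy sketch in plain Python. It is not how dask-awkward actually tracks columns; TrackedColumns and muon_kinematics are hypothetical names, and the numbers are invented for illustration. The wrapper simply records every column name that the calculation reads, so only those would need to be fetched from the file.

```python
import math


class TrackedColumns:
    """Dict-like wrapper that records which column names are read (toy model)."""

    def __init__(self, columns):
        self._columns = columns
        self.touched = set()

    def __getitem__(self, name):
        self.touched.add(name)
        return self._columns[name]


def muon_kinematics(cols):
    """Compute (pt, eta, phi) for a single muon from its Cartesian momentum."""
    px, py, pz = cols["Muon_Px"], cols["Muon_Py"], cols["Muon_Pz"]
    pt = math.hypot(px, py)
    eta = math.asinh(pz / pt)
    phi = math.atan2(py, px)
    return pt, eta, phi


# Hypothetical single-particle values, not taken from any real file.
columns = {
    "Muon_Px": 3.0, "Muon_Py": 4.0, "Muon_Pz": 0.0,
    "Muon_E": 5.0, "Muon_Charge": 1, "Muon_Iso": 0.0,
    "Electron_Px": 1.0, "Electron_Py": 2.0, "Electron_Pz": 2.0,
    "Electron_E": 3.0, "Electron_Charge": -1, "Electron_Iso": 0.0,
}

tracked = TrackedColumns(columns)
pt, eta, phi = muon_kinematics(tracked)

# Only the columns the calculation touched would have to be read.
needed = sorted(tracked.touched)
print(needed)
```

Of the twelve available columns, only the three momentum components end up in `needed`. An eager restructuring step that touched every field would mark all twelve as used.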

The files passed to uproot.dask still need to be fully opened and interpreted, though this is now done on Dask workers. Knowing the Form that the data will take is not enough to know where in the file to find the data (at which byte positions), so there aren't any shortcuts taken there. Even though you have thousands of files with the same branch names and titles, the metadata describing them has to be parsed to get at the arrays of TBasket locations (fBasketEntry, fBasketSeek, and fBasketBytes).
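As a rough sketch of why those per-file arrays are needed, the function below maps an entry range onto byte ranges. This is not Uproot's implementation: baskets_for_entries is a hypothetical helper, the numbers are invented, and the layout is an assumption (basket_entry holds the first entry of each basket followed by the total number of entries; basket_seek and basket_bytes hold each basket's file offset and size).

```python
from bisect import bisect_right


def baskets_for_entries(basket_entry, basket_seek, basket_bytes, start, stop):
    """Return (seek, nbytes) for every basket overlapping entries [start, stop)."""
    # Clamp the request to the entries that actually exist.
    start = max(start, 0)
    stop = min(stop, basket_entry[-1])
    if start >= stop:
        return []
    # Index of the basket containing the first and the last requested entry.
    first = bisect_right(basket_entry, start) - 1
    last = bisect_right(basket_entry, stop - 1) - 1
    return [(basket_seek[i], basket_bytes[i]) for i in range(first, last + 1)]


# Hypothetical metadata for one branch stored in three baskets.
basket_entry = [0, 1000, 2000, 2421]   # entry boundaries, last value = total
basket_seek = [300, 9200, 18100]       # byte offset of each basket
basket_bytes = [8900, 8900, 3800]      # size of each basket in bytes

print(baskets_for_entries(basket_entry, basket_seek, basket_bytes, 900, 1100))
```

The same entry range in a different file would generally map to different offsets and sizes, which is why a shared Form alone cannot replace this lookup.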

On the other hand, that kind of shortcut can be achieved with a database full of byte positions in ROOT files at which to find the data, as well as their interpretations. This is something that we started exploring with tiled-uproot, which we talked about at an IRIS-HEP Topical meeting (video available). I still think this would be a good thing to pursue, since the metadata parsing is painful when it has to be done many times in pure Python, and this would be a way to skip that step every time after the first.
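A minimal sketch of what such a database might look like, using sqlite3 from the standard library. This is not tiled-uproot's design: the schema, the function names (open_index, record_baskets, lookup), the file name, and the interpretation string are all assumptions made for illustration.

```python
import sqlite3


def open_index(path=":memory:"):
    """Open (or create) a hypothetical index of basket byte positions."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS baskets (
               file TEXT, branch TEXT, basket INTEGER,
               first_entry INTEGER, seek INTEGER, nbytes INTEGER,
               interpretation TEXT,
               PRIMARY KEY (file, branch, basket))"""
    )
    return conn


def record_baskets(conn, file, branch, interpretation, rows):
    """Store (first_entry, seek, nbytes) rows after the first metadata parse."""
    conn.executemany(
        "INSERT OR REPLACE INTO baskets VALUES (?, ?, ?, ?, ?, ?, ?)",
        [(file, branch, i, first, seek, nbytes, interpretation)
         for i, (first, seek, nbytes) in enumerate(rows)],
    )
    conn.commit()


def lookup(conn, file, branch):
    """Fetch cached basket locations without re-parsing the file's metadata."""
    cur = conn.execute(
        "SELECT first_entry, seek, nbytes, interpretation FROM baskets "
        "WHERE file = ? AND branch = ? ORDER BY basket",
        (file, branch),
    )
    return cur.fetchall()


conn = open_index()
# Hypothetical values, as if extracted once from a file's TBranch metadata.
record_baskets(conn, "example.root", "Muon_Px", "var * float32",
               [(0, 300, 8900), (1000, 9200, 8900)])
cached = lookup(conn, "example.root", "Muon_Px")
print(cached)
```

The first read of each file would still pay the parsing cost in order to fill the table. Later reads could go straight from `lookup` to the byte ranges.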

jpivarski commented 8 months ago

I think this is a Discussion; I'm going to move it over there.