scikit-hep / uproot5

ROOT I/O in pure Python and NumPy.
https://uproot.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Reading large doubly jagged arrays iteratively/lazily? #90

Closed tamasgal closed 4 years ago

tamasgal commented 4 years ago

Edit: sorry, some copy&paste error destroyed the intro :wink: now fixed

So basically what we need is to process many branches of doubly jagged arrays in files which do not fit into memory.

Here is an example with a small file, showing the structure of such a branch (with uproot3, which falls back to ObjectArray, and uproot4, which gives a nice awkward1 array):

import uproot
import uproot4
from km3net_testdata import data_path
print(uproot.__version__)  # 3.12.0
print(uproot4.__version__)  # 0.0.20

filename = data_path("offline/km3net_offline.root")
f = uproot.open(filename)
f4 = uproot4.open(filename)

f["E/Evt/trks/trks.rec_stages"].array()
# <ObjectArray [[[1, 3, 5, 4], [1, 3, 5], [1, 3], ...

f4["E/Evt/trks/trks.rec_stages"].array()
# <Array [[[1, 3, 5, 4], [1, ... 1], [1], [1]]] type='10 * var * var * int64'>

I uploaded a larger file here: http://131.188.167.67:8889/doubly_jagged.root for which f["E/Evt/trks/trks.rec_stages"].array() takes an extremely long time.

So I tried to utilise uproot4.lazy() and played around with uproot4.iterate(), but I have not figured out how to read the data iteratively. Here is my naive approach:

trks = uproot4.lazy({"doubly_jagged.root": "E/Evt/trks"})
trks[0, "trks.rec_stages"]

which yields:

ValueError: generated array does not conform to expected form:

{
    "class": "ListOffsetArray64",
    "offsets": "i64",
    "content": {
        "class": "ListOffsetArray64",
        "offsets": "i64",
        "content": "int32"
    }
}

but generated:

{
    "class": "ListOffsetArray64",
    "offsets": "i64",
    "content": {
        "class": "ListOffsetArray64",
        "offsets": "i64",
        "content": "int64"
    }
}

(https://github.com/scikit-hep/awkward-1.0/blob/0.2.33/src/libawkward/virtual/ArrayGenerator.cpp#L46)

I am not sure why it confuses int32 with int64; maybe it is due to empty arrays? No clue...

ping @zinebaly

jpivarski commented 4 years ago

Okay, the quick answer to your question is that you can use entry_start and entry_stop parameters to limit which TBaskets are read by an array or arrays call. If you do this in little intervals through a dataset, it could be suboptimal because the TBasket boundaries might not line up with your chosen entry_start/entry_stop ranges and you would end up re-reading the TBasket that covers different entry intervals. (Note that there's TBranch.entry_offsets to find out where a branch's TBasket boundaries are and HasBranches.common_entry_offsets to find out where a set of branches have TBasket boundaries that line up, if at all.)
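The boundary-alignment point above can be sketched in plain Python (a hypothetical helper, not part of the uproot4 API): given a branch's basket entry boundaries, such as those reported by TBranch.entry_offsets, widen a requested range outward to basket edges so that adjacent array calls never split a basket.

```python
import bisect

# Hypothetical helper (not uproot4 API): snap [entry_start, entry_stop)
# outward to the enclosing TBasket boundaries, so that array() calls on
# adjacent ranges never read part of a basket and re-read it later.
def snap_to_baskets(entry_offsets, entry_start, entry_stop):
    # entry_offsets is sorted, like what TBranch.entry_offsets reports:
    # the first entry of each basket, plus the total number of entries.
    lo = entry_offsets[bisect.bisect_right(entry_offsets, entry_start) - 1]
    hi = entry_offsets[bisect.bisect_left(entry_offsets, entry_stop)]
    return lo, hi

offsets = [0, 63658, 127316, 145000]          # made-up basket boundaries
print(snap_to_baskets(offsets, 100, 70000))   # (0, 127316): spans two baskets
print(snap_to_baskets(offsets, 0, 63658))     # (0, 63658): already aligned
```

Reading in intervals snapped this way costs the same number of basket decompressions as one big read, at the price of slightly larger chunks than requested.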

(Uproot 4 doesn't have a basket_cache like Uproot 3: Uproot 3 had too many different types of caches that required the user to have deep understanding of how ROOT files work in order to use them effectively. Also, iterate solves this problem in a better way than a blind LRU cache.)

A better way to iterate over intervals of entries is TBranch.iterate (for a single TTree) and uproot4.iterate (for a set of files). I think you know about this method and function; the advantage over calling array or arrays on intervals is that iterate knows it's sequential, so it knows to save the TBasket data from one iteration to use it in the next.
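The advantage of sequential iteration can be illustrated with a toy model (plain Python, not uproot4 internals): count basket decompressions when each step reads its baskets independently, versus when a sequential iterator remembers which baskets the previous step already touched.

```python
# Toy model of basket reuse.  "Baskets" are (start, stop) entry ranges;
# a step must decompress every basket it overlaps.
def naive_reads(basket_edges, step, total):
    """Decompressions if each step re-reads its baskets independently."""
    reads = 0
    for start in range(0, total, step):
        stop = min(start + step, total)
        reads += sum(1 for lo, hi in basket_edges if lo < stop and hi > start)
    return reads

def sequential_reads(basket_edges, step, total):
    """Same, but a basket shared with the previous step is not re-read."""
    reads, prev = 0, set()
    for start in range(0, total, step):
        stop = min(start + step, total)
        touched = {i for i, (lo, hi) in enumerate(basket_edges)
                   if lo < stop and hi > start}
        reads += len(touched - prev)   # only newly touched baskets
        prev = touched
    return reads

edges = [(0, 100), (100, 250), (250, 300)]   # three baskets, 300 entries
print(naive_reads(edges, 60, 300))           # 7 decompressions
print(sequential_reads(edges, 60, 300))      # 3: each basket read once
```

With caching of the straddling basket, the sequential iterator decompresses each basket exactly once, no matter how the step size lines up with basket boundaries.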

Lazy arrays are full of subtleties. As soon as I saw the issues they were causing, I regretted having introduced them! However, they are popular, and the right thing to do is to figure out how to do them well, which will take some experience. While we work out these issues, I always recommend that performance-minded users use array, arrays, or iterate, because at least you can control exactly when things get read. It requires more diligence to set the first argument properly (many users read all arrays, even when they plan to use only one or two), which is probably why lazy is so popular, but in ceding the responsibility of choosing what gets read to Uproot+Awkward, Uproot+Awkward will have to get better at not reading arrays prematurely.

As for the lazy Form bug you found, I've fixed it. It's because all AsObjects interpretations in Awkward Array are passed through ak.from_iter and that function has only one integer type and one floating point type (like Python, which is appropriate because ak.from_iter is intended for Python data).
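The effect is easy to reproduce outside of Awkward (a NumPy analogy, not the actual uproot4 code path): once values become Python objects, the original 32-bit width is gone, and re-inferring a type produces the platform default, typically 64-bit.

```python
import numpy as np

# Analogy for the ak.from_iter round trip: converting int32 values to
# Python objects and back loses the 32-bit width, because Python ints
# have no fixed width and the re-inferred dtype is the platform default
# (int64 on typical 64-bit Linux/macOS systems).
original = np.array([1, 2, 3], dtype=np.int32)
round_tripped = np.array(original.tolist())

print(original.dtype)       # int32
print(round_tripped.dtype)  # typically int64
```

This is exactly the mismatch in the error message above: the expected Form says int32 (from the file) but the generated array says int64 (from the Python round trip).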

The performance issue with doubly jagged arrays is related: because of TTree serialization, more than one level of jaggedness has to be interpreted AsObjects, and so it needs to go through ak.from_iter (like Uproot 3). I've been setting up the infrastructure to write the AsObjects → Awkward interpretation in Awkward's C++ layer, like a faster version of ak.ArrayBuilder because it knows its type at the outset. (That's the ak.forms.Form object.)

However, that optimization likely won't happen until next year—I'm setting things up to make it possible, but we have to do the uproot4 → uproot, uproot → uproot3 migration first. Also, before porting the read methods of all containers and models from Python to C++, I want them to fully stabilize in Python first. This includes understanding memberwise splitting (issue #38) because it comes up often enough. In assigning the is_memberwise attribute from kStreamedMemberWise, I found out that 100% of the std::map examples in our tests are memberwise—we've never seen a non-memberwise std::map.

Some more things you should know:

tamasgal commented 4 years ago

Thanks Jim for the detailed answer! Yes, of course I understand the decision regarding the caches, and I think the uproot4 design is the right choice regarding caching and laziness. In our use case, TBranch.iterate is the right choice, and I also tried it (I forgot to add those examples), but I think I am either using it wrong or there are some other issues that make it unusable in our case.

Given the file I uploaded (it should still be online), the following shows that using TBranch.iterate() (assuming I am using it correctly) is painfully slow: it takes about 30 seconds per iteration, although each entry is something like ~200 * ~10 * float64, and it only retrieves 3 items (of the roughly 145000 in the file).

So in general, I am currently still struggling to find a way to process these branches iteratively. Both trks.fitinf and trks.rec_stages are doubly jagged arrays and are needed heavily in the analysis script I am currently trying to improve -- 4 hours per file, while I think it should be just a few minutes, given the theoretical I/O and processing times. This branch format of our ROOT file is really making me cry, but it is what it is. Another problem, which I will post as a separate issue, is that working with these arrays is far from NumPy performance (doing things like arr > 0.5 takes ~2 ms for 100 entries, while in NumPy/Julia/C it should be more like a few hundred ns), but that's another story.

Here is the example:

import uproot4
f = uproot4.open("doubly_jagged.root")

i = 0
for dir_z in f["E/Evt/trks"].iterate("trks.fitinf"):
    print(i, dir_z)
    i += 1
    if i > 10:
        break

# 0 [{'trks.fitinf': [[0.00296, 0.002, -293, 139, 424, 242, 0.01, ... [], [], [], []]}]
# 1 [{'trks.fitinf': [[0.00899, 0.00561, -188, 69, 14.4, 132, 4.88, ... [], [], [], []]}]
# 2 [{'trks.fitinf': [[0.00701, 0.00372, -38.5, 33, 9.26, 44.7, ... [], [], []]}]

# then it stops...

# This one takes around 2 minutes but gives 10 entries (there are 145000 or so in this file)
f["E/Evt/trks/trks.fitinf"].array()[:10]
# <Array [[[0.00296, 0.002, -293, ... [], []]] type='10 * var * var * float64'>

tamasgal commented 4 years ago

Just to make sure this does not get lost, should I create a new issue or should we reopen this?

At least the fact that it stops after the third iteration (although there should be 145000 entries) is a bug, which might be related to the extremely slow performance. Even if those are Python loops, I think it should be orders of magnitude faster.

I ran a line profiler with the following script:

import uproot4

f = uproot4.open("doubly_jagged.root")

for dir_z in f["E/Evt/trks"].iterate("trks.fitinf"):
    print(dir_z)
    # no need to break, it will stop after the third iteration

and here is the output of kernprof -l -v ...; it only revealed that 100% of the time is spent in _ranges_or_baskets_to_arrays, and inside that function, 100% is spent on:

Total time: 499.249 s
File: /home/tgal/Dev/uproot4/uproot4/behaviors/TBranch.py
Function: _ranges_or_baskets_to_arrays at line 2972
...
...
  3086       463  498712540.0 1077132.9     99.9              interpretation_executor.submit(basket_to_array, basket)
...
...

which in turn spends 100% of its time in basket_to_array, at this line:

3062         6  490637778.0 81772963.0    100.0                  arrays[branch.cache_key] = interpretation.final_array(

jpivarski commented 4 years ago

To classify it properly, this isn't a correctness bug: I was able to iterate through the whole file using a ridiculously small step_size (in a Zoom meeting, so my fans were blasting):

for x in tree.iterate("trks.fitinf", step_size=100): print(x)

(The default is tree.num_entries_for("100 MB", "trks.fitinf"), which is 63658. That's 100 MB read from disk, which can translate into a lot of working memory with the current implementation because the current implementation has to construct Python objects, which are known to be memory hogs.)
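The default's arithmetic can be sketched as follows (a simplified stand-in for num_entries_for, with made-up branch sizes; the real method inspects the file's baskets): divide the byte budget by the branch's average bytes per entry.

```python
# Simplified stand-in for tree.num_entries_for("100 MB", branch): map a
# byte budget to an entry count via average bytes per entry.  The branch
# size below is invented for illustration.
def entries_for_budget(total_bytes, num_entries, budget_bytes):
    """Approximate how many entries fit in budget_bytes (at least 1)."""
    bytes_per_entry = total_bytes / num_entries
    return max(1, int(budget_bytes / bytes_per_entry))

print(entries_for_budget(total_bytes=290_000_000, num_entries=145_000,
                         budget_bytes=100_000_000))  # 50000
```

The point of the caveat above is that this budget is bytes read from disk; once those bytes are expanded into Python objects by the AsObjects path, the working memory can be far larger, which is why a much smaller step_size helped here.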

Data of this type are known to have poor performance for understood reasons, and there's a plan in place to fix it (move the non-vectorized loops to C++). So it wouldn't add much to create an issue; I think there's already some roadmap somewhere saying that I need to do this next year.


The plan to fix it is a major project: I've already encoded the information about Uproot Interpretations into Awkward Forms, but now I've got to translate the Python Interpretation code (not entirely finalized yet!) into a machine in C++ that steps through the TBasket byte stream using the Form as a guide—more likely a streamlined version of it because it will be runtime-interpreted for every entry. Technically, that's a virtual machine in the sense of the JVM or Python VM. (Very specific VMs can be fast, like the NumExpr VM, which is faster than precompiled NumPy. It should be highly specialized for speed, but that requires a thorough understanding of what it needs to do, including memberwise Interpretations.)

Because of your troubles, I was thinking of doing a quick, "foot in the door" implementation that handles only one type: std::vector<std::vector<T>> where T is a numeric type (bools, ints, and floats). That would force me to build the interface between Uproot and Awkward but not the full virtual machine. Since that's a very common type, it would be a good impact-to-effort ratio—helping a lot of people with a relatively small investment.

However, I see that what you actually have are not std::vector<std::vector<T>>, but

trks.fitinf          | std::vector<double>* | AsObjects(AsArray(True, False, AsVector(False, dtype('>f8'))))

That's odd enough that it wouldn't actually have a high impact. It would really just help your case and force me to set up the interface between Uproot and Awkward. I could do it for two very specific types:

which wouldn't include any of the strings. The implementation on the Awkward side would check to see if a TBranch has exactly this type and run a canned algorithm to fill it—no virtual machine. (In fact, the code that I write now would have to be removed to create the virtual machine, so I can't invest much time into it.) If you have any other types (like 64-bit integers), let me know now because I'd have to explicitly include them.

tamasgal commented 4 years ago

Ah, I should have tried much smaller step sizes. Indeed, it works with that setting!

The plan to fix it is a major project: I've already encoded the information about Uproot Interpretations into Awkward Forms, but now I've got to translate the Python Interpretation code (not entirely finalized yet!) into a machine in C++ that steps through the TBasket byte stream using the Form as a guide—more likely a streamlined version of it because it will be runtime-interpreted for every entry. Technically, that's a virtual machine in the sense of the JVM or Python VM.

Absolutely; that sounds like a massive undertaking!

I feel a bit uncomfortable stealing your time -- also given that this code will be obsolete in the future -- since you already have a lot on the roadmap for uproot4 (and other things), but it would certainly be quite nice to see what's possible using your above-mentioned approach. Regarding other types, I could follow your patch and see if I can implement those if needed, but at this very moment these are the only problematic branch types we have.

Anyway, many thanks in advance, and if you find it takes too much time, just skip it and we can go with the small-step iteration or figure out some other way. Unfortunately, this file format will likely not change in the upcoming years, so I guess we will have a few petabytes of them to process in the future :see_no_evil: ...

nsmith- commented 4 years ago

Maybe you can try harder to push the file format to change. Having non-splittable double-vectors is probably giving worse performance and compression for classic ROOT analyses as well. Can you convince your collaborators to normalize to using two sets of singly-jagged arrays? See e.g. https://github.com/cms-nanoAOD/cmssw/issues/92#issuecomment-499284307
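The restructuring nsmith- suggests can be sketched in plain Python (illustrative names, not a KM3NeT schema): store a doubly jagged branch as two singly jagged ones, per-event counts plus the inner lists flattened by one level, which is losslessly reversible.

```python
# Sketch of the suggested restructuring: one doubly jagged branch
# becomes two singly jagged branches.  Data loosely modeled on the
# rec_stages example earlier in the thread.
doubly_jagged = [[[1, 3, 5, 4], [1, 3, 5], [1, 3]], [], [[1], [1]]]

# Branch 1: how many inner lists (tracks) each event has.
n_trks = [len(event) for event in doubly_jagged]

# Branch 2: the inner lists, flattened by one level (singly jagged).
flat_stages = [inner for event in doubly_jagged for inner in event]

print(n_trks)       # [3, 0, 2]
print(flat_stages)  # [[1, 3, 5, 4], [1, 3, 5], [1, 3], [1], [1]]

# The original structure is recoverable from the two branches:
rebuilt, i = [], 0
for n in n_trks:
    rebuilt.append(flat_stages[i:i + n])
    i += n
assert rebuilt == doubly_jagged
```

Because each of the two branches is only singly jagged, both sides win: ROOT can split them, and Uproot can read them with a fast vectorized Interpretation instead of the AsObjects path.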

tamasgal commented 4 years ago

Well, it will be a hard discussion, but I will of course try. It will, however, not happen before next year, and we already have quite a large amount of data stored this way, which is constantly analysed over and over.

But I appreciate your suggestions, I'd definitely go that way and try harder.

jpivarski commented 4 years ago

These data are already split—the problem is that it's doubly jagged. See PR #96: it was the plan since CHEP (page 26/29) to pass non-vectorizable deserializations to Awkward's C++ layer—this is the first example of that.

I had been hoping that this doubly jagged array of numbers would be generalizable, but it's std::vector<double>*, not std::vector<std::vector<double>>, so it's probably not going to be usable by many. At the moment, the implementation checks for that one type and handles it, but in the long run, it needs to be a translation of the Python (deserialization.py, containers.py, and models/*.py). Since Python is easier to modify and correct, I'd like to get a lot of experience with the Python before lowering it all to C++.