It might also be important to mention the differences in the pandas versions in order to reproduce the results?
I've also noticed the same issue with `uproot.open('someNANOAOD12.root:Events').arrays(lib='ak')`. This file has 680 entries and a size of 4.053 MB. I am timing only the operation of getting the arrays from the tree object, and it takes 20.8 seconds.

I never had such issues before with much larger files on an older version of uproot (I think it was still 5, but an earlier release); I am now using version 5.4.1. I am not sure whether it's an issue with uproot or awkward, but I lean toward uproot, since my awkward version is pinned in the virtual env.
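For reference, a minimal sketch of this kind of timing (the file name is taken from the description above as a placeholder; only the array-reading step is measured):

```python
import time
import uproot

# Hypothetical NanoAOD file; uproot.open("file.root:Events") returns the TTree.
with uproot.open("someNANOAOD12.root:Events") as events:
    start = time.perf_counter()
    arrays = events.arrays(library="ak")  # read all branches as an awkward Array
    print(f"arrays() took {time.perf_counter() - start:.1f} s")
```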
Since this popped up again in a Belle II internal question, I'll post a reproducer here:
The issue is mainly visible for TTrees with many baskets. Using fewer/larger baskets seems to mitigate the issue, but maybe there is still something that can be improved on the uproot side. So to reproduce we can make a TTree with many baskets:
```python
import uproot
import numpy as np

num_baskets = 1_000
num_per_basket = 1_000
num_branches = 50

with uproot.recreate("test.root") as f:
    for i_chunk in range(num_baskets):
        # each call to extend() writes one new basket per branch
        chunk = {f"branch{i}": np.random.rand(num_per_basket) for i in range(num_branches)}
        if "tree" in f:
            f["tree"].extend(chunk)
        else:
            f["tree"] = chunk
```
This TTree can be read with uproot into a pandas DataFrame, in both uproot4 and uproot5, e.g. with:
```python
import uproot
import numpy
import time

def load(filename):
    with uproot.open({filename: "tree"}) as tree:
        return tree.arrays(library="pd")
```
For the runtime I get something like the following on my laptop:

uproot4:

```
>>> start = time.time(); df = load("test.root"); print(time.time() - start)
5.7701661586761475
```

uproot5:

```
>>> start = time.time(); df = load("test.root"); print(time.time() - start)
71.41624069213867
```
Note: I struggled a bit getting an environment running with uproot4. I didn't have any success with plain venv/pip (the automatically resolved numpy version is incompatible, and I wasn't able to install an older one), but it worked with conda-forge/pixi with the following setup:
```
pixi init uproot4
cd uproot4
pixi add python 'uproot==4' 'numpy<1.20' 'pandas<2' setuptools
pixi run python
```
Update: I found out that the performance regression came in somewhere between uproot versions 5.3.3 (which still seems to be as fast as v4) and 5.4.1.
Note to self: I'm looking at this. https://github.com/scikit-hep/uproot5/compare/v5.3.3...v5.4.1
git bisect tells me this commit introduced it: https://github.com/scikit-hep/uproot5/commit/13087b0f6d179bb8982249d8fda8cda5f3b8e44c
Indeed, it seems to be related to the fsspec source: if I do `uproot.open(..., handler=uproot.source.file.MemmapSource)`, there is no performance regression.
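A sketch of that comparison, reusing the `test.root` from the reproducer above (the second call passes the memmap handler explicitly; the timings printed are whatever your machine gives):

```python
import time
import uproot

def timed_load(filename, **open_kwargs):
    # Time only the read of all branches into a pandas DataFrame.
    with uproot.open({filename: "tree"}, **open_kwargs) as tree:
        start = time.perf_counter()
        tree.arrays(library="pd")
        return time.perf_counter() - start

print("default source:", timed_load("test.root"))
print("memmap source: ", timed_load("test.root", handler=uproot.source.file.MemmapSource))
```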
@nsmith-, do you see this?
Also, @nikoladze, is `library="pd"` necessary? Since this is flat data (not ragged), `library="np"` should be high-performance.
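For flat data like this, `library="np"` returns a plain dict of NumPy arrays, and building a DataFrame from it (if one is still wanted) is a cheap final step. A sketch, assuming the `test.root` from the reproducer:

```python
import pandas as pd
import uproot

with uproot.open({"test.root": "tree"}) as tree:
    arrays = tree.arrays(library="np")  # dict: branch name -> NumPy array
    df = pd.DataFrame(arrays)           # optional, only if a DataFrame is needed
```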
There shouldn't be a connection between the read-coalesce feature (which is low-level, affecting how baskets are read) and the Pandas output (which is high-level; the NumPy arrays are wrapped as Pandas Series/into a DataFrame as a last step). If a performance regression depends on both PR #1198 and `library="pd"`, rather than just one of these, then it's harder to understand.
The performance regression can also be seen with `library="np"`, so it does not seem to be related to the pandas conversion.
It seems this property is taking up a significant amount of time when there are many ranges.
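One way to see this kind of hotspot is to profile the read and sort by cumulative time, which makes the expensive frames stand out (a generic profiling sketch, not tied to any particular internal name):

```python
import cProfile
import pstats
import uproot

def load(filename):
    with uproot.open({filename: "tree"}) as tree:
        return tree.arrays(library="np")

# Profile a full read of the many-basket test file and show the top entries.
cProfile.run('load("test.root")', "read.prof")
pstats.Stats("read.prof").sort_stats("cumulative").print_stats(20)
```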
If that's the case, it's ironic, since the chunk-coalescing was intended to speed up remote processing by having a smaller number of requests in flight. You're saying that it's slowing down local request-preparation by working too hard to make a smaller set of things to request?
Ah, but the motivation for having fewer requests is that networks can be latency-limited (number of requests matters more, relative to the amount of data transferred). That's not relevant for local files.
Independently of this problem, I found that `uproot.MemmapSource` scales better in GIL-free Python 3.13 (talk today). From that alone, I've been thinking that we should consider restoring `uproot.MemmapSource` as the default for local files (and keep using fsspec as the default for remote files, with chunk-coalescing). But this issue could be a second reason.

If so, I suppose this issue could be closed by actually making `uproot.MemmapSource` the default.
I'm not sure whether a change like that would require a new major version of Uproot (version 6.0.0!) or a minor one (5.5.0). Actually, no: it should be a minor one. It changes the performance behavior of Uproot, and it's technically a different interface (has different attributes), but for the level of depth that most users look, it's the same API—just runs faster.
I think the performance issue can be fixed relatively easily in the coalesce module (I can open a PR); this may also benefit remote reading.
Independent of that, I can also imagine that `uproot.MemmapSource` makes sense for local files: the operating system's page cache and potential read-ahead mechanisms sit in between, so things like chunk coalescing should not be necessary there.
Ok, we can probably just cache the cluster stop in this subroutine and reduce this bottleneck.
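The class and property names below are made up for illustration (not uproot's actual coalesce internals); the sketch only shows the kind of fix meant here: compute the derived value once and reuse it, instead of rescanning all ranges on every access.

```python
from functools import cached_property

class RangeGroup:
    """Illustrative stand-in for a coalesced group of byte ranges."""

    def __init__(self, ranges):
        self._ranges = ranges  # list of (start, stop) byte offsets

    @cached_property
    def stop(self):
        # Computed once on first access and cached afterwards, rather than
        # scanning every range each time the property is read.
        return max(stop for _, stop in self._ranges)
```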
On the subject of MemmapSource, recall that the main issue with it is that each such memmapped file counts against the user's vsize ulimit, and some sites have a ulimit on that which cannot be adjusted.
Oh yeah, that's right.
Hopefully, when I repeat the performance plots I showed today, we'll find that the fsspec-memmap differences are not too big, so that we're not in the awkward position of having to choose between the best parallel processing speed (which is irrelevant to single-threading) and the least memory (according to one definition, which counts memmapped data). Ultimately, it's a configurable parameter so both parties can be happy, but it can be hard to discover that a particular parameter is relevant for one's problem.
Hello.
I run a script to load a TTree as a DataFrame with 5 million rows and 3 columns, and measure the time taken. Here's an MWE for ipython:
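A minimal sketch of that kind of measurement (the file name "large.root" and tree name "tree" here are placeholders, not the original ones):

```python
import uproot

def load(filename):
    # Flat TTree with 5 million rows and 3 branches, read into a pandas DataFrame.
    with uproot.open({filename: "tree"}) as tree:
        return tree.arrays(library="pd")

# In IPython:
# %timeit -n 1 -r 7 load("large.root")
```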
For uproot 5.3.7 with pandas 2.2.2: 19.2 s ± 177 ms per loop
For uproot 4.1.9 with pandas 1.3.5: 1.61 s ± 9.27 ms per loop
The values reported are mean ± std. dev. of 7 runs, 1 loop each. We can see that uproot5 is roughly an order of magnitude slower than uproot4. The discrepancy increases further when more columns are loaded, leading to more than ten minutes for what previously took seconds.
I just want to know if there has been any major change that could cause such an increase in load time when reading a large TTree into a DataFrame. Also, if there are any more checks I should do before reaching a conclusion, that would be super helpful.
Thanks a lot for your time.