It might also be important to mention the differences in the pandas versions in order to reproduce the results?
I've also noticed the same issue with `uproot.open('someNANOAOD12.root:Events').arrays(lib='ak')`. This file has 680 entries and a size of 4.053 MB. I am timing only the operation of getting the arrays from the tree object, and it takes 20.8 seconds.

I never had such issues before with much larger files on an older version of uproot (I think it was still 5, but an earlier release); I am now using version 5.4.1. I am not sure whether it's an issue with uproot or awkward, but I lean toward uproot, since my awkward version is pinned in the virtual env.
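For reference, a minimal sketch of this kind of timing (the file name is taken from the description above as a placeholder; only the array-reading step is measured):

```python
import time
import uproot

# Hypothetical NanoAOD file; uproot.open("file.root:Events") returns the TTree.
with uproot.open("someNANOAOD12.root:Events") as events:
    start = time.perf_counter()
    arrays = events.arrays(library="ak")  # read all branches as an awkward Array
    print(f"arrays() took {time.perf_counter() - start:.1f} s")
```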
Since this popped up again in a Belle II internal question, I'll post a reproducer here:
The issue is mainly visible for TTrees with many baskets. Using fewer/larger baskets seems to mitigate the issue, but maybe there is still something that can be improved on the uproot side. So to reproduce we can make a TTree with many baskets:
```python
import uproot
import numpy as np

num_baskets = 1_000
num_per_basket = 1_000
num_branches = 50

with uproot.recreate("test.root") as f:
    for i_chunk in range(num_baskets):
        # each call to extend() writes one new basket per branch
        chunk = {f"branch{i}": np.random.rand(num_per_basket) for i in range(num_branches)}
        if "tree" in f:
            f["tree"].extend(chunk)
        else:
            f["tree"] = chunk
```
This TTree can be read with uproot into a pandas DataFrame, in both uproot4 and uproot5, e.g. with:
```python
import uproot
import numpy
import time

def load(filename):
    with uproot.open({filename: "tree"}) as tree:
        return tree.arrays(library="pd")
```
For the runtime I get something like the following on my laptop:

uproot4:

```
>>> start = time.time(); df = load("test.root"); print(time.time() - start)
5.7701661586761475
```

uproot5:

```
>>> start = time.time(); df = load("test.root"); print(time.time() - start)
71.41624069213867
```
Note: I struggled a bit getting an environment running with uproot4. I didn't have any success with plain venv/pip (the automatically resolved numpy version is incompatible, and I wasn't able to install an older one), but it worked with conda-forge/pixi with the following setup:
```
pixi init uproot4
cd uproot4
pixi add python 'uproot==4' 'numpy<1.20' 'pandas<2' setuptools
pixi run python
```
Update: I found out that the performance regression came in somewhere between uproot versions 5.3.3 (which still seems to be as fast as v4) and 5.4.1.
Note to self: I'm looking at this. https://github.com/scikit-hep/uproot5/compare/v5.3.3...v5.4.1
git bisect tells me this commit introduced it: https://github.com/scikit-hep/uproot5/commit/13087b0f6d179bb8982249d8fda8cda5f3b8e44c
Indeed, it seems to be related to the fsspec source: if I do `uproot.open(..., handler=uproot.source.file.MemmapSource)`, there is no performance regression.
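A sketch of that comparison, reusing the `test.root` from the reproducer above (the second call passes the memmap handler explicitly; the timings printed are whatever your machine gives):

```python
import time
import uproot

def timed_load(filename, **open_kwargs):
    # Time only the read of all branches into a pandas DataFrame.
    with uproot.open({filename: "tree"}, **open_kwargs) as tree:
        start = time.perf_counter()
        tree.arrays(library="pd")
        return time.perf_counter() - start

print("default source:", timed_load("test.root"))
print("memmap source: ", timed_load("test.root", handler=uproot.source.file.MemmapSource))
```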
@nsmith-, do you see this?
Also, @nikoladze, is `library="pd"` necessary? Since this is flat data (not ragged), `library="np"` should be high-performance.
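For flat data like this, `library="np"` returns a plain dict of NumPy arrays, and building a DataFrame from it (if one is still wanted) is a cheap final step. A sketch, assuming the `test.root` from the reproducer:

```python
import pandas as pd
import uproot

with uproot.open({"test.root": "tree"}) as tree:
    arrays = tree.arrays(library="np")  # dict: branch name -> NumPy array
    df = pd.DataFrame(arrays)           # optional, only if a DataFrame is needed
```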
There shouldn't be a connection between the read-coalesce feature (which is low-level, affecting how baskets are read) and the Pandas output (which is high-level; the NumPy arrays are wrapped as Pandas Series/into a DataFrame as a last step). If a performance regression depends on both PR #1198 and `library="pd"`, rather than just one of these, then it's harder to understand.
The performance regression can also be seen with `library="np"`, so it does not seem to be related to the pandas conversion.
It seems this property is taking up a significant amount of time when there are many ranges.
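One way to see this kind of hotspot is to profile the read and sort by cumulative time, which makes the expensive frames stand out (a generic profiling sketch, not tied to any particular internal name):

```python
import cProfile
import pstats
import uproot

def load(filename):
    with uproot.open({filename: "tree"}) as tree:
        return tree.arrays(library="np")

# Profile a full read of the many-basket test file and show the top entries.
cProfile.run('load("test.root")', "read.prof")
pstats.Stats("read.prof").sort_stats("cumulative").print_stats(20)
```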
If that's the case, it's ironic, since the chunk-coalescing was intended to speed up remote processing by having a smaller number of requests in flight. You're saying that it's slowing down local request-preparation by working too hard to make a smaller set of things to request?
Ah, but the motivation for having fewer requests is that networks can be latency-limited (number of requests matters more, relative to the amount of data transferred). That's not relevant for local files.
Independently of this problem, I found that `uproot.MemmapSource` scales better in GIL-free Python 3.13 (talk today). From that alone, I've been thinking that we should consider restoring `uproot.MemmapSource` as the default for local files (and keep using fsspec as the default for remote files, with chunk-coalescing). But this issue could be a second reason.

If so, I suppose this issue could be closed by actually making `uproot.MemmapSource` the default.
I'm not sure whether a change like that would require a new major version of Uproot (version 6.0.0!) or a minor one (5.5.0). Actually, no: it should be a minor one. It changes the performance behavior of Uproot, and it's technically a different interface (has different attributes), but for the level of depth that most users look, it's the same API—just runs faster.
I think the performance issue can be fixed relatively easily in the coalesce module (I can open a PR); this may also benefit remote reading.
Independent of that, I can also imagine that `uproot.MemmapSource` makes sense for local files: the operating system's page cache and potential read-ahead mechanisms sit in between, so things like chunk coalescing should not be necessary there.
Ok, we can probably just cache the cluster stop in this subroutine and reduce this bottleneck.
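The class and property names below are made up for illustration (not uproot's actual coalesce internals); the sketch only shows the kind of fix meant here: compute the derived value once and reuse it, instead of rescanning all ranges on every access.

```python
from functools import cached_property

class RangeGroup:
    """Illustrative stand-in for a coalesced group of byte ranges."""

    def __init__(self, ranges):
        self._ranges = ranges  # list of (start, stop) byte offsets

    @cached_property
    def stop(self):
        # Computed once on first access and cached afterwards, rather than
        # scanning every range each time the property is read.
        return max(stop for _, stop in self._ranges)
```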
On the subject of MemmapSource, recall that the main issue with it is that each such memmapped file counts against the user's vsize ulimit, and some sites have a ulimit on that which cannot be adjusted.
Oh yeah, that's right.
Hopefully, when I repeat the performance plots I showed today, we'll find that the fsspec-memmap differences are not too big, so that we're not in the awkward position of having to choose between the best parallel processing speed (which is irrelevant to single-threading) and the least memory (according to one definition, which counts memmapped data). Ultimately, it's a configurable parameter so both parties can be happy, but it can be hard to discover that a particular parameter is relevant for one's problem.
Hello.
I run a script to load a TTree as a DataFrame with 5 million rows and 3 columns, and measure the time taken. Here's an MWE for ipython:
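A minimal sketch of that kind of measurement (the file name "large.root" and tree name "tree" here are placeholders, not the original ones):

```python
import uproot

def load(filename):
    # Flat TTree with 5 million rows and 3 branches, read into a pandas DataFrame.
    with uproot.open({filename: "tree"}) as tree:
        return tree.arrays(library="pd")

# In IPython:
# %timeit -n 1 -r 7 load("large.root")
```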
For uproot 5.3.7 with pandas 2.2.2: 19.2 s ± 177 ms per loop
For uproot 4.1.9 with pandas 1.3.5: 1.61 s ± 9.27 ms per loop
The values reported are mean ± std. dev. of 7 runs, 1 loop each. We can see that uproot5 is roughly an order of magnitude slower than uproot4. The discrepancy increases further when more columns are loaded, leading to more than ten minutes for what previously took seconds.
I just want to know if there has been any major change that could cause such an increase in load time when reading a large TTree into a DataFrame. Also, if there are any more checks I should do before reaching a conclusion, that would be super helpful.
Thanks a lot for your time.