Slow reading later entries from file

scikit-hep / uproot3

ROOT I/O in pure Python and NumPy.

BSD 3-Clause "New" or "Revised" License


Slow reading later entries from file #503

Closed romanovzky closed 4 years ago

romanovzky commented 4 years ago

Hi there,

I have a ROOT file with 1179372 objects (TH2D). I first open it with uproot.rootio.open

file = uproot.rootio.open(FILE_PATH)

which takes around 1 minute, which is not super fast but acceptable. I then noticed that processing the lines was getting slower and slower as it went from the beginning to the end and decided to investigate it further

first_key = b'Sample_10000002;1' #  This is the first TH2D
last_key = b'SamplePt_11099962;1' # This is the last TH2D

%%timeit
file[first_key]
626 µs ± 1.99 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%%timeit
file[last_key]
189 ms ± 12.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So reading is roughly 300 times slower (about 2.5 orders of magnitude) for later entries of the file.

Is this expected behaviour? What should I do to mitigate this issue?

Cheers

tamasgal commented 4 years ago

That's interesting. I have never seen such long loading times, and I work with very large files (although I do not often work with TH2D); however, those usually have just a few branches. It seems you have hundreds of thousands of keys? The way the keys (and the corresponding data) are traversed in uproot is presumably not ideal for this structure, but it also seems to me that the way the data is organized in your ROOT file is not ideal either. I don't want to judge that without seeing the file first, though.

Is there a way to make that file available to us? Without direct access to the file, it's hard to tell why uproot is taking that long, both for opening the file and for accessing the keys.

jpivarski commented 4 years ago

First and last sequentially in the file, like the order of TKeys in the TDirectory? That could be because the TKeys are stored as a list and the search for a matching name is a linear walk. For TBranch lookup in Uproot 4, I added an auxiliary dict to make repeated gets faster (I noticed that's how people were using it), but the same might be true of objects in TDirectories. I could add a dict here, too, if the usage pattern is to repeatedly request histograms.

(To notice a difference due to the linear search, you must have a lot of histogram objects in the directory. Thinking about the way ROOT stores histograms (each is fully independent with lots of metadata, independently compressed), there must be more efficient ways to store and read back large collections of histograms with mostly identical metadata (i.e. store binning once). But that's a different thing.)
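The linear-walk explanation above can be illustrated with a toy model. This is only a sketch and not uproot's actual code: the key list is a plain Python list of `(name, payload)` tuples, and `find_linear` and `build_index` are hypothetical helpers made up for the illustration. It shows why a lookup by name gets slower for later keys when the keys live in a list, and how an auxiliary dict removes that dependence on position.

```python
# Toy model only: a plain list stands in for a directory's key list.
# find_linear and build_index are hypothetical names, not uproot functions.

def find_linear(keys, name):
    """Walk the list until the name matches; cost grows with the key's position."""
    for key_name, payload in keys:
        if key_name == name:
            return payload
    raise KeyError(name)


def build_index(keys):
    """One pass over the list; later lookups by name no longer depend on position."""
    index = {}
    for key_name, payload in keys:
        # Keep the first match for a name, as the linear walk would.
        index.setdefault(key_name, payload)
    return index


# 100000 made-up key names; the payload is just the key's position.
keys = [("hist_%07d" % i, i) for i in range(100000)]
index = build_index(keys)

first_name, last_name = keys[0][0], keys[-1][0]

# Both strategies return the same payload; only the cost differs.
assert find_linear(keys, first_name) == index[first_name]
assert find_linear(keys, last_name) == index[last_name]
```

Timing `find_linear(keys, last_name)` against `index[last_name]` should show the same kind of gap reported in the issue, although the absolute numbers for a real file also include reading and decompressing each object.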

romanovzky commented 4 years ago

Sorry for the delayed reply guys.

I've refactored my code to use the iteration generator, which partially solved my issue by reducing the running time of the overall task by over an order of magnitude. So that's done.

Indeed this is not a usual data format, but we are doing something unusual.

Thanks for your hard work with uproot!
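
For readers wondering what the switch to iteration might look like: the comment above does not say which iterator was used, so the following is a sketch under the assumption that the directory object exposes an `iteritems()` generator yielding `(name, object)` pairs. `FakeDirectory` and `summarize` are hypothetical stand-ins so the example runs without a ROOT file; the payloads are plain lists rather than histograms.

```python
# Sketch only: FakeDirectory is a hypothetical stand-in for an opened file,
# assuming the real object offers an iteritems() generator of (name, object) pairs.

class FakeDirectory:
    def __init__(self, entries):
        self._entries = list(entries)

    def iteritems(self):
        # Yield every entry once, in storage order, with no lookup by name.
        for name, obj in self._entries:
            yield name, obj


def summarize(directory):
    """Visit each (name, object) pair once instead of calling directory[name] per key."""
    return {name: sum(obj) for name, obj in directory.iteritems()}


# Made-up key names in the style of the issue; payloads are small lists.
directory = FakeDirectory((b"Sample_%d;1" % i, [i, i + 1]) for i in range(5))
totals = summarize(directory)
```

The point of the pattern is that a single pass touches each key once, whereas repeated lookups by name may each restart a search from the beginning of the key list.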