Closed: romanovzky closed this issue 4 years ago
That's interesting. I have never seen such long loading times, and I work with very large files (although I do not work with TH2D often); however, those usually have just a few branches. It seems you have hundreds of thousands of keys? The way the keys (and the corresponding data) are traversed in uproot is presumably not ideal for this structure, but to me it also seems like the way the data is managed in your ROOT file is not ideal either. I don't want to judge that without seeing the file first.
Is there a way to make that file available to us somehow? Without direct access to the file, it's hard to tell why uproot is taking that long, both for opening it and for accessing the keys.
First and last sequentially in the file, like the order of TKeys in the TDirectory? That could be because the TKeys are stored as a list and the search for a matching name is a linear walk. For TBranch lookup in Uproot 4, I added an auxiliary dict to make repeated gets faster (I noticed that's how people were using it), but the same might be true of objects in TDirectories. I could add a dict here, too, if the usage pattern is to repeatedly request histograms.
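A toy sketch of why repeated gets degrade toward the end of the file (illustrative only, not uproot's actual internals):

```python
# Finding a name in a list of keys is a linear walk, so late keys cost
# more than early ones; an auxiliary dict built once makes every later
# lookup constant-time regardless of position.
keys = [f"h_{i}" for i in range(1_179_372)]

def get_linear(name):
    for key in keys:          # what a TKey-list search amounts to
        if key == name:
            return key        # early keys return fast, late keys slowly
    return None

index = {key: pos for pos, key in enumerate(keys)}  # built once, O(n)

def get_indexed(name):
    return keys[index[name]]  # O(1), independent of position in the file
```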
(To notice a difference due to the linear search, you must have a lot of histogram objects in the directory. Thinking about the way ROOT stores histograms (each is fully independent with lots of metadata, independently compressed), there must be more efficient ways to store and read back large collections of histograms with mostly identical metadata (i.e. store binning once). But that's a different thing.)
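As a rough illustration of the "store the binning once" idea (a purely hypothetical layout, not something ROOT or uproot provides):

```python
import numpy as np

# Hypothetical layout: if all TH2Ds share the same axes, the binning can
# be stored once and the bin contents packed into a single 3-D array of
# shape (n_histograms, n_x_bins, n_y_bins).
n_histograms = 1000                          # would be ~1.2M in the real file
x_edges = np.linspace(0.0, 1.0, 51)          # assumed shared x binning
y_edges = np.linspace(0.0, 1.0, 51)          # assumed shared y binning
contents = np.zeros((n_histograms, 50, 50))  # one slab instead of many objects

# Compressing and reading one large array is far cheaper than handling
# over a million independently compressed histograms, each carrying its
# own copy of the same metadata.
```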
Sorry for the delayed reply, guys.
I've refactored my code to use the iteration generator, which partially solved my issue by reducing the running time of the overall task by over an order of magnitude. So that's done.
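Something along these lines (a sketch assuming uproot 3's `iteritems()` generator; the filename and `process()` are placeholders):

```python
import uproot

def process(name, counts):
    """Placeholder for the per-histogram analysis step."""
    pass

f = uproot.open("histos.root")    # placeholder filename

# One sequential pass over the directory instead of one f[name] lookup
# per histogram, so the linear key search happens once, not ~1.2M times.
for name, hist in f.iteritems():  # uproot 3's lazy key/object iterator
    process(name, hist.values)    # .values: bin contents as a numpy array
```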
Indeed this is not a usual data format, but we are doing something unusual.
Thanks for your hard work on uproot!
Hi there,

I have a ROOT file with 1179372 objects (TH2D). I first open it with `uproot.rootio.open`, which takes around 1 minute; not super fast, but acceptable. I then noticed that processing was getting slower and slower as it went from the beginning to the end of the file, and decided to investigate further.
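A simplified sketch of the kind of check I ran (placeholder filename; the actual timings are omitted here):

```python
import time
import uproot

f = uproot.rootio.open("histos.root")  # placeholder filename

names = f.keys()                       # all TKey names in the directory
for name in (names[0], names[-1]):     # one early key, one late key
    start = time.time()
    f[name]                            # fetch and deserialize the object
    print(name, time.time() - start)
```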
So there is almost three orders of magnitude of degradation in I/O read speed for the later entries of the file.
Is this expected behaviour? What should I do to mitigate this issue?
Cheers