scikit-hep / uproot5

ROOT I/O in pure Python and NumPy.
https://uproot.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Slow "contains" lookup and error message reporting for files with large amounts of keys #504

Closed kratsg closed 7 months ago

kratsg commented 2 years ago

There is a damerau_levenshtein function being called when we hit a missing key that takes a long time when the number of keys in a file is very large (https://github.com/scikit-hep/uproot4/blob/85f219a36e76dffc18da4756227a7beb760657a0/src/uproot/_util.py#L810-L858).

Similarly, potential_name in directory is also inefficient: https://github.com/scikit-hep/uproot4/blob/85f219a36e76dffc18da4756227a7beb760657a0/src/uproot/reading.py#L1913-L1919

For now, we'll have to call directory.keys() and store that in a set for valid key lookups.
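The workaround described above can be sketched in pure Python. `FakeDirectory` is a hypothetical stand-in for uproot's ReadOnlyDirectory (the real class reads keys from a ROOT file); the point is the access pattern: one O(n) pass to build a set, then O(1) membership tests.

```python
# Pure-Python sketch of the workaround: cache the directory's keys in a
# set once, then test membership against that set instead of calling
# `key in directory` repeatedly. FakeDirectory is a stand-in for
# uproot's ReadOnlyDirectory, not the real class.

class FakeDirectory:
    def __init__(self, keys):
        self._keys = list(keys)

    def keys(self):
        return list(self._keys)

    def __contains__(self, where):
        # Simulates the slow path: a fresh linear scan on every lookup.
        return where in self.keys()

directory = FakeDirectory(f"hist_{i};1" for i in range(100_000))

# Workaround: build the set once, then every lookup is O(1).
valid_keys = set(directory.keys())
assert "hist_42;1" in valid_keys
assert "missing;1" not in valid_keys
```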

Originally posted by @kratsg in https://github.com/scikit-hep/pyhf/discussions/1687#discussioncomment-1621078

jpivarski commented 2 years ago

You beat me to it, so here's the additional content I was writing:

Instead, ReadOnlyDirectory.__contains__ should

def __contains__(self, where):
    if not hasattr(self, "_cached_keys"):
        self._cached_keys = set(self.iterkeys()) | set(self.iterkeys(cycle=False))
    return where in self._cached_keys

Thus, any string where that can be used in __getitem__ (that is, any name in keys(), with or without a cycle number) would return True from __contains__.

Then __contains__ would be the fast function we expect it to be.
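A minimal, self-contained sketch of that caching idea: `MiniDirectory` is a hypothetical stand-in whose iterkeys(cycle=...) mimics uproot's by yielding names with and without ";cycle" suffixes, so both forms pass the membership test.

```python
# Sketch of the proposed cached __contains__; MiniDirectory is a
# stand-in for ReadOnlyDirectory, not uproot's actual class.

class MiniDirectory:
    def __init__(self, names):  # names without cycle numbers
        self._names = list(names)

    def iterkeys(self, cycle=True):
        # Mimics uproot: with cycle=True, names carry ";<cycle>".
        for name in self._names:
            yield f"{name};1" if cycle else name

    def __contains__(self, where):
        # Build the union of cycle/no-cycle names once, reuse forever.
        if not hasattr(self, "_cached_keys"):
            self._cached_keys = set(self.iterkeys()) | set(
                self.iterkeys(cycle=False)
            )
        return where in self._cached_keys

d = MiniDirectory(["pt", "eta"])
assert "pt" in d and "pt;1" in d  # both spellings accepted
assert "phi" not in d
```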

jpivarski commented 2 years ago

In addition, the damerau_levenshtein function should be skipped when there's a very large number of keys. I haven't looked at the implementation recently, but I'll bet it's O(n²).
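For context on why the fuzzy matching is expensive: a textbook (optimal string alignment) Damerau-Levenshtein distance is shown below. This is a generic sketch, not uproot's actual implementation. The DP table makes each pairwise comparison O(len(a) * len(b)), and producing a "did you mean" suggestion means running it once against every key in the file.

```python
# Textbook optimal-string-alignment Damerau-Levenshtein distance
# (a sketch, not uproot's implementation). Each call fills a
# (len(a)+1) x (len(b)+1) DP table, so cost is O(len(a) * len(b)).

def damerau_levenshtein(a, b):
    # dp[i][j] = edit distance between a[:i] and b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i
    for j in range(len(b) + 1):
        dp[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # substitution
            )
            if (
                i > 1 and j > 1
                and a[i - 1] == b[j - 2]
                and a[i - 2] == b[j - 1]
            ):
                # adjacent transposition
                dp[i][j] = min(dp[i][j], dp[i - 2][j - 2] + cost)
    return dp[len(a)][len(b)]

assert damerau_levenshtein("histo", "hitso") == 1  # one transposition
```

Running this against every key in a file with hundreds of thousands of keys is where the reported slowdown comes from.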

kratsg commented 2 years ago

In pyhf, where we have histograms in two possible paths (because of how HistFactory works), avoiding the DeserializationError by comparing against a (cached) set of keys gives us a speedup from 20 hours down to ~10 seconds.

jpivarski commented 2 years ago

Error handling code, for errors that are supposed to propagate to the user, need not be fast. As long as the error gets raised on a human timescale (< 1 second), it's fine.

It's only a problem when errors are caught as non-exceptional cases (e.g. StopIteration). The fact that try-except allows us to use exceptions for these two very different purposes puts "Do I have enough information in the error message?" versus "Is this fast enough?" into conflict.
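The two uses of exceptions can be contrasted in a short sketch (generic Python, with a plain dict standing in for a directory; the "did you mean" message construction is hypothetical):

```python
# Two uses of exceptions: (1) an error that propagates to a human may
# spend time building a helpful message; (2) StopIteration is caught as
# a non-exceptional signal on every loop, so it must stay cheap.

def lookup(directory, where):
    try:
        return directory[where]
    except KeyError:
        # Error path: runs once and reaches the user, so spending time
        # on a "did you mean ...?" message here is acceptable.
        close = [k for k in directory if k.startswith(where[:2])]
        raise KeyError(f"{where!r} not found; similar keys: {close!r}")

d = {"pt": 1, "eta": 2}
assert lookup(d, "pt") == 1

# Control-flow path: exhausting an iterator raises StopIteration, which
# `for` loops catch internally on every iteration; it carries no
# message and just means "done".
it = iter(d)
try:
    while True:
        next(it)
except StopIteration:
    pass  # non-exceptional: the iterator is simply exhausted
```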

ioanaif commented 7 months ago

This issue was fixed in #595