Closed kratsg closed 9 months ago
You beat me to it, so here's the additional content I was writing:
Instead, ReadOnlyDirectory.__contains__
should
def __contains__(self, where):
if not hasattr(self, "_cached_keys"):
self._cached_keys = set(self.iterkeys()) | set(self.iterkeys(cycle=False))
return where in self._cached_keys
Thus, any string where
that can be used in __getitem__
, which is in the keys()
with cycle numbers or without cycle numbers, would return True from __contains__
.
Then __contains__
would be the fast function we expect it would be.
In addition, the damerau_levenshtein
function should be skipped when there's a very large number of keys. I haven't looked at the implementation recently, but I'll bet it's O(n²).
In pyhf
where we have histograms in two possible paths (because of how HistFactory works) - avoiding the DeserializationError
by comparing against a (cached) set of keys gives us a speedup from 20hrs down to ~10seconds.
Error handling code, for errors that are supposed to propagate to the user, need not be fast. As long as the error gets raised on a human timescale (< 1 second), it's fine.
It's only a problem when errors are caught as non-exceptional cases (e.g. StopIteration
). The fact that try-except allows us to use exceptions for these two very different purposes puts "Do I have enough information in the error message?" versus "Is this fast enough?" into conflict.
This issue was fixed in #595
There is a
damerau_levenshtein
function being called when we hit a missing key that takes a long time when the number of keys in a file is very large (https://github.com/scikit-hep/uproot4/blob/85f219a36e76dffc18da4756227a7beb760657a0/src/uproot/_util.py#L810-L858).Similarly,
potential_name in directory
is also not the best: https://github.com/scikit-hep/uproot4/blob/85f219a36e76dffc18da4756227a7beb760657a0/src/uproot/reading.py#L1913-L1919For now, we'll have to call
directory.keys()
and store that in a set for valid key lookups.Originally posted by @kratsg in https://github.com/scikit-hep/pyhf/discussions/1687#discussioncomment-1621078