scikit-hep / uproot5

ROOT I/O in pure Python and NumPy.
https://uproot.readthedocs.io
BSD 3-Clause "New" or "Revised" License
239 stars 78 forks source link

KeyInFileError to show near matches? #119

Closed andrzejnovak closed 4 years ago

andrzejnovak commented 4 years ago

KeyInFileError currently shows a truncating "Known keys" list. This cool regarding the general format of what to ask for instead, but since we quite commonly have files with a large number of branches one still needs to go back to some kind of [key for key in f.keys() if 'bla' in key] to find the correct name.

It would be really cool if this would instead show closest matches first. I guess this could get computationally pricey, i.e. slow, so maybe it would only get computed if the number of keys was not too crazy?

jpivarski commented 4 years ago

This is a good idea. Since it's already in an exception, speed is not an issue (especially since the time it would take to sort a list of O(<1000) strings by similarity is not much compared to the other work that's going on).

I looked around for hints on how to do this and found a variety of packages, but we don't want to any new dependencies into Uproot, least of all for this task. (I'm imagining a dialogue: "Why is it so hard to install Uproot in my weird DAQ environment?" "Because you need a fuzz package to sort key mismatches by similarity in KeyErrors.") But I came across this recipe for manually calculating the Levenshtein similarity ratio:

https://www.datacamp.com/community/tutorials/fuzzy-string-python

It's a 24 line function. That's not a significant maintenance burden for what it offers. There's a natural place for it in the _util module, which consists of private functions that we can always replace for something else later. Since the output is not programmatic, we don't have to worry that someone's going to depend on the exact order; if we want to change it later, that's still an option. (The exact character strings of an error message is not part of the API.) Also, I wouldn't implement it with a NumPy array as they have it; I'd use a Python list. There are no array-at-a-time function calls here; the NumPy array is nothing but an impediment that might complicate the use of Unicode.

If we were to go with this, I'd still show as many matches as fit on the screen (i.e. the current cut-off), but just sorted by similarity so that the one the user was looking for is unlikely to be after the cut-off.

andrzejnovak commented 4 years ago

Yup, that's pretty much exactly the ask.

jpivarski commented 4 years ago

Just thinking it through and setting parameters (e.g. no new dependencies). I used conditional tense because I don't think I'll be doing it soon, but if it's still open someday when I'm looking for an hour-long project, I'll do it, or if someone wanted to take it up, they could. Actually, I'll label it as a "good first project" because it doesn't require deep knowledge of the machinery: it appears in only one place.

(I do want to move the KeyInFileError (and all code, including behavior_of) out of the main __init__ and just import them there, so that would interfere with this if they were being done at the same time. But that's not hard coordination.)

andrzejnovak commented 4 years ago

Just going off the algorithm on wiki, this is pretty straightforward

def damerau_levenshtein(a, b):
    M = [[0]*(len(a)+1) for i in range(len(b)+1)]

    for i in range(len(a)):
        M[i][0] = i
    for j in range(len(b)):
        M[0][j] = j

    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] == b[j]:
                cost = 0
            else:
                cost = 2
            M[i][j] = min(M[i-1][j] + 1,
                          M[i][j-1] + 1,
                          M[i-1][j-1] + cost
                             )
            if i > 1 and j > 1 and a[i] == b[j-1] and a[i-1] == b[j]:
                M[i][j] = min(M[i][j], M[i-2][j-2] + 1)

    return M[len(a)-1][len(b)-1]

print(damerau_levenshtein("Beer", "Love"))
print(damerau_levenshtein("Beer", "Bear"))
print(damerau_levenshtein("Beer", "Bee"))
print(damerau_levenshtein("Beer", "Beers"))
print(damerau_levenshtein("Beer", "Cola"))

>>> 3
>>> 2
>>> 1
>>> 1
>>> 4

Any concerns about putting this in a PR?

jpivarski commented 4 years ago

That looks good! As I said above, I think this should only be used to sort the list of keys, but since the list gets truncated, sorting it puts the most likely candidates before the cut-off. (So the cut-off is still determined by screen space, not misspell-likelihood. There are other reasons for having the wrong key name than misspellings.)

I also want to move KeyInFileError out of uproot4/__init__.py and into a named module. The uproot4/__init__.py file should only be for bringing classes and functions from submodules into a flat namespace that's easier for users to access. Perhaps as part of adding this, you could move the KeyInFileError to a new file called uproot4/exceptions.py and bring it into the flat namespace like everything else? That way, a future move doesn't cover up the attribution of your work.

Also, damerau_levenshtein belongs in uproot4/_util.py, since that's an implementation detail. We should leave the order of suggestions in this error message unspecified so that it's helpful, but not a backward-compatibility constraint on future versions.

If you want direct developer access to the repo, I can give you that, so that your PR is not across a fork.