rapidfuzz / RapidFuzz

Rapid fuzzy string matching in Python using various string metrics
https://rapidfuzz.github.io/RapidFuzz/
MIT License
2.63k stars 116 forks source link

Damerau-Levenshtein with character-dependent weights #241

Open pulsar70 opened 2 years ago

pulsar70 commented 2 years ago

In relation to issue #180:

It would be great to generalize the foreseen implementation of the Dameau-Levenshtein distance to weights depending not only on the nature of the operation (insert/delete/substitute/transpose), but also on the specific character pairs involved in the operation.

Such weights could be either read from ressource files directly by the C/C++ implementation code, or provided as parameters (Python dictionaries) by the calling Python function.

This would be a very useful and not-so-complicated first step in the direction of context-sensitive string distance functions.

maxbachmann commented 2 years ago

I think this should start off with a similar interface to the levenshtein distance:

distance(s1, s2, *, weights=(1,1,1))

which should be extended by support for float. In addition this could support a custom dict like structure for character weights which behaves similar to:

class PairwiseWeights:
    ...

a = PairwiseWeights()
a["a", "b"] = 0.75

The pure Python version will likely implement this using a dictionary, while for the C++ implementation an efficient storage structure is important. This could look something like this:

template<typename T>
class PairwiseWeights
{
    std::array<std::array<T, 256>, 256> m_extendedAscii;
    my_hashmap<std::tuple<int64_t, int64_t>, T> m_map;
};

This has an overhead of around 256kb for the extended ascii matrix, which would make the common case of char1 < 256 and char2 < 256 very fast. For the hashmap something similar similar to the Python dict using perturbation would probably work well.

Edit: this needs a bit of thought on how to name the containers. For now we probably need one for Substitutions/Transpositions and one for Insertions/Deletions

pulsar70 commented 2 years ago

Yes, you clearly need an efficient storage + fast retrieval (hashtable-type) structure.

As far as I understand, the existing distance implementations work on any unicode strings? Then why limit the character range to extended ASCII? One should be able to provide weights for any Unicode points pair, in my opinion.

maxbachmann commented 2 years ago

As far as I understand, the existing distance implementations work on any unicode strings? Then why limit the character range to extended ASCII? One should be able to provide weights for any Unicode points pair, in my opinion.

This continues to be true. It is a mixed structure:

template<typename U, typename K>
T get(U ch1, K ch2)
{
    if (ch1 < 256 && ch2 < 256)
        return m_extendedAscii[ch1][ch2];
   return m_map.get(ch1, ch2);
}

since in many cases strings are ascii only this can improve the access times significantly, while it still supports any hashable type. Especially since Python strings are stored as uint8_t*/uint16_t*/uint32_t* depending on the max char in the string. So this will be optimized to:

template<typename U, typename K>
T get(U ch1, K ch2)
{
    return m_extendedAscii[ch1][ch2];
}

for strings with all characters < 256

maxbachmann commented 2 years ago

Note that this will likely take a while to get implemented, since this needs to be pretty generic to cover most use cases. Some examples are:

How do those interact, since you probably want to be able to mix them to some extent. An example is a keyboard layout:

Would it be possible to combine e.g. character dependent and position dependent weights? This might be useful sometimes.

In addition I am unsure how normalization would work with these weights?

pulsar70 commented 2 years ago

I was actually only suggesting the technical possibility to specify these weights, leaving the choice of their actual values to the package user (one could provide default ones to play with if really desired, of course).

I have my own custom-made C++ implementation of a Generalized Edit Distance with character-dependent weights,that are fully configurable from simple user-editable txt files. Default values are taken by all those character (pairs) no explicitly mentioned in the config. file, so it enables me to act upon only those values I want to depart from the standard weight.

Should you find it helpful, I would be glad to send you my implementation files. Seeing it properly cythonized in RapidFuzz (which I am technically unable to do myself) would be delightful.

maxbachmann commented 2 years ago

I was describing possible use cases for users of the API to decide on an API. I think for now it would be enough to add character dependent weights. If other weights need an API change this can be done in a later major release.

maxbachmann commented 1 year ago

Should you find it helpful, I would be glad to send you my implementation files. Seeing it properly cythonized in RapidFuzz (which I am technically unable to do myself) would be delightful.

Sure. Having a reference implementation would be helpful.