luozhouyang / python-string-similarity

A library implementing different string similarity and distance measures using Python.
MIT License

Find most similar word among target and list of words #2

Closed alberduris closed 5 years ago

alberduris commented 5 years ago

First of all, let me congratulate the dev on this amazing library.

I was wondering whether there is a function that finds the word in a vocabulary that is most similar to a target word. For example:

Target word: tsring
Vocabulary: ['hello', 'world', 'string', 'foo', 'bar']

So maybe something like:

jw = JaroWinkler()
jw.most_similar('tsring', ['hello', 'world', 'string', 'foo', 'bar'])
[1] 'string'

I've tried passing the list to the distance and similarity methods in the same way. No error is thrown, but the operation does not seem to be supported: the results are always 1.0 and 0.0.

jw.distance('tsring', ['hello', 'world', 'string', 'foo', 'bar'])
[1] 1.0
jw.similarity('tsring', ['hello', 'world', 'string', 'foo', 'bar'])
[2] 0.0

I know it's trivial to write a standalone function with this behavior on top of the distance or similarity functions, but I'm asking just in case a highly optimized one is already implemented :)

Thanks in advance!

alberduris commented 5 years ago

In case someone is interested in doing it the naïve way (as a workaround):

Function definition:

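import numpy as np

def most_similar(target, vocab, method):
    # Score every vocabulary word against the target.
    sims = [method.similarity(target, word) for word in vocab]
    # Return the word with the highest similarity (the first one on ties).
    return vocab[np.argmax(sims)]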

Usage:

target = 'tsring'
vocab = ['hello', 'world', 'string', 'foo', 'bar']
jw = JaroWinkler()

most_similar(target, vocab, jw)
[1] 'string'

luozhouyang commented 5 years ago

You can always build high-level APIs on top of similarity and distance yourself.
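
For anyone landing here later, a minimal sketch of such a high-level helper without the NumPy dependency might look like the following. It only assumes that the `method` object has a `similarity(s0, s1)` method, as the library's measures in this thread do. `RatioSimilarity` is a hypothetical stand-in built on the standard library's `difflib.SequenceMatcher` so that the snippet runs on its own; it is not part of this library and uses a different metric than Jaro-Winkler.

```python
from difflib import SequenceMatcher


class RatioSimilarity:
    """Hypothetical stand-in with the same similarity(s0, s1) interface.

    Not part of the library; it only makes this sketch self-contained.
    """

    def similarity(self, s0, s1):
        return SequenceMatcher(None, s0, s1).ratio()


def most_similar(target, vocab, method):
    """Return the (word, score) pair from vocab with the highest similarity."""
    if isinstance(vocab, str):
        # Guard against passing a single string where a list of words is expected.
        raise TypeError("vocab must be a collection of words, not a single string")
    # Compare the target against each word individually, never against the whole list.
    scored = [(word, method.similarity(target, word)) for word in vocab]
    if not scored:
        raise ValueError("vocab is empty")
    # max() keeps the first maximal pair, so ties go to the earliest word.
    return max(scored, key=lambda pair: pair[1])


best_word, best_score = most_similar(
    "tsring", ["hello", "world", "string", "foo", "bar"], RatioSimilarity()
)
print(best_word)  # string
```

To use one of the library's measures instead, pass an instance such as the `JaroWinkler()` object from the snippets above in place of `RatioSimilarity()`. Returning the score alongside the word makes it easy to reject weak matches with a threshold.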