plasticityai / magnitude

A fast, efficient universal vector embedding utility package.
MIT License

A minor change needed in the most_similar function when used by vector #29

Closed ParikhKadam closed 6 years ago

ParikhKadam commented 6 years ago

Run the following to reproduce the issue:

```python
from pymagnitude import Magnitude

glove = Magnitude("path/to/glove.6B.300d.magnitude")
print(glove.most_similar("cat", topn=2))  # Most similar by key
print(glove.most_similar(glove.query("cat"), topn=2))  # Most similar by vector
```

Output will be as follows:

```
[('dog', 0.6816746), ('cats', 0.68158376)]
[('cat', 1.0), ('dog', 0.6816746)]
```

As one can clearly see, most_similar works as expected when called with a key. But when it is passed that word's vector, it returns the same word as the top result. This should not be the case. A minor modification should be made so that the query word itself is excluded from the output.

AjayP13 commented 6 years ago

I disagree. I think this is useful. There is no way for the library to know that you already know the vector you are giving it is of the key "cat". If a neural network ends up predicting a certain vector exactly, you would want to know what that vector's key is by looking it up with most_similar. If you don't care about the first result for your particular application, you can just ignore any result with 1.0 in your application code.
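
As a rough illustration of filtering in application code, here is a minimal sketch. The helper name `drop_exact_matches` and the tolerance `tol` are hypothetical and not part of pymagnitude; the sample list is the by-vector output shown above.

```python
def drop_exact_matches(results, tol=1e-6):
    """Drop (key, similarity) pairs whose similarity is effectively 1.0.

    `results` is a list of (key, similarity) tuples, the shape shown in
    the output above. `tol` is an assumed tolerance for float rounding.
    """
    return [(key, sim) for key, sim in results if sim < 1.0 - tol]


# By-vector result copied from the output above.
by_vector = [("cat", 1.0), ("dog", 0.6816746)]
filtered = drop_exact_matches(by_vector)
print(filtered)  # [('dog', 0.6816746)]
```

If the application needs a fixed number of neighbours after filtering, one option is to request `topn + 1` results and trim the list afterwards.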

ParikhKadam commented 6 years ago

So why doesn't most_similar by key return the same word? That would be useful too. In short, I am saying that the answer to both of these calls must be the same. A change is surely needed.

AjayP13 commented 6 years ago

Because you are explicitly giving the most_similar function the key "cat" in the first example, and you are not explicitly giving it the key "cat" in the second example. In the first example, the library already knows that you know "cat" exists, so there is no need to return the same key again. In the second example, the library does not know that you already know the key "cat" exists. You are making two function calls there: one for query and then another for most_similar.
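
The difference between the two call paths can be sketched with a toy in-memory lookup. This is only an illustration under assumed names (`ToyVectors`, `cosine`, and the three made-up 2-d vectors are all hypothetical); it is not how Magnitude is implemented.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


class ToyVectors:
    """Hypothetical stand-in for an embedding store (not the real library)."""

    def __init__(self, table):
        self.table = table  # maps key -> vector

    def query(self, key):
        return self.table[key]

    def most_similar(self, positive, topn=2):
        if isinstance(positive, str):
            # Called by key: the caller already knows this key, so skip it.
            vector, exclude = self.table[positive], {positive}
        else:
            # Called by vector: no key is known, so nothing can be skipped.
            vector, exclude = positive, set()
        scored = [
            (key, cosine(vector, vec))
            for key, vec in self.table.items()
            if key not in exclude
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:topn]


toy = ToyVectors({"cat": [1.0, 0.0], "dog": [0.8, 0.6], "car": [0.0, 1.0]})
by_key = toy.most_similar("cat", topn=2)
by_vector = toy.most_similar(toy.query("cat"), topn=2)
print([key for key, _ in by_key])     # ['dog', 'car']
print([key for key, _ in by_vector])  # ['cat', 'dog']
```

In this sketch the by-vector call receives only a list of numbers, so it has no key to exclude and the exact match comes back first.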

ParikhKadam commented 6 years ago

Got it, bro. Thanks. I was wrong. Great logic applied here.