oborchers / Fast_Sentence_Embeddings

Compute Sentence Embeddings Fast!
GNU General Public License v3.0

Returning vectors with similarity above threshold for most_similar() #34

Open lucas-ubm opened 3 years ago

lucas-ubm commented 3 years ago

In sentencevectors.py, most_similar() can return the topn most similar sentences. However, it would be useful to be able to specify a similarity threshold above which sentences are returned. For this, topn could accept a fractional value: if topn is strictly smaller than 1, it is treated as a threshold; otherwise it works the same way as it does now.
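A minimal sketch of the proposed semantics, in plain Python. The helper name `select_most_similar` and the `(label, similarity)` input format are hypothetical and not part of fse; the sketch also assumes "above" means strictly greater than the threshold.

```python
from typing import List, Tuple, Union


def select_most_similar(
    scored: List[Tuple[str, float]],
    topn: Union[int, float] = 10,
) -> List[Tuple[str, float]]:
    """Hypothetical helper (not the fse API) illustrating the proposal.

    `scored` holds (label, similarity) pairs that were already computed.
    """
    # Rank all candidates from most to least similar.
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    if topn < 1:
        # Values strictly below 1 are read as a similarity threshold.
        # Assumption: the comparison is strict ("above" the threshold).
        return [pair for pair in ranked if pair[1] > topn]
    # Otherwise keep the existing behaviour: return the topn best entries.
    return ranked[: int(topn)]


scored = [("a", 0.91), ("b", 0.42), ("c", 0.77), ("d", 0.15)]
print(select_most_similar(scored, topn=2))    # count-based selection
print(select_most_similar(scored, topn=0.4))  # threshold-based selection
```

One consequence of overloading topn this way is that topn=0 would be read as a threshold of 0 rather than as "return nothing", so a separate keyword argument may be the less ambiguous design.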

oborchers commented 3 years ago

Yes, this is absolutely correct. However, the current implementation is actually highly inefficient in terms of similarity search (brute force). I had plans to include approximate nearest neighbor search, but haven't found time to implement it.
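For illustration, a brute-force search of the kind described above compares the query against every stored vector, so its cost grows linearly with the number of sentences. The sketch below is an assumption about the general technique (cosine similarity over plain Python lists), not a copy of the fse code; the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Treat zero vectors as having no similarity to anything.
        return 0.0
    return dot / (norm_u * norm_v)


def brute_force_most_similar(
    query: Sequence[float],
    vectors: List[Sequence[float]],
    topn: int = 10,
) -> List[Tuple[int, float]]:
    """Hypothetical helper: score every vector, then keep the topn best."""
    # One similarity computation per stored vector: O(N * dim) per query.
    scores = [(idx, cosine(query, vec)) for idx, vec in enumerate(vectors)]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:topn]


vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
result = brute_force_most_similar([1.0, 0.0], vectors, topn=2)
print(result)
```

An approximate nearest neighbor index would avoid scoring every vector for each query, at the cost of occasionally missing a true neighbor.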