pixelogik / NearPy

Python framework for fast (approximated) nearest neighbour search in large, high-dimensional data sets using different locality-sensitive hashes.
MIT License

Question for finding string similarity #84

Open Shellcat-Zero opened 5 years ago

Shellcat-Zero commented 5 years ago

Hi,

I was hoping to leverage NearPy to find similarities between strings, but it's not clear to me how to query the engine with a string vector (if that's possible). My use case is that I have ~30 million names to store in the engine, and I have around 1.5 million names to submit as queries to find a best match from the engine. I was going to use your Redis storage adapter so that all of the queries could be submitted asynchronously. Please let me know if that is not a good use case for NearPy.

Thanks.

pixelogik commented 5 years ago

@Shellcat-Zero sorry for the long silence.

NearPy is very modular and allows users to customize the pipeline they are using.

It is, however, based on numerical vectors, so you would need to convert your strings to numerical vectors. I bet there are a couple of methods for this out there. The most straightforward way I can think of is to first lowercase the name and then map the string to an array of numbers based on the character values. Depending on the characters and the encoding you are using (UTF-8/UTF-16), this might result in values between 0 and 255 or much larger for each character position.

Another aspect you would need to consider is the maximum name length, in characters, because this would determine the dimension of your vector space.

Let's consider this example, where you have these names to store:

Pauline, Georgie, Peter, Sebastian

The maximum name length is 9 (Sebastian), so your vector space should be of (at least) dimension 9.

You would then turn those names into numerical vectors of size 9 each (one number per character, with shorter names padded, for example with zeros) and use the pipeline as usual.
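A minimal sketch of that conversion, in plain Python, might look like the following. The helper name `name_to_vector`, the zero padding for names shorter than the maximum length, and the use of Unicode code points (via `ord`) as the "character value" are all assumptions made for illustration; they are not part of NearPy.

```python
def name_to_vector(name, dimension, pad_value=0):
    """Map a name to a fixed-length list of numbers, one per character.

    Hypothetical helper: lowercases the name, takes the Unicode code
    point of each character, and pads short names with pad_value.
    """
    codes = [ord(ch) for ch in name.lower()]
    if len(codes) > dimension:
        raise ValueError("name is longer than the vector dimension")
    # Pad on the right so every vector has exactly `dimension` entries.
    return codes + [pad_value] * (dimension - len(codes))


names = ["Pauline", "Georgie", "Peter", "Sebastian"]

# The longest (lowercased) name determines the dimension of the vector space.
dimension = max(len(name.lower()) for name in names)

vectors = {name: name_to_vector(name, dimension) for name in names}
```

Each list in `vectors` could then be converted to a NumPy array and stored with `engine.store_vector(...)`, with queries encoded the same way before calling `engine.neighbours(...)`. Note that this encoding only compares characters position by position, so whether the resulting neighbours are useful matches for names would need to be checked on real data.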

However, it might be that NearPy is NOT the framework for your project. There are so many really good Python frameworks out there for language and string processing; maybe some of them would be a better pick:

https://spacy.io/
https://radimrehurek.com/gensim/
http://www.nltk.org/

More "learning" focused, but might be useful as well:

https://scikit-learn.org/stable/

I hope I am not too late with my response. Good luck with your project!