Question for finding string similarity

@Shellcat-Zero sorry for the long silence.

NearPy is very modular and allows users to customize the pipeline they are using.

It is, however, based on numerical vectors, so you would need to convert your strings to numerical vectors first. I bet there are a couple of methods for this out there. The most straightforward way I can think of is to lower-case the name and then map the string to an array of numbers based on each character's value. Depending on which encoding you are using (UTF-8/UTF-16), this might result in values between 0 and 255, or much larger, for each character position.
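To make the character mapping concrete, here is a minimal sketch in plain Python. The helper name `name_to_codes` is made up for illustration and is not part of NearPy. It uses Unicode code points via `ord()`, which is one possible choice; encoding to UTF-8 bytes instead keeps every value between 0 and 255 but can produce more than one number per character.

```python
def name_to_codes(name):
    """Lower-case a name and map each character to its Unicode code point.

    Hypothetical helper for illustration only, not a NearPy function.
    """
    return [ord(ch) for ch in name.lower()]


# Plain ASCII names stay below 128.
print(name_to_codes("Peter"))  # [112, 101, 116, 101, 114]

# A non-ASCII first letter (U+0141, lower-cased to U+0142) exceeds 255
# as a code point ...
print(name_to_codes("\u0141ukasz")[0])  # 322

# ... while the UTF-8 bytes stay within 0..255, using two bytes for it.
print(list("\u0141".lower().encode("utf-8")))  # [197, 130]
```

Which variant is preferable probably depends on whether your names contain non-ASCII characters at all.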

Another aspect you would need to consider is the maximum name length in characters, because this determines the dimension of your vector space.

Let's consider this example, where you have these names to store:

Pauline, Georgie, Peter, Sebastian

The maximum name length is 9 (Sebastian) so your vector space should be of (at least) dimension 9.

You would then turn those names into numerical vectors of size 9 each (one number per character, with shorter names padded, for example with zeros) and use the pipeline as usual.
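As a rough sketch of that step for the four example names: the code below pads each name with zeros to the common dimension. The names `name_to_vector` and `nearest`, and the choice of 0 as the padding value, are assumptions for illustration. The brute-force Euclidean search is only a stand-in so the example runs without extra dependencies; it is not how NearPy answers queries.

```python
NAMES = ["Pauline", "Georgie", "Peter", "Sebastian"]
DIM = max(len(name) for name in NAMES)  # 9, the length of "Sebastian"


def name_to_vector(name, dim=DIM, pad=0):
    """Map a name to a fixed-size list of code points (hypothetical helper).

    Longer names are truncated to `dim`; shorter ones are padded with `pad`.
    """
    codes = [ord(ch) for ch in name.lower()[:dim]]
    return codes + [pad] * (dim - len(codes))


vectors = {name: name_to_vector(name) for name in NAMES}


def nearest(query, stored):
    """Brute-force stand-in for a nearest-neighbour query (not NearPy)."""
    q = name_to_vector(query)

    def squared_distance(name):
        return sum((a - b) ** 2 for a, b in zip(q, stored[name]))

    return min(stored, key=squared_distance)


print(vectors["Peter"])          # [112, 101, 116, 101, 114, 0, 0, 0, 0]
print(nearest("Petr", vectors))  # Peter
```

With NearPy, the same fixed-size vectors would presumably be converted to NumPy arrays and stored in the engine instead of a dict. Keep in mind that this positional encoding compares characters slot by slot, so an inserted or missing letter near the start of a name shifts every later position.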

However, it might be that NearPy is NOT the framework for your project. There are so many really good Python frameworks out there for language and string processing; maybe some of them would be a better pick:

https://spacy.io/
https://radimrehurek.com/gensim/
http://www.nltk.org/

More "learning" focused, but might be useful as well:

https://scikit-learn.org/stable/

I hope I am not too late with my response. Good luck with your project!
