google / personfinder

Person Finder is a searchable missing person database written in Python and hosted on App Engine.
https://google.org/personfinder
Apache License 2.0
536 stars 195 forks

Allow spelling variants of Indian names #345

Open gimite opened 7 years ago

gimite commented 7 years ago

This is feedback provided by a user:

Some common Indian names have several spellings when written in English. Transliteration followed by reverse transliteration could be used to automatically generate other spelling variants of a name at query time or at record creation time. Examples of equivalent names that have the same characters in Hindi script: Ajay/Ajai, Rohit/Rohith, Sita/Seeta/Sitha/Seetha

I'm not familiar with Indian languages, so I don't know what can be a feasible solution. Suggestions are welcome.

Is there an existing library that provides a list of equivalent names, or that transliterates Indian names? Would some kind of dictionary be sufficient? Could it be solved with machine learning?

singhai0 commented 7 years ago

I remember a friend mentioning this issue a few days ago. My first instinct was to use simple fuzzy matching (something akin to what fuse does). While it works well on names, I'm not familiar with the scale of the problem (for instance, how it would work with a large number of entries).
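A rough sketch of the fuzzy-matching idea, using Python's standard `difflib` rather than fuse. The function names and the 0.8 threshold are assumptions for illustration only. It scans every name linearly, so it says nothing about how the approach would behave on a large index.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_lookup(query, names, threshold=0.8):
    """Return the names whose similarity to query meets the threshold.

    Results are ordered from most to least similar. The threshold is an
    arbitrary illustrative value, not a tuned one.
    """
    scored = [(similarity(query, name), name) for name in names]
    scored.sort(reverse=True)
    return [name for score, name in scored if score >= threshold]


if __name__ == "__main__":
    print(fuzzy_lookup("Rohit", ["Rohith", "Rahul", "Sita"]))
    print(similarity("Sita", "Seetha"))
```

Pairs that differ by one trailing letter (Rohit/Rohith) score high, while pairs that differ in several places (Sita/Seetha) score much lower, so a single threshold may either miss real variants or admit unrelated names.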

Reverse transliteration using deep belief networks is also a possibility.

What say @gimite?

gimite commented 7 years ago

We use App Engine's Search API (*1): https://cloud.google.com/appengine/docs/standard/python/search/ Any solution should be implemented on top of it.

For example, Japanese has multiple scripts that can represent the same sound, e.g., 山田, やまだ, ヤマダ for Yamada. To let them match the other variants, we normalize all of them into the Latin alphabet (Yamada) in the index: https://github.com/google/personfinder/blob/master/app/script_variant.py Maybe we can do something similar?

*1: It's still not enabled by default, but we are trying to switch to it.
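
To make the "normalize in the index" idea concrete for the Indian name examples above, here is a minimal sketch. It is not taken from script_variant.py. The function name `variant_key` and the three rewrite rules are hypothetical, chosen only to cover the examples in this thread. A real rule set would need input from people who know the languages involved.

```python
import re

# Hypothetical rewrite rules (an assumption, not an established scheme).
# They only cover the spelling pairs mentioned in this thread.
_RULES = [
    (re.compile(r"ee"), "i"),    # Seeta -> Sita
    (re.compile(r"th"), "t"),    # Rohith -> Rohit, Sitha -> Sita
    (re.compile(r"ay$"), "ai"),  # Ajay -> Ajai (end of name only)
]


def variant_key(name):
    """Collapse a Latin-script name into a key shared by its spelling variants."""
    key = name.strip().lower()
    for pattern, replacement in _RULES:
        key = pattern.sub(replacement, key)
    return key


if __name__ == "__main__":
    for group in (["Ajay", "Ajai"],
                  ["Rohit", "Rohith"],
                  ["Sita", "Seeta", "Sitha", "Seetha"]):
        print(group, sorted({variant_key(n) for n in group}))
```

The key would be stored as an extra indexed field next to the original name, and the query would be passed through the same function before searching. Rules this coarse will also merge some names that are genuinely different, so they trade precision for recall.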