Open gimite opened 7 years ago
I remember a friend mentioning this issue a few days ago. My first instinct was to use simple fuzzy matching (something akin to what fuse does). While it works well on names, I'm not familiar with the scale of the problem (for instance, how it would work with a large number of entries).
Reverse transliteration using deep belief networks is also a possibility.
What say @gimite?
We use AppEngine's Search API (*1): https://cloud.google.com/appengine/docs/standard/python/search/ So it should be implemented on top of it.
e.g., There are multiple scripts with the same sound in Japanese e.g., 山田, やまだ, ヤマダ for Yamada. To allow them to match with other variants, we normalize all of them into Latin alphabet (Yamada) in the index: https://github.com/google/personfinder/blob/master/app/script_variant.py Maybe we can do something similar?
*1: It's still not enabled by default, but we are trying to switch to it.
This is a feedback provided by someone:
I'm not familiar with Indian languages, so I don't know what can be a feasible solution. Suggestions are welcome.
Any existing library to provide a list of equivalent names, or perform transliteration of Indian names? Would it be sufficient with a some kind of a dictionary? Can it be solved with some machine learning?