microsoft / PhoneticMatching

A phonetic matching library. Includes text utilities to do string comparisons on phonemes (the sound of the string), as opposed to characters.
MIT License

Support other Language #19

Open AyaAshrafSABER opened 4 years ago

AyaAshrafSABER commented 4 years ago

Why does PhoneticMatching support only English?

Mmdixon commented 4 years ago

The repo uses Flite for grapheme-to-phoneme conversion [G2P] (text input to sound). Mappings also exist for languages to go from phonemes to a vector space [P2V], which allows distance matching because similar sounds sit near each other in that space.
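To make the two stages concrete, here is a minimal sketch of the G2P then P2V then distance idea. It is not this library's API: `TOY_G2P`, `TOY_P2V`, `phoneme_cost`, and `phonetic_distance` are hypothetical names, the lookup table stands in for a real engine such as Flite, and the vector values are made up for illustration.

```python
from math import sqrt

# Hypothetical stand-in for a G2P engine: word -> phoneme sequence.
TOY_G2P = {
    "cat": ["k", "ae", "t"],
    "cad": ["k", "ae", "d"],
    "mat": ["m", "ae", "t"],
}

# Hypothetical P2V table: phoneme -> (voiced, nasal, place, vowel).
# The numbers are illustrative only, not real articulatory measurements.
TOY_P2V = {
    "k": (0.0, 0.0, 1.0, 0.0),
    "t": (0.0, 0.0, 0.5, 0.0),
    "d": (1.0, 0.0, 0.5, 0.0),
    "m": (1.0, 1.0, 0.0, 0.0),
    "ae": (1.0, 0.0, 0.0, 1.0),
}

GAP_COST = 1.0  # assumed cost of inserting or deleting one phoneme


def phoneme_cost(a, b):
    """Euclidean distance between two phonemes in the toy vector space."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(TOY_P2V[a], TOY_P2V[b])))


def phonetic_distance(word_a, word_b):
    """Edit distance over phonemes, with substitutions weighted by P2V distance."""
    p, q = TOY_G2P[word_a], TOY_G2P[word_b]
    # dist[i][j] = cheapest way to turn the first i phonemes of p into the first j of q
    dist = [[0.0] * (len(q) + 1) for _ in range(len(p) + 1)]
    for i in range(1, len(p) + 1):
        dist[i][0] = i * GAP_COST
    for j in range(1, len(q) + 1):
        dist[0][j] = j * GAP_COST
    for i in range(1, len(p) + 1):
        for j in range(1, len(q) + 1):
            dist[i][j] = min(
                dist[i - 1][j] + GAP_COST,  # delete a phoneme from p
                dist[i][j - 1] + GAP_COST,  # insert a phoneme from q
                dist[i - 1][j - 1] + phoneme_cost(p[i - 1], q[j - 1]),  # substitute
            )
    return dist[len(p)][len(q)]
```

With these toy values, "cat" comes out closer to "cad" than to "mat", because /t/ and /d/ differ in only one feature while /k/ and /m/ differ in three. Both tables encode an English view of the sounds, which is the limitation discussed in this thread.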

However, generating phonemes is language dependent, and at the time the G2P engine in use could only generate them as an English speaker would pronounce the text. This means the distance matching measures whether two strings sound similar to an English speaker. The trivial case is when everything being compared is an English word or name. Comparing non-English words or names can also make sense in the context of an English speaker trying to pronounce a foreign name in their contact list. But in the context of a native speaker of another language disambiguating between native-sounding words, names, and so on, I would expect poor results, especially when the words differ only in sounds that are a subtle distinction in English but carry a big difference in meaning in the native language.

To support other languages we would need a way to produce G2P for each language, ideally with an engine that scales easily to additional languages. Sources for P2V mappings already exist, so that component is doable.
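One way this could be structured, purely as a sketch: keep the matcher independent of the G2P engine and look the engine up by language. Nothing below exists in this repo; `register_g2p`, `to_phonemes`, and the toy English engine are hypothetical names used only for illustration.

```python
from typing import Callable, Dict, List

# A G2P engine is assumed to be any callable from text to a phoneme list.
G2P = Callable[[str], List[str]]

_ENGINES: Dict[str, G2P] = {}


def register_g2p(language: str, engine: G2P) -> None:
    """Associate a language code with a G2P engine (hypothetical registry)."""
    _ENGINES[language] = engine


def to_phonemes(text: str, language: str) -> List[str]:
    """Run the engine registered for the language, or fail loudly if there is none."""
    try:
        engine = _ENGINES[language]
    except KeyError:
        raise ValueError(f"no G2P engine registered for {language!r}") from None
    return engine(text)


def toy_english_g2p(text: str) -> List[str]:
    """Toy stand-in for a real engine; it only knows a single word."""
    return {"cat": ["k", "ae", "t"]}.get(text.lower(), [])


register_g2p("en", toy_english_g2p)
```

Under this layout, adding a language means registering one more engine, and a request for an unregistered language raises an error instead of silently falling back to English pronunciation.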