Add functionality to detect transliteration system(s) used for a given phrase

interscript / interscript-ruby

Interoperable script conversion systems (ISCS) with the `interscript` gem

Add functionality to detect transliteration system(s) used for a given phrase #728

Closed ronaldtse closed 3 years ago

ronaldtse commented 3 years ago

Input: a string. Output: the transliteration systems whose output matches this phrase. List exact matches and close matches (based on edit distance?).

ronaldtse commented 3 years ago

One sample use case is correcting the transliteration entries in the GNDB. It seems that quite a number of the transliteration pairs there do not use the correct system.

webdev778 commented 3 years ago

An idea: what if there is more than one input?

1. Input string (e.g. こんにちは)
2. Expected output (e.g. konnichiha)
3. (Optional) Character spaces to consider (e.g. Jpan-Latn, or -Latn, or -*)

The return value could be something like a Hash[String -> Float], where each String is a map that was tested and each Float is a similarity score (a Levenshtein distance, the percentage of matching characters, or some other kind of string distance). We could also use the #transliterate_each method to consider all possible transliterations.

The idea as presented in this post would be easy to implement, but only if we know the input.
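
The hash-of-scores idea above can be sketched roughly as follows. This is a minimal illustration and not the gem's implementation: the two "maps" are hypothetical lambdas with made-up names standing in for real interscript maps, and `detect_systems`, `similarity` and `levenshtein` are helper names invented for this example. The score is a Levenshtein distance normalised into 0.0..1.0, which is only one of the options mentioned.

```ruby
# Hypothetical stand-ins for real transliteration maps (names are illustrative).
KANA_STRICT = { "こ" => "ko", "ん" => "n", "に" => "ni", "ち" => "chi", "は" => "ha" }.freeze
KANA_PARTICLE = KANA_STRICT.merge("は" => "wa").freeze

CANDIDATE_MAPS = {
  "demo-hira-latn-strict"   => ->(s) { s.each_char.map { |c| KANA_STRICT.fetch(c, c) }.join },
  "demo-hira-latn-particle" => ->(s) { s.each_char.map { |c| KANA_PARTICLE.fetch(c, c) }.join }
}.freeze

# Dynamic-programming Levenshtein distance, keeping only one previous row.
def levenshtein(a, b)
  a = a.chars
  b = b.chars
  prev = (0..b.size).to_a
  a.each_with_index do |ca, i|
    curr = [i + 1]
    b.each_with_index do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j + 1] + 1, curr[j] + 1, prev[j] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Similarity in 0.0..1.0, where 1.0 means the strings are identical.
def similarity(a, b)
  longest = [a.length, b.length].max
  return 1.0 if longest.zero?

  1.0 - levenshtein(a, b).to_f / longest
end

# Returns a Hash of map name => similarity score, best match first.
def detect_systems(input, expected, maps: CANDIDATE_MAPS)
  scores = maps.transform_values { |map| similarity(map.call(input), expected) }
  scores.sort_by { |_name, score| -score }.to_h
end

result = detect_systems("こんにちは", "konnichiha")
result.each { |name, score| puts format("%-26s %.2f", name, score) }
```

With these toy maps, the strict map reproduces the expected output exactly and ranks first, while the particle-aware map differs by one character and scores slightly lower. An exact match is simply a score of 1.0, and "close matches" can be defined by whatever threshold suits the caller.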

ronaldtse commented 3 years ago

Agreed on using a hash as the return data type.

In 3 you probably mean “conversion system selection”, as a way to decide which systems to try.

We could also implement several string distance scores for users to choose from.
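
One way to let users choose between several string distance scores is a small registry of scoring functions keyed by name. The sketch below is an assumption about how that could look, not the gem's API: `METRICS`, `score` and `edit_distance` are names invented for this example, and every metric returns a similarity in 0.0..1.0 so results stay comparable.

```ruby
# Levenshtein edit distance, computed row by row.
def edit_distance(a, b)
  a = a.chars
  b = b.chars
  prev = (0..b.size).to_a
  a.each_with_index do |ca, i|
    curr = [i + 1]
    b.each_with_index do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j + 1] + 1, curr[j] + 1, prev[j] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Hypothetical registry of selectable metrics.
METRICS = {
  # 1.0 for an exact match, 0.0 otherwise.
  exact: ->(a, b) { a == b ? 1.0 : 0.0 },
  # Edit distance scaled by the length of the longer string.
  levenshtein: ->(a, b) {
    longest = [a.length, b.length].max
    longest.zero? ? 1.0 : 1.0 - edit_distance(a, b).to_f / longest
  },
  # Share of positions that hold the same character in both strings.
  matching_chars: ->(a, b) {
    longest = [a.length, b.length].max
    same = a.chars.zip(b.chars).count { |x, y| x == y }
    longest.zero? ? 1.0 : same.to_f / longest
  }
}.freeze

# Scores a candidate transliteration against the expected output.
# Raises KeyError for an unknown metric name.
def score(candidate, expected, metric: :levenshtein)
  METRICS.fetch(metric).call(candidate, expected)
end

%i[exact levenshtein matching_chars].each do |m|
  puts "#{m}: #{score('konichiha', 'konnichiha', metric: m)}"
end
```

The metrics can disagree noticeably: a single dropped character costs little under edit distance but shifts every later position, so the position-based score drops much further. That difference is one reason to leave the choice to the user.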

ronaldtse commented 3 years ago

Thanks @webdev778! Can you also help add documentation for interscript.org?

webdev778 commented 3 years ago

PR #731