interscript / interscript-ruby

Interoperable script conversion systems (ISCS) with the `interscript` gem

Mandarin transliteration in HK data #200

Open ronaldtse opened 4 years ago

ronaldtse commented 4 years ago


(Originally in #39)

Extracted a list of Mandarin transliterations from the Hong Kong dataset and created a toneless pinyin map for testing.

  1. Spacing. Hanyu Pinyin (zho_Hani2Latn_GCH_1979) has detailed rules on word segmentation, which have not yet been implemented. Whether a space is needed depends on a number of factors and cannot be handled by mapping rules alone. For example, the place names below all contain the character "灣", but only the first and third rows are transliterated as one word.

[image: table of place names containing 灣]

A separate parsing layer may be needed to handle space insertion (related to #44).

  2. Syllable separator for zero-onset syllables. Syllables beginning with a, o, or e should be preceded by a syllable separator unless they are the first syllable of a word, e.g. 西安 Xi’an.

  3. Hong Kong-specific readings. 涌: Chong; 仔: Zai; 咀: Zui (<嘴).
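
To make the points above concrete, here is a rough Python sketch of how a layer on top of the character map might behave. It is not how interscript implements anything. The names `BASE_READINGS`, `HK_READINGS`, `join_syllables`, and `transliterate` are hypothetical, the base readings are a tiny hand-written sample rather than the attached map, and word segmentation is assumed to be supplied by the caller, since it cannot be derived from mapping rules alone.

```python
# Tiny hand-written sample of toneless readings (assumption: a real run
# would load these from the toneless pinyin map instead).
BASE_READINGS = {
    "西": "xi",
    "安": "an",
    "灣": "wan",
}

# Hong Kong-specific readings listed in this issue.
HK_READINGS = {
    "涌": "chong",
    "仔": "zai",
    "咀": "zui",
}

# HK-specific entries take precedence over the base map.
READINGS = {**BASE_READINGS, **HK_READINGS}

# Letters that mark a zero-onset syllable.
ZERO_ONSET = ("a", "o", "e")


def join_syllables(syllables):
    """Join the syllables of one word, inserting the separator before any
    non-initial syllable that begins with a, o, or e."""
    parts = []
    for index, syllable in enumerate(syllables):
        if index > 0 and syllable.startswith(ZERO_ONSET):
            parts.append("’")
        parts.append(syllable)
    return "".join(parts)


def transliterate(words):
    """Transliterate pre-segmented words (each a string of Han characters).

    Segmentation is the caller's job here; this function only decides
    what happens inside each word and puts a space between words.
    """
    out = []
    for word in words:
        syllables = [READINGS[ch] for ch in word]
        out.append(join_syllables(syllables).capitalize())
    return " ".join(out)


print(transliterate(["西安"]))
print(transliterate(["灣仔"]))
```

Keeping segmentation as an explicit input mirrors the suggestion of a separate parsing layer: the same characters can then be passed as one word or as several without touching the map.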

Toneless Pinyin map with HK place names: cn-chn-Hans-Latn-pinyin_toneless.yaml.zip

Originally posted by @chaaklau in https://github.com/riboseinc/interoperable-transliteration/issues/39#issuecomment-571932571

ronaldtse commented 4 years ago

Also see https://github.com/riboseinc/interoperable-transliteration/issues/39#issuecomment-571974307