Extracted a list of Mandarin transliterations from the Hong Kong dataset and created a toneless pinyin map for testing.
Spacing
Hanyu Pinyin (zho_Hani2Latn_GCH_1979) has detailed rules on word segmentation. These rules have not yet been implemented. Whether a space is needed depends on a number of factors and cannot be handled by mapping rules alone. For example, the place names below all contain the character "灣", but only the first and third rows are transliterated as one word.
A separate parsing layer may be needed to handle the insertion of spaces (related to #44).
Syllable separator for zero-onset syllables
Syllables beginning with a, o, or e should be preceded by the syllable separator ’ unless they are the first syllable of a word, e.g. 西安 Xi’an.
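The separator rule above could be applied when joining the syllables of a single word. The following is only a minimal sketch, not how the mapping is implemented: the function name `join_syllables` is hypothetical, and it assumes the input is already split into toneless, lower-case syllables belonging to one word.

```python
SEPARATOR = "\u2019"  # the ’ character used in the example above

def join_syllables(syllables):
    """Join the toneless pinyin syllables of one word.

    A separator is inserted before any non-initial syllable that
    starts with a, o, or e (a zero-onset syllable).
    """
    word = ""
    for index, syllable in enumerate(syllables):
        # syllable[:1] is safe even if the syllable is an empty string
        if index > 0 and syllable[:1].lower() in ("a", "o", "e"):
            word += SEPARATOR
        word += syllable
    return word

print(join_syllables(["xi", "an"]).capitalize())
```

Because the rule depends on knowing which syllables belong to the same word, it probably has to run after word segmentation rather than inside the character map.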
Hong Kong specific reading
涌: Chong
仔: Zai
咀: Zui (<嘴)
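One possible way to handle these readings is an override table that is consulted before the general map. This is a sketch under assumptions: `HK_READINGS`, `BASE_MAP`, and `to_syllables` are hypothetical names, and `BASE_MAP` holds only two illustrative entries rather than the contents of the attached map.

```python
# Hong Kong specific readings listed above (toneless, lower-case).
HK_READINGS = {"涌": "chong", "仔": "zai", "咀": "zui"}

# Tiny stand-in for the general toneless pinyin map (illustrative only).
BASE_MAP = {"東": "dong", "涌": "yong"}

def to_syllables(text, base_map=BASE_MAP, overrides=HK_READINGS):
    """Map each character to a toneless syllable, preferring overrides.

    Characters found in neither table are passed through unchanged.
    """
    result = []
    for char in text:
        if char in overrides:
            result.append(overrides[char])
        else:
            result.append(base_map.get(char, char))
    return result

print(to_syllables("東涌"))
```

Keeping the overrides in a separate table means the general map stays usable for non-Hong Kong data.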
Mandarin transliteration in HK data
(Originally in #39)
Toneless Pinyin Map with HK place names: cn-chn-Hans-Latn-pinyin_toneless.yaml.zip
Originally posted by @chaaklau in https://github.com/riboseinc/interoperable-transliteration/issues/39#issuecomment-571932571