Numbered Pinyin issues encountered in CEDICT

tsroten / dragonmapper

Identification and conversion functions for Chinese text processing

MIT License

Hi, Dragonmapper is an awesome library. I am using it (0.2.6) for many projects, which use CEDICT as a data source for further text processing. I found problems with numbered pinyin, accented pinyin, and zhuyin fuhao transcriptions.

Before I begin, I want to note that I am not a Mandarin expert, so I don't know whether my suggestions are the correct ones. A lot of my suggested clean-up edits to CEDICT have been accepted. However, since CEDICT is not in a standard format like .csv, I had to build my own parser, read the data line by line, and .split() it to feed Dragonmapper. I'm not sure whether every issue I've discovered should be solved by Dragonmapper. I will simply present the problems I needed to work around and leave the rest up to discussion.

Issues

Numbered Pinyin do not convert to Accented
Accented pinyin which do not convert to zhuyin fuhao
Already noted in issue 27
Taiwanese pronunciation exceptions

Numbered Pinyin do not convert to Accented Pinyin

More than 2000 entries in CEDICT contain 'u:' combinations, and 'yo1' and 'yo5' together account for another 5 entries. Dragonmapper cannot convert these items from numbered pinyin to accented pinyin. I found it necessary to loop through the following replacements in this order:

'u:4', 'ǜ'
'u:3', 'ǚ'
'u:2', 'ǘ'
'u:1', 'ǖ'
'u:', 'ü'
'yo1', 'yō'
'yo5', 'yo'

These items raise 'ValueError: Not a valid syllable:' exceptions.
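The replacement loop described above can be sketched roughly as follows. This is a minimal illustration, not part of Dragonmapper: `REPLACEMENTS` and `patch_cedict_pinyin` are hypothetical names, and the function is meant to run on the numbered pinyin field before it is handed to the library.

```python
# Ordered (old, new) pairs taken from the list above. The toned forms
# must come before the bare 'u:' so that 'u:4' is not turned into 'ü4'.
REPLACEMENTS = [
    ('u:4', 'ǜ'),
    ('u:3', 'ǚ'),
    ('u:2', 'ǘ'),
    ('u:1', 'ǖ'),
    ('u:', 'ü'),
    ('yo1', 'yō'),
    ('yo5', 'yo'),
]


def patch_cedict_pinyin(numbered):
    """Apply the workaround replacements, in order, to one pinyin string."""
    for old, new in REPLACEMENTS:
        numbered = numbered.replace(old, new)
    return numbered
```

Because the pairs are applied in list order, a syllable such as 'lu:4' is matched by the 'u:4' rule first, while 'nu:e4' only matches the bare 'u:' rule and keeps its tone number.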

Accented pinyin which do not convert to zhuyin fuhao

I also encountered the following items which do not convert correctly:

'ó':'ㄛˊ' # 哦哦 [o2] /oh (interjection indicating doubt or surprise)/
'ò':'ㄛˋ' # 哦哦 [o4] /oh (interjection indicating that one has just learned sth)/
'ō':'ㄛ'
'ǒ':'ㄛˇ'
'yō':'ㄧㄛ'
'yo':'ㄧㄛ˙'
'dia3':'ㄉㄧㄚˇ' # diǎ 嗲嗲 [dia3] /coy/childish/
'm2':'ㄇˊ'
'm4':'ㄇˋ'
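One way to apply the table above is to check it before calling the library at all. The sketch below is only an illustration: `FALLBACK_ZHUYIN` and `to_zhuyin_with_fallback` are hypothetical names, and the real converter is passed in as `convert` so the helper does not assume anything about Dragonmapper's internals.

```python
# Workaround table copied from the list above (syllable -> zhuyin).
FALLBACK_ZHUYIN = {
    'ó': 'ㄛˊ',
    'ò': 'ㄛˋ',
    'ō': 'ㄛ',
    'ǒ': 'ㄛˇ',
    'yō': 'ㄧㄛ',
    'yo': 'ㄧㄛ˙',
    'dia3': 'ㄉㄧㄚˇ',
    'm2': 'ㄇˊ',
    'm4': 'ㄇˋ',
}


def to_zhuyin_with_fallback(syllable, convert):
    """Return the table entry if there is one, else call convert()."""
    if syllable in FALLBACK_ZHUYIN:
        return FALLBACK_ZHUYIN[syllable]
    return convert(syllable)
```

In practice `convert` would be something like `dragonmapper.transcriptions.to_zhuyin`. The table is consulted first because some of these syllables may convert incorrectly rather than raise an exception.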

Already noted in issue 27

https://github.com/tsroten/dragonmapper/issues/27

'tēi':'ㄊㄨㄟ' # Workaround for 忒忒 [tei1] /(dialect) too/very/also pr. [tui1]/
'eng1':'ㄥ' # Work around for ēng 鞥鞥 [eng1] /reins/

Taiwanese Pronunciation Exceptions

I found it necessary to skip entries containing the Taiwanese pronunciations ['khè', 'goá', 'khàu', 'ô', 'yai2']. I'm not sure anything can be done about this within Dragonmapper.
For example, dragonmapper.hanzi.to_zhuyin('goá') results in 'ValueError: Not a valid syllable: o5'.

I'm returning two years later to provide some additional help. The best advice I can give is to test your input before trying to run dragonmapper on it.
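As a rough illustration of testing input first, the sketch below runs a converter over a batch and collects the failures instead of stopping at the first one. `convert_all` is a hypothetical helper, and it relies only on the behaviour shown in this thread, namely that invalid input raises `ValueError`.

```python
def convert_all(items, convert):
    """Split items into successful conversions and recorded failures."""
    converted = {}
    failed = {}
    for item in items:
        try:
            converted[item] = convert(item)
        except ValueError as exc:
            # Keep the error message so the entry can be inspected later.
            failed[item] = str(exc)
    return converted, failed
```

Passing `dragonmapper.transcriptions.to_zhuyin` as `convert` gives a list of entries that still need sanitizing or a manual workaround.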

Always Use `.lower()` Before Calling `.to_zhuyin()`

Use .lower() to avoid errors like this: ValueError: Not a valid syllable: Ān

Fix Incorrect Pinyin Vowel/Tone Characters

There are a lot of strange encodings out there. Dragonmapper doesn't work with these characters. Sanitize the input.

yourstring.replace('á','á').replace('ǎ','ǎ').replace('ē','ē').replace('é', 'é').replace('ī','ī').replace('ǐ','ǐ').replace('ì','ì').replace('ò','ò').replace('ū','ū').replace( 'ǔ','ǔ').replace('ù', 'ù')
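If the stray characters are decomposed sequences (a base letter followed by a combining tone mark), Unicode normalization is a shorter alternative to chaining replacements. That is an assumption about the input: lookalike characters of a different kind, such as a breve used in place of a caron, are not fixed by normalization and still need explicit replacements. `normalize_pinyin` is a hypothetical helper name.

```python
import unicodedata


def normalize_pinyin(text):
    """Compose base letter + combining mark into one precomposed character."""
    return unicodedata.normalize('NFC', text)


# 'a' followed by U+0301 (combining acute) becomes the single character U+00E1.
decomposed = 'a\u0301'
composed = normalize_pinyin(decomposed)
print(len(decomposed), len(composed))
```

NFC leaves already-precomposed text unchanged, so it is safe to run on every string before conversion.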

Edge Case: Handling Single Latin Consonants in Strings with `to_zhuyin()`

Let's say you have strings containing single latin consonants:

the_pinyin_input = 'X fēn zhī Y'
the_pinyin_input = 'X guāng'

Calling dragonmapper.transcriptions.to_zhuyin(the_pinyin_input) results in ValueError: String is not a valid Chinese transcription.

It is possible to .split() these and run try/except blocks on them, but there might be a better test available:

if 1 in [len(x) for x in the_pinyin_input.split()]:
    print(' '.join([dragonmapper.transcriptions.to_zhuyin(x) if len(x) > 1 else x for x in the_pinyin_input.split()]))

This splits the string into a list ['X', 'fēn', 'zhī', 'Y'] and tests the length of each element. The second list comprehension only converts the longer elements. Unfortunately, this test also fires on many strings that contain single non-letter characters, such as / and .

Here's a slightly better alternative, testing for consonants in a string.

import string
if any(x in string.ascii_lowercase.strip('aeiu') for x in the_pinyin_input.split()):
    print(' '.join([dragonmapper.transcriptions.to_zhuyin(x) if len(x) > 1 else x for x in the_pinyin_input.split()]))

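This uses the same list comprehension. Unfortunately, this if statement also fires on ordinary syllables such as 的 (de, ㄉㄜ˙). The reason is that `x in some_string` is a substring test, and `.strip('aeiu')` only trims characters from the two ends of the alphabet string, so only the leading 'a' is removed and 'de' still matches inside 'bcdef'.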
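A stricter version of the test checks each token individually: it must be exactly one character long and that character must be a consonant. The sketch below is one possible way to write it, not a tested fix. `CONSONANTS`, `is_single_consonant`, and `convert_mixed` are hypothetical names, treating a, e, i, o, u as the vowels is an assumption, and the converter is passed in as `convert`.

```python
import string

VOWELS = set('aeiou')
CONSONANTS = set(string.ascii_lowercase) - VOWELS


def is_single_consonant(token):
    """True only for one-character tokens that are Latin consonants."""
    return len(token) == 1 and token.lower() in CONSONANTS


def convert_mixed(text, convert):
    """Convert every token except stand-alone consonants such as 'X'."""
    return ' '.join(
        tok if is_single_consonant(tok) else convert(tok)
        for tok in text.split()
    )
```

With `convert=dragonmapper.transcriptions.to_zhuyin`, a string like 'X fēn zhī Y' would keep 'X' and 'Y' as they are and convert the two syllables. Tokens such as 'de' or '/' are not treated as consonants, so 'de' is converted normally and '/' still reaches the converter, where it may need its own handling.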

Hope this helps some people in the future.
