bai-yi-bai opened this issue 3 years ago
I'm returning two years later to provide some additional help. The best advice I can give is to test your input before trying to run dragonmapper on it.
.lower() Before Calling .to_zhuyin()
Use .lower() to avoid errors like this:
ValueError: Not a valid syllable: Ān
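A minimal sketch of that pre-processing step (plain Python, nothing dragonmapper-specific assumed):

```python
# Dragonmapper rejects capitalized syllables such as 'Ān'.
# str.lower() is Unicode-aware, so the tone mark survives lowercasing.
the_pinyin_input = 'Ān'
safe_input = the_pinyin_input.lower()
print(safe_input)  # ān
```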
There are a lot of strange encodings out there. In particular, pinyin copied from the web is often stored in decomposed form (a base letter plus a combining diacritic), which Dragonmapper does not accept. Sanitize the input first:
yourstring.replace('á','á').replace('ǎ','ǎ').replace('ē','ē').replace('é', 'é').replace('ī','ī').replace('ǐ','ǐ').replace('ì','ì').replace('ò','ò').replace('ū','ū').replace( 'ǔ','ǔ').replace('ù', 'ù')
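The replace() chain above maps decomposed sequences to their precomposed equivalents (the two arguments of each call render identically, which is exactly the problem). A shorter alternative, assuming the goal really is composing combining marks, is Unicode NFC normalization:

```python
import unicodedata

# 'a' + U+0301 (combining acute) renders like 'á' but is two code points;
# NFC normalization folds it into the single precomposed character U+00E1
# that dragonmapper's syllable tables expect.
decomposed = 'a\u0301n'
composed = unicodedata.normalize('NFC', decomposed)
print(composed == '\u00e1n')  # True
```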
to_zhuyin()
Let's say you have strings containing single Latin consonants:
the_pinyin_input = 'X fēn zhī Y'
the_pinyin_input = 'X guāng'
Calling dragonmapper.transcriptions.to_zhuyin(the_pinyin_input) results in ValueError: String is not a valid Chinese transcription.
It is possible to .split() these and run try/except blocks on them, but there might be a better test available:
if 1 in [len(x) for x in the_pinyin_input.split()]:
    print(' '.join([dragonmapper.transcriptions.to_zhuyin(x) if len(x) > 1 else x for x in the_pinyin_input.split()]))
This splits the string into the list ['X', 'fēn', 'zhī', 'Y'] and tests the length of each element; the list comprehension then converts only the elements longer than one character. Unfortunately, the test also fires on many strings containing single non-letter characters, such as '/' and '.'.
Here's a slightly better alternative, testing for consonants in a string.
import string
consonants = set(string.ascii_lowercase) - set('aeiou')
if any(c in consonants for x in the_pinyin_input.split() for c in x.lower()):
    print(' '.join([dragonmapper.transcriptions.to_zhuyin(x) if len(x) > 1 else x for x in the_pinyin_input.split()]))
This uses the same list comprehension. Unfortunately, the if statement also fires on ordinary pinyin such as de (的, ㄉㄜ˙), since almost every pinyin syllable contains an ASCII consonant.
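The try/except route mentioned earlier can be sketched per token. Everything below is hypothetical helper code, not part of dragonmapper; the stub converter only imitates the ValueError behaviour so the sketch runs without the library installed:

```python
def safe_convert(token, convert):
    """Convert one pinyin token, falling back to the token itself
    when the converter rejects it (e.g. a lone Latin letter)."""
    try:
        return convert(token)
    except ValueError:
        return token

# With dragonmapper installed, pass dragonmapper.transcriptions.to_zhuyin
# as `convert`; this stub merely mimics its rejection of 1-letter tokens.
def fake_to_zhuyin(token):
    if len(token) == 1:
        raise ValueError('String is not a valid Chinese transcription.')
    return '<zhuyin>'

print(' '.join(safe_convert(t, fake_to_zhuyin) for t in 'X fēn zhī Y'.split()))
```

This keeps the whole string intact instead of pre-testing it, at the cost of one exception per unconvertible token.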
Hope this helps some people in the future.
Hi, Dragonmapper is an awesome library. I am using it (0.2.6) in many projects that use CEDICT as a data source for further text processing. I found problems with numbered pinyin, accented pinyin, and zhuyin fuhao transcriptions.
Before I begin, I want to note that I am not a Mandarin expert, so I don't know whether my suggestions are the correct ones. A lot of my suggested clean-up edits to CEDICT have been accepted. However, since CEDICT is not in a standard format like .csv, I had to build my own parser, read the data line by line, and .split() it to feed Dragonmapper. I'm not sure whether every issue I've discovered should be solved by Dragonmapper; I will simply present the problems I needed to work around and leave it up to discussion.
Issues
Numbered Pinyin do not convert to Accented Pinyin
More than 2,000 entries in CEDICT contain 'u:' combinations, and 'yo1' and 'yo5' appear in a combined 5 entries; Dragonmapper cannot convert any of these from numbered pinyin to accented pinyin. I found it necessary to loop through in this order:
These items raise 'ValueError: Not a valid syllable:' exceptions.
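As a sketch of one workaround for the 'u:' entries (my own helper, not dragonmapper API): CC-CEDICT writes the vowel ü as 'u:' in numbered pinyin, e.g. nu:3 for nǚ. Whether Dragonmapper's numbered-pinyin parser then prefers 'ü' or 'v' is an assumption to verify against your version.

```python
# CC-CEDICT encodes the vowel ü as 'u:' in numbered pinyin ('nu:3', 'lu:e4').
# Rewriting it is one possible pre-processing step before calling dragonmapper.
def normalize_cedict_umlaut(syllable):
    return syllable.replace('u:', 'ü').replace('U:', 'Ü')

print(normalize_cedict_umlaut('nu:3'))   # nü3
print(normalize_cedict_umlaut('lu:e4'))  # lüe4
```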
Accented pinyin which do not convert to zhuyin fuhao
I also encountered the following items, which do not convert correctly:
Already noted in issue 27: https://github.com/tsroten/dragonmapper/issues/27
Taiwanese Pronunciation Exceptions
I found it necessary to skip items which contained the Taiwanese pronunciations ['khè', 'goá', 'khàu', 'ô', 'yai2']. I'm not sure anything can be done about this with Dragonmapper. For example, dragonmapper.hanzi.to_zhuyin('goá') results in a 'ValueError: Not a valid syllable: o5'.
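A sketch of the skip filter described above (my own workaround code, not part of dragonmapper):

```python
# Taiwanese readings from CEDICT that dragonmapper cannot parse; skipping
# entries whose pinyin field contains one of them avoids the ValueError.
TAIWANESE_EXCEPTIONS = {'khè', 'goá', 'khàu', 'ô', 'yai2'}

def has_taiwanese_reading(pinyin_field):
    """True if any space-separated token is a known unparseable reading."""
    return any(tok in TAIWANESE_EXCEPTIONS for tok in pinyin_field.split())

print(has_taiwanese_reading('goá'))      # True
print(has_taiwanese_reading('fēn zhī'))  # False
```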