Closed curiousjp closed 2 years ago
Kakasi originated in Japan and was built around the EUC_JP encoding, so Latin-1 characters are basically out of scope.
A fix has been released as v2.3.0b1.
I installed v2.3.0b, but the duplication bug still exists for some other Unicode characters:
import pykakasi
kakasi = pykakasi.kakasi()
print(kakasi.convert('三'))
# [{'orig': '三', 'hira': 'さん', 'kana': 'サン', 'hepburn': 'san', 'kunrei': 'san', 'passport': 'san'}]
print(kakasi.convert('三｡'))
# [{'orig': '三', 'hira': 'さん', 'kana': 'サン', 'hepburn': 'san', 'kunrei': 'san', 'passport': 'san'}, {'orig': '三', 'hira': 'さん', 'kana': 'サン', 'hepburn': 'san', 'kunrei': 'san', 'passport': 'san'}]
print(kakasi.convert('｢三｣'))
# [{'orig': '三', 'hira': 'さん', 'kana': 'サン', 'hepburn': 'san', 'kunrei': 'san', 'passport': 'san'}, {'orig': '三', 'hira': 'さん', 'kana': 'サン', 'hepburn': 'san', 'kunrei': 'san', 'passport': 'san'}]
｡, ｢, and ｣ are U+FF61, U+FF62, and U+FF63, coincidentally.
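Since the half-width punctuation at U+FF61 to U+FF63 looks almost identical to the full-width forms at U+3002, U+300C, and U+300D, it may help to confirm which code points a test string really contains before comparing results. A minimal sketch using only the standard library; `describe_chars` is a hypothetical helper name, not part of pykakasi:

```python
import unicodedata


def describe_chars(text):
    """Return (character, code point, Unicode name) for each character in text."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


# Escapes are used so the exact code points are unambiguous,
# regardless of how the page or editor renders them.
for row in describe_chars("\u00d7\uff61\uff62\uff63"):
    print(row)
```

Running the same helper on a string copied from a terminal or web page shows whether it was silently normalized to the full-width forms along the way.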
Describe the bug
When convert() is given a string containing the Unicode character U+00D7, the results list contains an empty dictionary for that character and then duplicates the preceding item. The examples given below might provide a clearer explanation.
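As a rough way to spot this symptom automatically, the sketch below flags adjacent identical entries and checks whether the `orig` fields add up to the input. Both helper names are hypothetical, and the sample data is the output reported in this thread for '三' followed by U+FF61; it does not call pykakasi itself. A legitimately repeated token in the input would also show up as an adjacent duplicate, so the coverage check is probably the more reliable signal.

```python
def adjacent_duplicates(results):
    """Return each index i where results[i] is identical to results[i - 1]."""
    return [i for i in range(1, len(results)) if results[i] == results[i - 1]]


def covers_input(text, results):
    """True if the 'orig' fields, joined in order, reproduce the input text."""
    return "".join(item.get("orig", "") for item in results) == text


# Output reported in this thread for the input '三' + U+FF61.
san = {'orig': '三', 'hira': 'さん', 'kana': 'サン',
       'hepburn': 'san', 'kunrei': 'san', 'passport': 'san'}
reported = [dict(san), dict(san)]
text = "三\uff61"

print(adjacent_duplicates(reported))  # [1]
print(covers_input(text, reported))   # False
```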
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Environment (please complete the following information):
Additional context
Problem discovered while parsing ebook titles.