miurahr / pykakasi

Lightweight converter from Japanese Kana-kanji sentences into Kana-Roman.
https://codeberg.org/miurahr/pykakasi
GNU General Public License v3.0

pykakasi duplicates characters when dealing with unusual unicode characters such as U+00D7 "MULTIPLICATION SIGN" #150

Closed curiousjp closed 2 years ago

curiousjp commented 3 years ago

Describe the bug

When convert() is given a string containing the Unicode character U+00D7, the results list contains an entry with empty conversions for that character, followed by a duplicate of the preceding item. The examples below show this more clearly.

To Reproduce

Steps to reproduce the behavior:

import pykakasi
kakasi = pykakasi.kakasi()
print( kakasi.convert( "三x五" ) )
[{'orig': '三', 'hira': 'さん', 'kana': 'サン', 'hepburn': 'san', 'kunrei': 'san', 'passport': 'san'}, {'orig': 'x', 'hira': 'x', 'kana': 'x', 'hepburn': 'x', 'kunrei': 'x', 'passport': 'x'}, {'orig': '五', 'hira': 'ご', 'kana': 'ゴ', 'hepburn': 'go', 'kunrei': 'go', 'passport': 'go'}]
print( len( kakasi.convert( "三x五" ) ) )
3

print( kakasi.convert( "三×五" ) )
[{'orig': '三', 'hira': 'さん', 'kana': 'サン', 'hepburn': 'san', 'kunrei': 'san', 'passport': 'san'}, {'orig': '×', 'hira': '', 'kana': '', 'hepburn': '', 'kunrei': '', 'passport': ''}, {'orig': '三', 'hira': 'さん', 'kana': 'サン', 'hepburn': 'san', 'kunrei': 'san', 'passport': 'san'}, {'orig': '五', 'hira': 'ご', 'kana': 'ゴ', 'hepburn': 'go', 'kunrei': 'go', 'passport': 'go'}]
print( len( kakasi.convert( "三×五" ) ) )
4

Expected behavior

print( kakasi.convert( "三×五" ) )
[{'orig': '三', 'hira': 'さん', 'kana': 'サン', 'hepburn': 'san', 'kunrei': 'san', 'passport': 'san'}, {'orig': '×', 'hira': '×', 'kana': '×', 'hepburn': '×', 'kunrei': '×', 'passport': '×'}, {'orig': '五', 'hira': 'ご', 'kana': 'ゴ', 'hepburn': 'go', 'kunrei': 'go', 'passport': 'go'}]
print( len( kakasi.convert( "三×五" ) ) )
3
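A minimal sketch of a regression check for this kind of duplication. It assumes that joining the 'orig' fields of the result should rebuild the input string, which holds for the examples in this report but is not a documented guarantee of pykakasi. The helper name find_orig_mismatch is hypothetical, and the data is copied from the outputs above (only the 'orig' fields are kept) so the check runs without pykakasi installed.

```python
def find_orig_mismatch(text, result):
    """Return None if the 'orig' fields rebuild `text`, else the rebuilt string.

    Assumption: concatenating every item's 'orig' should reproduce the input.
    """
    rebuilt = "".join(item["orig"] for item in result)
    return None if rebuilt == text else rebuilt


# 'orig' fields from the reported (buggy) output for "三×五".
buggy = [{"orig": "三"}, {"orig": "×"}, {"orig": "三"}, {"orig": "五"}]
# 'orig' fields from the expected output for the same input.
expected = [{"orig": "三"}, {"orig": "×"}, {"orig": "五"}]

print(find_orig_mismatch("三×五", buggy))     # 三×三五
print(find_orig_mismatch("三×五", expected))  # None
```

With pykakasi installed, the same helper could be called as find_orig_mismatch(text, kakasi.convert(text)) for each test string.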

Environment (please complete the following information):

Additional context

The problem was discovered while parsing ebook titles.

miurahr commented 2 years ago

#152 adds support for Latin-1 characters.

Kakasi was originally developed in Japan around the EUC_JP encoding, so Latin-1 characters were essentially out of scope.

miurahr commented 2 years ago

A fix is released as v2.3.0b1.

zhangfeiran commented 2 years ago

I installed v2.3.0b, but the duplication bug still exists for some other Unicode characters:

import pykakasi
kakasi = pykakasi.kakasi()
print(kakasi.convert('三'))
# [{'orig': '三', 'hira': 'さん', 'kana': 'サン', 'hepburn': 'san', 'kunrei': 'san', 'passport': 'san'}]
print(kakasi.convert('三。'))
# [{'orig': '三', 'hira': 'さん', 'kana': 'サン', 'hepburn': 'san', 'kunrei': 'san', 'passport': 'san'}, {'orig': '三', 'hira': 'さん', 'kana': 'サン', 'hepburn': 'san', 'kunrei': 'san', 'passport': 'san'}]
print(kakasi.convert('「三」'))
# [{'orig': '三', 'hira': 'さん', 'kana': 'サン', 'hepburn': 'san', 'kunrei': 'san', 'passport': 'san'}, {'orig': '三', 'hira': 'さん', 'kana': 'サン', 'hepburn': 'san', 'kunrei': 'san', 'passport': 'san'}]

。 「 」 are U+FF61, U+FF62, and U+FF63, respectively.
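Because the halfwidth forms (U+FF61 to U+FF63) and the fullwidth forms (U+3002, U+300C, U+300D) look alike and can be swapped by copy-paste or normalization, it may help to confirm which code points a test string really contains before reporting. The following is a small sketch using only the standard library; the helper name describe is hypothetical, and the strings are written with escapes so the code points are unambiguous.

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


halfwidth = "\uff61\uff62\uff63"  # the code points named in the comment above
fullwidth = "\u3002\u300c\u300d"  # the similar-looking fullwidth punctuation

for code_point, name in describe(halfwidth + fullwidth + "\u00d7"):
    print(code_point, name)
```

Running describe() on the exact string passed to convert() shows whether a failing input uses the halfwidth or the fullwidth characters.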