hexenq / kuroshiro

Japanese language library for converting Japanese sentence to Hiragana, Katakana or Romaji with furigana and okurigana modes supported.
https://kuroshiro.org
MIT License

Error on edge case of mecab ipadic hiragana conversion #80

Open qip opened 3 years ago

qip commented 3 years ago
const result = await kuroshiro.convert("ユニ・チャーム、シリーズ最軽量の「超快適マスク SMART COLOR」", { to: "hiragana" });

The code above throws an error about converting undefined to hiragana when run with the mecab analyzer and ipadic(-neologd):

/home/user/mecab/node_modules/kuroshiro/lib/util.js:7
function _toConsumableArray(arr) { if (Array.isArray(arr)) { for (var i = 0, arr2 = Array(arr.length); i < arr.length; i++) { arr2[i] = arr[i]; } return arr2; } else { return Array.from(arr); } }
                                                                                                                                                                                     ^

TypeError: undefined is not iterable (cannot read property Symbol(Symbol.iterator))
    at Function.from (<anonymous>)
    at _toConsumableArray (/home/user/mecab/node_modules/kuroshiro/lib/util.js:7:182)
    at toRawHiragana (/home/user/mecab/node_modules/kuroshiro/lib/util.js:142:22)
    at Kuroshiro._callee2$ (/home/user/mecab/node_modules/kuroshiro/lib/core.js:341:108)
    at tryCatch (/home/user/mecab/node_modules/regenerator-runtime/runtime.js:62:40)
    at Generator.invoke [as _invoke] (/home/user/mecab/node_modules/regenerator-runtime/runtime.js:296:22)
    at Generator.prototype.<computed> [as next] (/home/user/mecab/node_modules/regenerator-runtime/runtime.js:114:21)
    at step (/home/user/mecab/node_modules/kuroshiro/lib/core.js:19:191)
    at /home/user/mecab/node_modules/kuroshiro/lib/core.js:19:361

qip commented 3 years ago

After digging into it a bit, this looks like a combined kuroshiro / mecab analyzer / ipadic issue. ユニ・チャーム itself doesn't need to be converted, but kuroshiro sends it to the analyzer anyway, and ipadic returns ユニチャーム (without the ・) as its reading (check the ipadic CSVs for more examples):

$ echo "ユニ・チャーム" | mecab
ユニ・チャーム  名詞,固有名詞,組織,*,*,*,ユニ・チャーム,ユニチャーム,ユニチャーム
EOS

As a result, after analyzer.parse() and patchToken(), the tokens end up like this:

[
  {
    surface_form: 'ユニ・チャーム',
    pos: '名詞',
    pos_detail_1: '固有名詞',
    pos_detail_2: '組織',
    pos_detail_3: '*',
    conjugated_type: '*',
    conjugated_form: '*',
    basic_form: 'ユニ・チャーム',
    reading: 'ユニチャーム',
    pronunciation: 'ユニチャーム'
  }
]

Meanwhile, core.js processes hiragana and katakana character by character, like this:

for (let c2 = 0; c2 < tokens[i].surface_form.length; c2++) {
    notations.push([tokens[i].surface_form[c2], 2, toRawHiragana(tokens[i].reading[c2]), (tokens[i].pronunciation && tokens[i].pronunciation[c2]) || tokens[i].reading[c2]]);
}

The problem is that this token's reading is shorter than its surface_form. The loop iterates over surface_form but indexes into reading with the same counter, so at the last character of surface_form, tokens[i].reading[c2] is undefined, which toRawHiragana() doesn't handle.
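
To make the mismatch concrete, here is a minimal standalone sketch that does not use kuroshiro at all. It copies only the surface_form and reading values from the token above and applies the same indexing pattern as the core.js loop; the `token` and `missing` names are just for illustration.

```javascript
// Values copied from the analyzer output shown earlier in this issue.
const token = {
    surface_form: "ユニ・チャーム",
    reading: "ユニチャーム",
};

// Same indexing pattern as the core.js loop:
// iterate over surface_form, but index into reading with the same counter.
const missing = [];
for (let c2 = 0; c2 < token.surface_form.length; c2++) {
    if (token.reading[c2] === undefined) {
        missing.push(c2);
    }
}

// surface_form has one more character (the ・) than reading,
// so the last index has no counterpart in reading.
console.log(token.surface_form.length, token.reading.length, missing);
```

Any index collected in `missing` is a position where toRawHiragana() would receive undefined.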

A quick and dirty fix is to make toRawHiragana() check its input first:

const toRawHiragana = function (str) {
    if (!str) return '';
    return [...str].map((ch) => {
        if (ch > "\u30a0" && ch < "\u30f7") {
            return String.fromCharCode(ch.charCodeAt(0) + KATAKANA_HIRAGANA_SHIFT);
        }
        return ch;
    }).join("");
};
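
For anyone who wants to try the guard outside of kuroshiro, below is a self-contained sketch of the same function. The value of KATAKANA_HIRAGANA_SHIFT is my assumption (the code point distance between the katakana and hiragana blocks), not copied from util.js, and the loop at the bottom only imitates the core.js indexing pattern with the values from this issue.

```javascript
// Assumption: offset from the katakana block to the hiragana block (-0x60).
const KATAKANA_HIRAGANA_SHIFT = "\u3041".charCodeAt(0) - "\u30a1".charCodeAt(0);

const toRawHiragana = function (str) {
    // Guard: undefined, null and "" all map to an empty string instead of throwing.
    if (!str) return "";
    return [...str].map((ch) => {
        if (ch > "\u30a0" && ch < "\u30f7") {
            return String.fromCharCode(ch.charCodeAt(0) + KATAKANA_HIRAGANA_SHIFT);
        }
        return ch;
    }).join("");
};

// Imitation of the per-character loop, using the values from this issue.
const surface = "ユニ・チャーム";
const reading = "ユニチャーム";
const perChar = [];
for (let c2 = 0; c2 < surface.length; c2++) {
    // At the last index, reading[c2] is undefined and the guard returns "".
    perChar.push(toRawHiragana(reading[c2]));
}

console.log(perChar);
```

Note that this only prevents the crash. From index 2 onward each surface character is still paired with a reading character from a different position (・ is paired with チ, and so on), so a proper fix probably needs to handle tokens whose reading and surface_form lengths differ.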