Open qip opened 3 years ago
After digging into it a little, this looks like a combined kuroshiro / mecab analyzer / ipadic issue: ユニ・チャーム itself doesn't need to be converted, but kuroshiro sends it to the analyzer anyway, and ipadic returns ユニチャーム (without the ・) as its reading (check the ipadic CSVs for more examples):
$ echo "ユニ・チャーム" | mecab
ユニ・チャーム 名詞,固有名詞,組織,*,*,*,ユニ・チャーム,ユニチャーム,ユニチャーム
EOS
As a result, after analyzer.parse() and patchToken(), the token ends up like this:
[
  {
    surface_form: 'ユニ・チャーム',
    pos: '名詞',
    pos_detail_1: '固有名詞',
    pos_detail_2: '組織',
    pos_detail_3: '*',
    conjugated_type: '*',
    conjugated_form: '*',
    basic_form: 'ユニ・チャーム',
    reading: 'ユニチャーム',
    pronunciation: 'ユニチャーム'
  }
]
Meanwhile, core.js processes hiragana and katakana tokens like this:
for (let c2 = 0; c2 < tokens[i].surface_form.length; c2++) {
  notations.push([tokens[i].surface_form[c2], 2, toRawHiragana(tokens[i].reading[c2]), (tokens[i].pronunciation && tokens[i].pronunciation[c2]) || tokens[i].reading[c2]]);
}
The issue is that the token above has a reading that is shorter than its surface_form, so the loop fails on its last iteration: tokens[i].reading[c2] is undefined at that index, and toRawHiragana() can't handle undefined.
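To make the failure concrete, here is a minimal standalone sketch of the loop run against the token shown earlier. It assumes the unguarded toRawHiragana() spreads its argument the same way the patched version does, and that KATAKANA_HIRAGANA_SHIFT is the code point offset between katakana and hiragana. buildNotations() is a hypothetical wrapper used only for illustration; it is not a kuroshiro function.

```javascript
// Assumption: offset from a katakana code point to its hiragana counterpart.
const KATAKANA_HIRAGANA_SHIFT = "\u3041".charCodeAt(0) - "\u30a1".charCodeAt(0);

// Assumption: the unguarded converter, i.e. the patched version minus the input check.
const toRawHiragana = (str) =>
  [...str]
    .map((ch) =>
      ch > "\u30a0" && ch < "\u30f7"
        ? String.fromCharCode(ch.charCodeAt(0) + KATAKANA_HIRAGANA_SHIFT)
        : ch
    )
    .join("");

// Only the fields the loop reads, copied from the token in this report.
const token = {
  surface_form: "ユニ・チャーム",
  reading: "ユニチャーム",
  pronunciation: "ユニチャーム",
};

// Hypothetical wrapper around the loop quoted from core.js.
function buildNotations(tok) {
  const notations = [];
  for (let c2 = 0; c2 < tok.surface_form.length; c2++) {
    notations.push([
      tok.surface_form[c2],
      2,
      toRawHiragana(tok.reading[c2]),
      (tok.pronunciation && tok.pronunciation[c2]) || tok.reading[c2],
    ]);
  }
  return notations;
}

let failure = null;
try {
  buildNotations(token);
} catch (err) {
  failure = err; // spreading undefined throws on the last index
}

console.log(token.surface_form.length, token.reading.length); // 7 6
console.log(failure instanceof TypeError); // true
```

In this sketch the mismatch also means that every character after ・ is paired with the reading character one position later, so guarding against undefined avoids the crash but does not realign the readings.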
A quick and dirty fix is to make toRawHiragana() check its input first:
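const toRawHiragana = function (str) {
  // Guard: reading[c2] is undefined when reading is shorter than surface_form.
  if (!str) return '';
  return [...str].map((ch) => {
    // Shift katakana code points into the hiragana block; leave everything else as-is.
    if (ch > "\u30a0" && ch < "\u30f7") {
      return String.fromCharCode(ch.charCodeAt(0) + KATAKANA_HIRAGANA_SHIFT);
    }
    return ch;
  }).join("");
};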
Code above throws an error complaining about converting undefined to hiragana, with mecab ipadic(-neologd):