Words do not get divided properly when small letters (捨て仮名) are included in a word

rolzy commented 5 years ago

Version: 0.996
OS: Ubuntu 18.04 (Windows Subsystem for Linux)

Hello,

I have found some cases where a run of hiragana words is analyzed as a single word (probably an unknown word) when small letters (捨て仮名) are included in it.

Example:

$ mecab
出ておりますでしょうか
出      動詞,自立,*,*,一段,連用形,出る,デ,デ
て      助詞,接続助詞,*,*,*,*,て,テ,テ
おり    動詞,非自立,*,*,五段・ラ行,連用形,おる,オリ,オリ
ます    助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
でしょ  助動詞,*,*,*,特殊・デス,未然形,です,デショ,デショ
う      助動詞,*,*,*,不変化型,基本形,う,ウ,ウ
か      助詞,副助詞／並立助詞／終助詞,*,*,*,*,か,カ,カ
EOS

出ておりますでしょうかあっ
出      動詞,自立,*,*,一段,連用形,出る,デ,デ
て      助詞,接続助詞,*,*,*,*,て,テ,テ
おり    動詞,非自立,*,*,五段・ラ行,連用形,おる,オリ,オリ
ます    助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
でしょ  助動詞,*,*,*,特殊・デス,未然形,です,デショ,デショ
う      助動詞,*,*,*,不変化型,基本形,う,ウ,ウ
か      助詞,副助詞／並立助詞／終助詞,*,*,*,*,か,カ,カ
あっ    感動詞,*,*,*,*,*,あっ,アッ,アッ
EOS

出ておりますでしょうかあっはいはいはい
出      動詞,自立,*,*,一段,連用形,出る,デ,デ
て      助詞,接続助詞,*,*,*,*,て,テ,テ
おり    動詞,非自立,*,*,五段・ラ行,連用形,おる,オリ,オリ
ます    助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
でし    助動詞,*,*,*,特殊・デス,連用形,です,デシ,デシ
ょうかあっはいはいはい  名詞,一般,*,*,*,*,*
EOS

出ておりますでしょうかあっはいはいはいじゃっすいませんちょっとお待ちいただけたら
出      動詞,自立,*,*,一段,連用形,出る,デ,デ
て      助詞,接続助詞,*,*,*,*,て,テ,テ
おり    動詞,非自立,*,*,五段・ラ行,連用形,おる,オリ,オリ
ます    助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
でし    助動詞,*,*,*,特殊・デス,連用形,です,デシ,デシ
ょうかあっはいはいはいじゃっすいませんちょっとお        名詞,一般,*,*,*,*,*
待ち    名詞,接尾,一般,*,*,*,待ち,マチ,マチ
いただけ        動詞,自立,*,*,一段,連用形,いただける,イタダケ,イタダケ
たら    助動詞,*,*,*,特殊・タ,仮定形,た,タラ,タラ
EOS

With only a few characters after 「ょ」, mecab still divides the text into 「でしょ」 and the rest, which is what we want.

However, when too many hiragana characters follow 「ょ」, mecab lumps everything from 「ょ」 up to the next kanji character into a single token.
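
In the outputs above, the merged token is the only one whose feature string has 7 comma-separated fields instead of 9 (no base form or reading). Assuming that holds for unknown words in general with this dictionary, which is an inference from these examples and not something confirmed in the thread, a minimal Python sketch for flagging such tokens in saved mecab output might look like this. `find_unknown_tokens` is a hypothetical helper, not part of MeCab.

```python
def find_unknown_tokens(mecab_output, min_length=1):
    """Return the surfaces of tokens that look like unknown words.

    Heuristic (an assumption drawn from the examples in this issue):
    a token whose feature string has fewer than 9 comma-separated
    fields was not found in the dictionary.
    """
    unknown = []
    for line in mecab_output.splitlines():
        line = line.strip()
        if not line or line == "EOS":
            continue
        # Surface and features are separated by whitespace (a tab in raw output).
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # e.g. an echoed input line with no feature column
        surface, feature = parts
        if len(feature.split(",")) < 9 and len(surface) >= min_length:
            unknown.append(surface)
    return unknown


sample = (
    "ます\t助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス\n"
    "でし\t助動詞,*,*,*,特殊・デス,連用形,です,デシ,デシ\n"
    "ょうかあっはいはいはい\t名詞,一般,*,*,*,*,*\n"
    "EOS\n"
)
print(find_unknown_tokens(sample))
```

Raising `min_length` limits the report to long merged runs like the ones shown here, which may be more useful than listing every short unknown token.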

I couldn't find any report on this so far. Apologies if it is a duplicate. Is there a way to suppress this behavior?

Thanks!

rolzy commented 5 years ago

I have found another example. This time the 捨て仮名 is not at the start of the merged token, but is still included in it:

ありがとうございますなんかはそうですね
ありがとう      感動詞,*,*,*,*,*,ありがとう,アリガトウ,アリガトー
ござい  助動詞,*,*,*,五段・ラ行特殊,連用形,ござる,ゴザイ,ゴザイ
ます    助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
なんか  助詞,副助詞,*,*,*,*,なんか,ナンカ,ナンカ
は      助詞,係助詞,*,*,*,*,は,ハ,ワ
そうですね      フィラー,*,*,*,*,*,そうですね,ソウデスネ,ソーデスネ
EOS
ありがとうございますなんかはそうですねちょっと
ありがとう      感動詞,*,*,*,*,*,ありがとう,アリガトウ,アリガトー
ござい  助動詞,*,*,*,五段・ラ行特殊,連用形,ござる,ゴザイ,ゴザイ
ます    助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
なんか  助詞,副助詞,*,*,*,*,なんか,ナンカ,ナンカ
は      助詞,係助詞,*,*,*,*,は,ハ,ワ
そうですね      フィラー,*,*,*,*,*,そうですね,ソウデスネ,ソーデスネ
ちょっと        副詞,助詞類接続,*,*,*,*,ちょっと,チョット,チョット
EOS
ありがとうございますなんかはそうですねちょっとやっぱ
ありがとう      感動詞,*,*,*,*,*,ありがとう,アリガトウ,アリガトー
ご      接頭詞,名詞接続,*,*,*,*,ご,ゴ,ゴ
ざいますなんかはそうですねちょっとやっぱ        名詞,一般,*,*,*,*,*
EOS

polm commented 5 years ago

Hey, this is not a bug but an example of the limitations of MeCab. Depending on the settings for unknown words, a sequence of words can be merged into a single UNK (未知語, unknown word) instead of being split. This is particularly easy to trigger if the text you're analyzing isn't similar to the training data for the dictionary (mostly newspaper-article type stuff). That's because MeCab looks not only at whether words exist in the dictionary, but also at the transitions between parts of speech when calculating the best place to split words.
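
To make the cost trade-off concrete, here is a toy sketch in Python. It is not MeCab's implementation: the lexicon, the part-of-speech labels, and every cost value are made up for illustration, and the flat per-token unknown cost is an assumed simplification. It only shows how a minimum-cost search over word costs plus transition costs can prefer one long unknown token once a run of poorly connecting words gets long enough.

```python
from functools import lru_cache

# Hypothetical toy lexicon: surface -> (part of speech, word cost).
LEXICON = {
    "はい": ("interjection", 5),
    "あっ": ("interjection", 5),
    "か": ("particle", 1),
}

# Hypothetical transition costs between adjacent parts of speech.
# Interjection -> interjection is made expensive here, as it might be
# in a model trained mostly on newspaper text.
CONNECTION = {
    ("particle", "interjection"): 4,
    ("interjection", "interjection"): 6,
}
DEFAULT_CONNECTION = 2
UNKNOWN_COST = 12  # assumed flat cost for one unknown token of any length


def segment(text):
    """Return (total cost, tokens) for the cheapest segmentation of text."""

    @lru_cache(maxsize=None)
    def best(i, prev_pos):
        if i == len(text):
            return 0, ()
        candidates = []
        for j in range(i + 1, len(text) + 1):
            surface = text[i:j]
            # Any span may be an unknown token; dictionary spans get a second option.
            options = [("unknown", UNKNOWN_COST)]
            if surface in LEXICON:
                options.append(LEXICON[surface])
            for pos, word_cost in options:
                conn = CONNECTION.get((prev_pos, pos), DEFAULT_CONNECTION)
                rest_cost, rest_tokens = best(j, pos)
                total = conn + word_cost + rest_cost
                candidates.append((total, ((surface, pos),) + rest_tokens))
        return min(candidates)

    return best(0, "BOS")


print(segment("かはい"))          # short input: two dictionary words, cost 12
print(segment("かはいはいはい"))  # longer run: one unknown token, cost 14
```

With these made-up numbers, splitting the longer input into dictionary words costs 34, so the single unknown token wins. Changing the costs changes where the switch happens, which is roughly what the dictionary's unknown-word settings control.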

In particular, MeCab just treats sutegana the same as ordinary kana. You can see this in the character class definitions in the char.def file distributed with ipadic, which is the dictionary you appear to be using.
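
As a quick check of why this happens, the Python sketch below shows that small hiragana sit inside the same Unicode Hiragana block (U+3040 to U+309F) as the full-size letters. Assuming char.def assigns categories by code point range, a single range covering hiragana would include them unless they are listed separately. `small_kana_positions` is a hypothetical helper written for this illustration.

```python
import unicodedata

HIRAGANA_BLOCK = range(0x3040, 0x30A0)  # Unicode Hiragana block, U+3040..U+309F
SMALL_KANA = "ぁぃぅぇぉっゃゅょゎ"


def is_small_hiragana(ch):
    """True if ch is a hiragana letter whose Unicode name marks it as SMALL."""
    name = unicodedata.name(ch, "")
    return name.startswith("HIRAGANA LETTER SMALL")


def small_kana_positions(text):
    """Return (index, character) pairs for every small hiragana in text."""
    return [(i, ch) for i, ch in enumerate(text) if is_small_hiragana(ch)]


# Every small kana falls inside the ordinary Hiragana block.
print(all(ord(ch) in HIRAGANA_BLOCK for ch in SMALL_KANA))
print(small_kana_positions("でしょうかあっ"))
```

A helper like this could also be used to locate the sutegana in an input before deciding how to pre-process it.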

You can read more about unk processing here.

rolzy commented 5 years ago

Thank you very much for your reply!

taku910/mecab, issue #53