neologd / mecab-ipadic-neologd

Neologism dictionary based on the language resources on the Web for mecab-ipadic

e-mail and URL tokenization #60

Closed lautel closed 4 years ago

lautel commented 5 years ago

Motivation and Goal

Instead of breaking an e-mail address or URL into several pieces, it would be useful to have an option that identifies e-mail addresses and URLs as single tokens. The example below compares the current behavior with the suggested one.

Sample code

import MeCab
mecab = MeCab.Tagger("-Ochasen -d /usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd")

text = "中川さんのメールはnakagawa@xxxx.co.jpです"
print(mecab.parse(text))

Output

中川    ナカガワ    中川    名詞-固有名詞-人名-姓
さん    サン    さん    名詞-接尾-人名
の      ノ      の      助詞-連体化
メール   メール   メール   名詞-サ変接続
は      ハ      は      助詞-係助詞
nakagawa        nakagawa        nakagawa        名詞-固有名詞-組織
@       @       @       記号-一般
xxxx    イエナイ    XXXX    名詞-固有名詞-一般
.       .       .       記号-一般
co.jp   シーオージェイピー co.jp   名詞-固有名詞-一般
です    デス    です    助動詞  特殊・デ 基本形
EOS

Desirable output

中川    ナカガワ    中川    名詞-固有名詞-人名-姓
さん    サン    さん    名詞-接尾-人名
の      ノ      の      助詞-連体化
メール   メール   メール   名詞-サ変接続
は      ハ      は      助詞-係助詞
nakagawa@xxxx.co.jp        [...]
です    デス    です    助動詞  特殊・デ 基本形
EOS

neologd commented 4 years ago

Thank you for the practical request.

In short, we think the problem you pointed out is best solved by pre- or post-processing rather than inside the morphological analysis itself, because the splitting behavior is easier to control that way.

There are too many possible notations for e-mail addresses and URLs, so not all patterns can be registered in the dictionary while it is being developed.

If you want to solve the problem by preprocessing, a convenient approach is to replace each e-mail address or URL with a placeholder (an unknown word) before parsing and to restore the original string after parsing.
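
The replace-and-restore idea can be sketched as follows. This is only a minimal illustration under several assumptions: the names `EMAIL_OR_URL`, `PLACEHOLDER`, `mask` and `unmask` are hypothetical, the regular expression is deliberately simple and does not cover every valid address or URL, and the placeholder is assumed to come out of the analyzer as exactly one token and not to occur in the input text (both should be checked against the dictionary actually in use).

```python
import re

# Deliberately simple patterns (an assumption of this sketch):
# real e-mail and URL syntax is much broader than this.
EMAIL_OR_URL = re.compile(
    r"https?://[!-~]+"                                       # URL: run of printable ASCII
    r"|[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+"  # e-mail address
)

# Hypothetical placeholder; it must survive analysis as ONE token.
PLACEHOLDER = "ZZMASKZZ"


def mask(text):
    """Replace every e-mail address / URL with PLACEHOLDER.

    Returns the masked text and the original strings in order of appearance.
    """
    originals = []

    def _replace(match):
        originals.append(match.group(0))
        return PLACEHOLDER

    return EMAIL_OR_URL.sub(_replace, text), originals


def unmask(surfaces, originals):
    """Put the original strings back, in order, wherever a token is PLACEHOLDER."""
    remaining = iter(originals)
    return [next(remaining) if s == PLACEHOLDER else s for s in surfaces]
```

With the tagger from the sample code, one would pass the masked text to `mecab.parse`, take the surface form from each output line, and hand that list to `unmask`. If the placeholder turns out to be split by the dictionary, a different placeholder string has to be chosen.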

If you prefer post-processing, the tokens in the morphological analysis results should be chained back together, for example with a CRF.
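
As a much simpler stand-in for a CRF, the chaining step can also be approximated with a rule: re-join the token surfaces, find e-mail addresses with a regular expression, and merge the tokens that fall inside each match. The sketch below is not a CRF and makes several assumptions: `EMAIL` and `merge_email_tokens` are hypothetical names, the pattern is simplified, the surfaces are assumed to concatenate back to the original text (whitespace dropped by the analyzer would break this), and only surfaces are merged, so the other feature columns would need separate handling.

```python
import re

# Simplified e-mail pattern (an assumption of this sketch).
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def merge_email_tokens(surfaces):
    """Merge consecutive token surfaces that lie inside one e-mail match."""
    text = "".join(surfaces)
    spans = [m.span() for m in EMAIL.finditer(text)]

    merged = []
    last_span = None  # match span that the previous token belonged to, if any
    pos = 0           # character offset of the current token in `text`
    for surface in surfaces:
        start, end = pos, pos + len(surface)
        pos = end
        # A token belongs to a span only if it lies completely inside it;
        # tokens straddling a match boundary are left untouched.
        span = next((sp for sp in spans if sp[0] <= start and end <= sp[1]), None)
        if span is not None and span == last_span:
            merged[-1] += surface
        else:
            merged.append(surface)
        last_span = span
    return merged
```

Applied to the surfaces from the output above (`中川`, `さん`, `の`, `メール`, `は`, `nakagawa`, `@`, `xxxx`, `.`, `co.jp`, `です`), the five tokens that make up the address are joined into one.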

These solutions are very common and are covered in many natural language processing textbooks.

If you don't need phonetic characters for Japanese processing, you might want to use Juman++ or Nagisa.

Thank you very much.