taishi-i / nagisa

A Japanese tokenizer based on recurrent neural networks
https://huggingface.co/spaces/taishi-i/nagisa-demo
MIT License

What's the tagset used by Nagisa's POS tagger? #7

Closed by BLKSerene 5 years ago

BLKSerene commented 5 years ago

Could you please list the tagset used by Nagisa's POS tagger? I'm asking because I'm trying to convert Japanese POS tags to universal POS tags.

taishi-i commented 5 years ago

Hi @BLKSerene

Nagisa uses UniDic's POS tags (https://directory.fsf.org/wiki/Unidic-mecab). You can list Nagisa's POS tags with the following code.

import nagisa
print(nagisa.tagger.postags)
#=> ['動詞', '空白', '記号', '副詞', '接尾辞', 'ローマ字文', '接続詞', '漢文', 'oov', '接頭辞', '助詞', '英単語', '連体詞', '助動詞', '形容詞', '未知語', '名詞', 'URL', '補助記号', '言いよどみ', '代名詞', 'web誤脱', '感動詞', '形状詞']

# English translations of Nagisa's POS tags
ja2en = {
 '動詞': "verb",
 '空白': "whitespace",
 '記号': "symbol",
 '副詞': "adverb",
 '接尾辞': "suffix",
 'ローマ字文': "latin_alphabet",
 '接続詞': "conjunction",
 '漢文': "chinese_writing",
 'oov': "unknown_words",
 '接頭辞': "prefix",
 '助詞': "particle",
 '英単語': "english_word",
 '連体詞': "adnominal",
 '助動詞': "auxiliary_verb",
 '形容詞': "adjective",
 '未知語': "unknown_words",
 '名詞': "noun",
 'URL': "url",
 '補助記号': "supplementary_symbol",
 '言いよどみ': "hesitation",
 '代名詞': "pronoun",
 'web誤脱': "errors_omissions",
 '感動詞': "interjection",
 '形状詞': "adjectival_noun"
 }

If you want to convert Japanese POS tags to universal POS tags, please refer to the official English translations of UniDic POS tags: https://gist.github.com/masayu-a/e3eee0637c07d4019ec9
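
As a rough starting point for that conversion, here is a minimal sketch of a lookup table from Nagisa's top-level tags to Universal POS tags. Note that the `JA2UPOS` table and the `to_upos` helper are hypothetical names written for illustration, not part of Nagisa, and the mapping itself is a coarse hand-written assumption: a faithful conversion depends on UniDic's finer subcategories and on context (for example, 助詞 can correspond to ADP, SCONJ, or PART). Tags without an obvious counterpart fall back to X.

```python
# Coarse, illustrative mapping from Nagisa's top-level POS tags to
# Universal POS (UPOS) tags. ASSUMPTION: this table is a hand-written
# approximation, not an official or complete conversion.
JA2UPOS = {
    "動詞": "VERB",
    "名詞": "NOUN",
    "代名詞": "PRON",
    "副詞": "ADV",
    "形容詞": "ADJ",
    "形状詞": "ADJ",
    "連体詞": "DET",
    "助詞": "ADP",
    "助動詞": "AUX",
    "接続詞": "CCONJ",
    "感動詞": "INTJ",
    "補助記号": "PUNCT",
    "記号": "SYM",
}


def to_upos(postags, fallback="X"):
    """Map a list of Nagisa POS tags to UPOS tags.

    Any tag missing from JA2UPOS (e.g. 'oov', 'URL', '接頭辞')
    is mapped to `fallback`.
    """
    return [JA2UPOS.get(tag, fallback) for tag in postags]


example_tags = ["名詞", "助詞", "動詞", "助動詞", "補助記号", "oov"]
print(to_upos(example_tags))
#=> ['NOUN', 'ADP', 'VERB', 'AUX', 'PUNCT', 'X']
```

With Nagisa installed, the same helper should work on real output, e.g. `to_upos(nagisa.tagging(text).postags)`.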

BLKSerene commented 5 years ago

Thanks a lot for the useful information!