polm / fugashi

A Cython MeCab wrapper for fast, pythonic Japanese tokenization and morphological analysis.
MIT License

Korean Support #4

Closed by polm 4 years ago

polm commented 4 years ago

It'd be nice to support Korean. A simple way to do this would be to subclass the tagger as a KoreanTagger and override the field names, or to allow the field names to be passed in at creation time.

The tagspec for mecab-ko-dict is here. Version 2.0 seems to be the most recent one, so I guess it makes sense to support that.

Field names and meanings, based on Google Translate:

| Original | English |
| --- | --- |
| 품사 태그 | part of speech tag |
| 의미 부류 | meaning type |
| 종성 유무 | patchim presence (T or F) |
| 읽기 | reading (pronunciation, for hanja?) |
| 타입 | type (*/Inflected/Compound/Preanalysis) |
| 첫번째 품사 | first pos (for compounds?) |
| 마지막 품사 | last pos |
| 표현 | notes(?) (seems to specify composition of compounds, uses / as delimiter) |
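
As a rough sketch of the "pass the fields in at creation time" idea, something like the following could map a feature string onto the eight columns listed above. This is plain Python and does not use fugashi's actual API; `KO_FIELDS`, `make_feature_parser`, and the English identifiers are hypothetical names chosen for illustration, and the sample row is hand-written rather than real MeCab output. It also assumes that missing trailing fields can be padded with `*`.

```python
from collections import namedtuple

# Hypothetical English identifiers for the eight mecab-ko-dict columns,
# in the same order as the list above. Not names taken from fugashi.
KO_FIELDS = (
    "pos",             # 품사 태그
    "semantic_class",  # 의미 부류
    "has_jongseong",   # 종성 유무 (T or F)
    "reading",         # 읽기
    "type",            # 타입
    "first_pos",       # 첫번째 품사
    "last_pos",        # 마지막 품사
    "expression",      # 표현
)


def make_feature_parser(name, fields):
    """Build a parser that maps a comma-separated feature string to named fields."""
    Features = namedtuple(name, fields)

    def parse(feature_string):
        # Naive split: this sketch ignores quoted commas inside a field.
        values = feature_string.split(",")
        # Assumption: pad short rows with "*" so every field is present.
        values += ["*"] * (len(fields) - len(values))
        return Features(*values[: len(fields)])

    return parse


# The field list is supplied at creation time instead of being hard-coded.
parse_ko = make_feature_parser("KoreanFeatures", KO_FIELDS)

# Hand-written sample row for illustration only.
feat = parse_ko("NNG,*,T,*,*,*,*,*")
print(feat.pos, feat.has_jongseong)
```

Supplying a different tuple of names would cover another dictionary format with the same parser, which is the main appeal of this approach over a dedicated subclass.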

For Korean, a fork of MeCab is used, and it looks like one difference is how whitespace is handled. I'm not sure if fugashi will just work with it, but since natto-py seems to work, there should be a way to support it.

polm commented 4 years ago

Korean support has been in since 0.1.8, but it needs more testing. If anyone could take a look at it and make sure it's OK, that'd be much appreciated.

polm commented 4 years ago

It's not clear anyone has used the Korean support and I still don't have a good way of testing it. Since it turns out there's a well-maintained Korean-specific NLP library, KoNLPy, that wraps MeCab, I'm going to remove Korean support from fugashi for now. If anyone has a need for it I can try to add it back in later.

One other thing to note is that mecab-ko makes some Korean-specific changes to MeCab's internal scoring algorithm, so it doesn't work with fugashi wheels anyway.