SpongebobSquamirez commented 6 years ago

Could you include some notes briefly comparing this to other parsers like MeCab? MeCab includes a comparison to other tokenizers/parsers. I think users would greatly benefit from knowing things like parsing speed comparisons, accuracy, and other slight differences/nuances/use cases.

taishi-i commented 6 years ago

Thank you for your request. I am now preparing a comparison with several morphological analyzers (MeCab, KyTea, and Sudachi). Please wait a few days for the results.

taishi-i commented 5 years ago

I have summarized the results of a simple comparison, so please refer to them.

Test data

I used the TMU twitter corpus (https://github.com/tmu-nlp/TwitterCorpus) to evaluate Japanese word segmentation and POS tagging. This corpus is publicly available and annotated with gold word segmentation and POS tags.

Software

The TMU twitter corpus is annotated according to the Short Word Unit (SWU) (https://pj.ninjal.ac.jp/corpus_center/bccwj/en/morphology.html). I therefore needed to compare morphological analyzers that can output results in SWU. I used the following morphological analyzers for the comparison.

MeCab
- https://github.com/taku910/mecab
- https://ja.osdn.net/projects/unidic/
KyTea
- http://www.phontron.com/kytea/
SudachiPy
- https://github.com/WorksApplications/SudachiPy

Preprocessing

To make the comparison fair, I carried out the following preprocessing.

MeCab
- The input texts and output results are converted to "zenkaku" (全角) characters.
KyTea
- If the POS tag of an output word is "gobi" (語尾), the word is joined to the previous word.
SudachiPy
- Tweets that raised IndexError or UnicodeDecodeError were removed from the test data.
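
The MeCab and KyTea steps above could be sketched roughly as follows. This is only an illustration, not the script used for the experiment: the function names `to_zenkaku` and `join_gobi` are hypothetical, only ASCII characters are converted (half-width katakana are not handled), and keeping the previous word's POS tag after a merge is an assumption.

```python
# Offset between ASCII "!" (U+0021) and full-width "！" (U+FF01).
_OFFSET = 0xFF01 - 0x21

# Translation table for printable ASCII -> full-width ("zenkaku") forms.
HAN_TO_ZEN = {code: code + _OFFSET for code in range(0x21, 0x7F)}
HAN_TO_ZEN[0x20] = 0x3000  # ASCII space -> ideographic space


def to_zenkaku(text):
    """Convert printable ASCII characters in text to full-width characters."""
    return text.translate(HAN_TO_ZEN)


def join_gobi(tokens, gobi_tag="語尾"):
    """Join every token tagged as gobi to the token before it.

    tokens is a list of (surface, pos) pairs. The merged token keeps the
    POS tag of the previous word (an assumption made for this sketch).
    """
    merged = []
    for surface, pos in tokens:
        if pos == gobi_tag and merged:
            prev_surface, prev_pos = merged[-1]
            merged[-1] = (prev_surface + surface, prev_pos)
        else:
            merged.append((surface, pos))
    return merged
```

For example, `join_gobi([("食べ", "動詞"), ("る", "語尾")])` gives a single token `("食べる", "動詞")`.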

Finally, I extracted 462 tweets (9596 tokens) from the TMU corpus and used them as the test data for the experiment.

Result

As an evaluation metric, we use the balanced F-measure to evaluate the performance of word segmentation and POS tagging. Speed is the time to finish processing all tweets in the test data.
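
The exact scoring script is not shown in this thread, so the following is only a sketch of one common way to compute the balanced F-measure for both tasks. It assumes a token counts as correct for word segmentation when its character span matches a gold span, and for POS tagging when the span and the tag both match. The names `to_spans` and `f1_scores` are hypothetical. The function returns fractions in [0, 1], whereas the table reports percentages.

```python
def to_spans(tokens):
    """Convert (surface, pos) tokens into (start, end, pos) character spans."""
    spans = []
    start = 0
    for surface, pos in tokens:
        end = start + len(surface)
        spans.append((start, end, pos))
        start = end
    return spans


def f1_scores(gold_sents, pred_sents):
    """Return (segmentation F1, POS tagging F1) over parallel sentences.

    Each sentence is a list of (surface, pos) pairs.
    """
    n_gold = n_pred = seg_hit = pos_hit = 0
    for gold, pred in zip(gold_sents, pred_sents):
        gold_spans = to_spans(gold)
        pred_spans = to_spans(pred)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
        # Segmentation: only the (start, end) boundaries must match.
        seg_hit += len({s[:2] for s in gold_spans} & {s[:2] for s in pred_spans})
        # POS tagging: boundaries and tag must both match.
        pos_hit += len(set(gold_spans) & set(pred_spans))

    def f1(hit):
        if hit == 0:
            return 0.0
        precision = hit / n_pred
        recall = hit / n_gold
        return 2 * precision * recall / (precision + recall)

    return f1(seg_hit), f1(pos_hit)
```

With gold tokens 今日 / は / 晴れ and predicted tokens 今日 / は晴れ, one of two predicted spans matches one of three gold spans, so precision is 1/2, recall is 1/3, and the segmentation F1 is 0.4.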

System	Method	F1 (Word segmentation)	F1 (POS tagging)	Speed
MeCab	Word-lattice/CRF	81.07	71.61	0.031s
KyTea	Point-wise/SVM	87.57	77.79	0.284s
SudachiPy	Word-lattice/CRF+Manual Cost Adjustment	82.67	73.70	3.253s
nagisa	Sequence-labeling/BILSTM-CRF	85.92	75.10	1.695s
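
The Speed column is the time to process all tweets in the test data. A measurement of that kind could be taken with a small harness like the one below. This is a sketch only: `time_tokenizer` is a hypothetical helper, `tokenize` stands for any callable that wraps one of the analyzers, and the thread does not say whether model loading time was included, so here only the processing loop is timed.

```python
import time


def time_tokenizer(tokenize, texts):
    """Run tokenize on every text and return (results, elapsed seconds).

    Only the processing loop is timed; any model loading should happen
    before this function is called.
    """
    start = time.perf_counter()
    results = [tokenize(text) for text in texts]
    elapsed = time.perf_counter() - start
    return results, elapsed


# Placeholder tokenizer for illustration; a real run would pass a wrapper
# around MeCab, KyTea, SudachiPy, or nagisa instead of str.split.
results, elapsed = time_tokenizer(str.split, ["a b c", "d e"])
print(f"{len(results)} texts processed in {elapsed:.3f}s")
```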

F1-scores
- KyTea > nagisa > SudachiPy > MeCab
Speed
- MeCab > KyTea > nagisa > SudachiPy

Discussion

KyTea got the best F1-scores on word segmentation and POS tagging. However, the TMU corpus was annotated by correcting KyTea's output, so its F1-scores may be slightly inflated. MeCab is the fastest of these Japanese morphological analyzers, which makes it the best choice when processing large amounts of text.

nagisa is good at capturing kaomoji (e.g., (　́ω`)) compared to the other systems because it uses a character-based BILSTM. It can also be installed with a single pip install command (for use with Python, the other systems require additional installation steps). Additionally, nagisa has useful post-processing functions (e.g., extracting or filtering specific POS tags from a text).

If you have an interest in nagisa, please try it. Thank you!

KoichiYasuoka commented 5 years ago

Thank you, @taishi-i, but unidic-mecab-2.1.2 seems rather old for the comparison. Please consult a recent version of UniDic: https://unidic.ninjal.ac.jp/download#unidic_bccwj

taishi-i commented 5 years ago

Sorry for the late reply. The results below show that unidic-2.3.0 performs better than unidic-2.1.2.

System	Method	F1 (Word segmentation)	F1 (POS tagging)	Speed (s)
MeCab (unidic-mecab-2.1.2)	Word-lattice/CRF	79.96	70.83	0.034
MeCab (unidic-cwj-2.3.0)	Word-lattice/CRF	80.81	71.83	0.046
MeCab (unidic-csj-2.3.0)	Word-lattice/CRF	80.80	71.83	0.057
SudachiPy	Word-lattice/CRF+Manual Cost Adjustment	82.63	73.71	3.397
nagisa	Sequence-labeling/BILSTM-CRF	85.92	75.38	1.692
KyTea	Point-wise/SVM	87.60	77.82	0.228

The results differ slightly from the previous ones because the corpus was downloaded at a different time. Because the TMU twitter corpus is obtained through the Twitter API, the data size can change when accounts become locked or tweets are deleted.

taishi-i / nagisa

request: comparison to other tokenizers/PoS taggers #6
