Closed SpongebobSquamirez closed 5 years ago
Thank you for your request. Now I am preparing for comparison with some morphological analyzers (MeCab, KyTea, and Sudachi). Please wait a few days for the result.
I summarized the simple comparison result, so please refer to it.
I used the TMU Twitter corpus (https://github.com/tmu-nlp/TwitterCorpus) to evaluate Japanese word segmentation and POS tagging. This dataset is a publicly available corpus annotated with gold word segmentations and POS tags.
The TMU Twitter corpus is annotated according to the Short Unit Word (SUW) standard (https://pj.ninjal.ac.jp/corpus_center/bccwj/en/morphology.html), so I compared morphological analyzers that can output results in SUW. I used the following morphological analyzers for the comparison.
* MeCab
* KyTea
* SudachiPy
In order to make a fair comparison, the following preprocessing was carried out for each analyzer:
* MeCab
* KyTea
* SudachiPy
Finally, I extracted 462 tweets (9,596 tokens) from the TMU corpus and used them as the test data for the experiment.
As the evaluation metric, I use the balanced F-measure to evaluate the performance of word segmentation and POS tagging. Speed is the time taken to process all tweets in the test data.
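As a rough illustration, the balanced F-measure over word segmentations can be computed by treating each segmentation as a set of character spans. This is a minimal sketch, not the exact evaluation script used for the numbers below:

```python
# Word segmentation F1: compare gold and system output as sets of
# (start, end) character spans; F1 is the harmonic mean of
# precision and recall over those spans.

def to_spans(tokens):
    """Convert a token list into a set of (start, end) character spans."""
    spans, start = set(), 0
    for token in tokens:
        spans.add((start, start + len(token)))
        start += len(token)
    return spans

def segmentation_f1(gold_tokens, pred_tokens):
    gold, pred = to_spans(gold_tokens), to_spans(pred_tokens)
    tp = len(gold & pred)  # spans both segmentations agree on
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: gold "今日/は/晴れ" vs. system "今日/は/晴/れ"
print(segmentation_f1(["今日", "は", "晴れ"], ["今日", "は", "晴", "れ"]))  # ≈ 0.571
```

POS tagging F1 can be computed the same way by attaching the POS tag to each span, so a token only counts as correct when both its boundaries and its tag match the gold annotation.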
Method | Algorithm | F1 (Word segmentation) | F1 (POS tagging) | Speed
---|---|---|---|---
MeCab | Word-lattice/CRF | 81.07 | 71.61 | 0.031s |
KyTea | Point-wise/SVM | 87.57 | 77.79 | 0.284s |
SudachiPy | Word-lattice/CRF+Manual Cost Adjustment | 82.67 | 73.70 | 3.253s |
nagisa | Sequence-labeling/BILSTM-CRF | 85.92 | 75.10 | 1.695s |
KyTea achieved the best F1-scores on both word segmentation and POS tagging. However, since the TMU corpus was annotated by correcting KyTea's output, its F1-scores may be slightly inflated. MeCab is the fastest of these Japanese morphological analyzers and is the best choice when processing large amounts of text.
Compared to the other systems, nagisa is good at capturing kaomoji (e.g., ( ́ω`)) thanks to its character-based BILSTM. It can also be installed with a single pip install command (for use from Python, the other systems require additional installation steps). Additionally, nagisa provides useful post-processing functions (e.g., extracting/filtering specific POS tags from a text).
If you have an interest in nagisa, please try it. Thank you!
Thank you, @taishi-i, but unidic-mecab-2.1.2 seems rather old for the comparison. Please consult the recent version of UniDic: https://unidic.ninjal.ac.jp/download#unidic_bccwj .
Sorry for the late reply.
This result shows that unidic-2.3.0 is better than unidic-2.1.2.
Method | Algorithm | F1 (Word segmentation) | F1 (POS tagging) | Speed (s)
---|---|---|---|---
MeCab (unidic-mecab-2.1.2) | Word-lattice/CRF | 79.96 | 70.83 | 0.034 |
MeCab (unidic-cwj-2.3.0) | Word-lattice/CRF | 80.81 | 71.83 | 0.046 |
MeCab (unidic-csj-2.3.0) | Word-lattice/CRF | 80.80 | 71.83 | 0.057 |
SudachiPy | Word-lattice/CRF+Manual Cost Adjustment | 82.63 | 73.71 | 3.397 |
nagisa | Sequence-labeling/BILSTM-CRF | 85.92 | 75.38 | 1.692 |
KyTea | Point-wise/SVM | 87.60 | 77.82 | 0.228 |
The results differ slightly from the previous ones because the corpus was downloaded at a different time. Since the TMU Twitter corpus is retrieved via the Twitter API, the data size may change when accounts are locked or tweets are deleted.
Could you include some notes briefly comparing this to other parsers like MeCab? MeCab includes a comparison with other tokenizers/parsers. I think users would greatly benefit from knowing things like parsing-speed comparisons, accuracy, and other differences/nuances/use cases.