taishi-i / nagisa

A Japanese tokenizer based on recurrent neural networks
https://huggingface.co/spaces/taishi-i/nagisa-demo
MIT License
379 stars 22 forks

request: comparison to other tokenizers/PoS taggers #6

Closed SpongebobSquamirez closed 5 years ago

SpongebobSquamirez commented 5 years ago

Could you include some notes briefly comparing this to other parsers like MeCab? MeCab includes a comparison to other tokenizers/parsers. I think users would greatly benefit from knowing things like parsing speed, accuracy, and other slight differences, nuances, and use cases.

taishi-i commented 5 years ago

Thank you for your request. I am preparing a comparison with several morphological analyzers (MeCab, KyTea, and Sudachi). Please allow a few days for the results.

taishi-i commented 5 years ago

I summarized the simple comparison result, so please refer to it.

Test data

I used the TMU Twitter corpus (https://github.com/tmu-nlp/TwitterCorpus) to evaluate Japanese word segmentation and POS tagging. This dataset is a publicly available corpus annotated with gold word segmentation and POS tags.

Software

The TMU Twitter corpus is annotated according to the Short Word Unit (SWU) standard (https://pj.ninjal.ac.jp/corpus_center/bccwj/en/morphology.html), so the comparison is limited to morphological analyzers that can output results in SWU. I used the following morphological analyzers for comparison.

Preprocessing

In order to make a fair comparison, the following preprocessing is carried out.

Finally, I extracted 462 tweets (9596 tokens) from the TMU corpus and used them as the test data for the experiment.

Result

As an evaluation metric, we use the balanced F-measure to evaluate the performance of word segmentation and POS tagging. Speed is the time to finish processing all tweets in the test data.

| System | Method | F1 (Word segmentation) | F1 (POS tagging) | Speed |
|---|---|---|---|---|
| MeCab | Word-lattice/CRF | 81.07 | 71.61 | 0.031s |
| KyTea | Point-wise/SVM | 87.57 | 77.79 | 0.284s |
| SudachiPy | Word-lattice/CRF+Manual Cost Adjustment | 82.67 | 73.70 | 3.253s |
| nagisa | Sequence-labeling/BILSTM-CRF | 85.92 | 75.10 | 1.695s |
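
The evaluation script itself is not shown in this thread, so the following is only a minimal sketch of how a balanced F-measure for word segmentation is commonly computed. It assumes a predicted word counts as correct only when both of its character boundaries match the gold segmentation. The function names `to_spans` and `segmentation_f1` and the example sentence are illustrative, not taken from the experiment.

```python
def to_spans(words):
    """Convert a list of words into a set of (start, end) character offsets."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def segmentation_f1(gold_sentences, pred_sentences):
    """Micro-averaged F1 over all sentences.

    A predicted word is counted as correct only if a gold word covers
    exactly the same character span (assumed matching criterion).
    """
    n_correct = n_gold = n_pred = 0
    for gold, pred in zip(gold_sentences, pred_sentences):
        gold_spans, pred_spans = to_spans(gold), to_spans(pred)
        n_correct += len(gold_spans & pred_spans)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    if n_correct == 0:
        return 0.0
    precision = n_correct / n_pred
    recall = n_correct / n_gold
    return 2 * precision * recall / (precision + recall)


# Illustrative data: the prediction merges the last two gold words.
gold = [["今日", "は", "いい", "天気"]]
pred = [["今日", "は", "いい天気"]]
print(round(segmentation_f1(gold, pred), 4))
```

In this example two of the three predicted words match the four gold words, giving a precision of 2/3 and a recall of 1/2. A POS tagging score could be computed the same way by adding the tag to each span, so that a word is correct only when both its boundaries and its tag match.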

Discussion

KyTea achieved the best F1 scores on both word segmentation and POS tagging. However, because the TMU corpus was annotated by correcting KyTea's output, its F1 scores may be somewhat inflated. MeCab is the fastest of the morphological analyzers compared here, which makes it the best choice for processing large amounts of text.

Thanks to its character-based BILSTM, nagisa is better than the other systems at capturing kaomoji (e.g., (´ω`)). It can also be installed with a single `pip install` command, whereas the other systems require additional installation steps before they can be used from Python. Additionally, nagisa has useful post-processing functions (e.g., extracting or filtering specific POS tags from a text).
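
To illustrate what extracting or filtering by POS tag means, here is a small library-free sketch. The helper `select_by_postag` is hypothetical, and the words and tags below are hand-written for illustration, not actual nagisa output.

```python
def select_by_postag(words, postags, targets, keep=True):
    """Return the words whose POS tag is in `targets` (keep=True),
    or the words whose POS tag is NOT in `targets` (keep=False)."""
    targets = set(targets)
    return [w for w, t in zip(words, postags) if (t in targets) == keep]


# Hand-written example of a tokenized, tagged sentence (assumed tags).
words = ["Python", "で", "使え", "ます"]
postags = ["名詞", "助詞", "動詞", "助動詞"]

# Extraction: keep only nouns.
nouns = select_by_postag(words, postags, ["名詞"])

# Filtering: drop particles and auxiliary verbs.
content = select_by_postag(words, postags, ["助詞", "助動詞"], keep=False)

print(nouns)
print(content)
```

With nagisa installed, the word and tag lists would come from the `.words` and `.postags` attributes of the object returned by `nagisa.tagging(text)`. The nagisa README also describes `nagisa.extract(text, extract_postags=[...])` and `nagisa.filter(text, filter_postags=[...])`, which do this directly on raw text.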

If you have an interest in nagisa, please try it. Thank you!

KoichiYasuoka commented 5 years ago

Thank you, @taishi-i, but unidic-mecab-2.1.2 seems rather old for this comparison. Please consult a recent version of UniDic: https://unidic.ninjal.ac.jp/download#unidic_bccwj

taishi-i commented 5 years ago

Sorry for the late reply. These results show that unidic-2.3.0 performs better than unidic-2.1.2.

| System | Method | F1 (Word segmentation) | F1 (POS tagging) | Speed (s) |
|---|---|---|---|---|
| MeCab (unidic-mecab-2.1.2) | Word-lattice/CRF | 79.96 | 70.83 | 0.034 |
| MeCab (unidic-cwj-2.3.0) | Word-lattice/CRF | 80.81 | 71.83 | 0.046 |
| MeCab (unidic-csj-2.3.0) | Word-lattice/CRF | 80.80 | 71.83 | 0.057 |
| SudachiPy | Word-lattice/CRF+Manual Cost Adjustment | 82.63 | 73.71 | 3.397 |
| nagisa | Sequence-labeling/BILSTM-CRF | 85.92 | 75.38 | 1.692 |
| KyTea | Point-wise/SVM | 87.60 | 77.82 | 0.228 |

These results differ slightly from the earlier ones because the corpus was downloaded at a different time. The TMU Twitter corpus is retrieved through the Twitter API, so the amount of data available can change when an account is locked or a tweet is deleted.