mainrs opened 2 years ago
There's a blog post about the segmentation algorithm, but the secret sauce is really the scoring algorithm, which was built in an ad-hoc manner over the years to split sentences as well as possible. The reference implementation (which is Ichiran) is perhaps the only possible way to describe it.
Thanks for the explanation!
Do you have some kind of test suite in place to prevent regressions when making changes to the algorithm? Maybe a sentence corpus with its expected segmentations, compared against the new segmentations on each release?
If not, would there be interest in having something like this? I am not a native Japanese speaker so if you happen to know a high-quality corpus, let me know!
Yes, this is the suite. It's not particularly thorough; I was mostly including corner cases for the segmentations I wanted to fix.
The core algorithm actually seems not too different from other lattice-based segmenters (MeCab et al. are all in this family): basically all of them end up constructing a DAG of possible segmentations and finding the best possible path through it. So the real difference, I think, isn't the algorithm itself but how scores are assigned to the individual segments.
I believe MeCab was trained on a corpus and has both an individual segment cost (probably determined by frequency) and a connection cost, which allows for context dependence (e.g. capturing which segments are likely to follow or precede other segments, which also loosely captures grammatical rules). The actual way these were derived was via conditional random fields, but the math involved there looks daunting. (I think when they say CRF in the MeCab paper they are actually referring to a linear-chain CRF: given an annotated corpus, the CRF learns the conditional probability of transitioning from one segment to the next, which is then used as a weight in the graph.)
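To make the lattice idea concrete, here's a rough Kotlin sketch (not MeCab's or Ichiran's actual code; the dictionary entries, word costs, and connection table are all made up for illustration): every dictionary match at every position becomes an edge in a DAG, and dynamic programming picks the path with the lowest total of per-segment costs plus connection costs.

```kotlin
// Toy lattice segmenter in the MeCab family: every dictionary entry found at every
// position becomes a node; the answer is the minimum-cost path across the sentence.
data class Entry(val surface: String, val pos: String, val wordCost: Int)

// Hypothetical connection cost: how (un)likely POS `right` is to follow POS `left`.
fun connectionCost(left: String, right: String): Int = when (left to right) {
    "noun" to "particle" -> -50   // particles commonly follow nouns
    "particle" to "verb" -> -30
    else -> 0
}

fun segment(text: String, dict: Map<String, Entry>): List<Entry> {
    val n = text.length
    // best[i] = (cost of the best path covering text[0 until i], the path itself)
    val best = arrayOfNulls<Pair<Int, List<Entry>>>(n + 1)
    best[0] = 0 to emptyList<Entry>()

    for (i in 0 until n) {
        val soFar = best[i] ?: continue          // position i is unreachable
        for (j in i + 1..n) {
            val entry = dict[text.substring(i, j)] ?: continue
            val conn = soFar.second.lastOrNull()
                ?.let { connectionCost(it.pos, entry.pos) } ?: 0
            val cost = soFar.first + entry.wordCost + conn
            if (best[j] == null || cost < best[j]!!.first) {
                best[j] = cost to (soFar.second + entry)
            }
        }
    }
    return best[n]?.second ?: emptyList()
}

// Usage sketch (romanized for readability; a real dictionary holds Japanese surface forms):
// val dict = listOf(Entry("neko", "noun", 100), Entry("ga", "particle", 10),
//                   Entry("taberu", "verb", 120)).associateBy { it.surface }
// segment("nekogataberu", dict)  // -> [neko, ga, taberu]
```

The point is that the search itself is generic; everything interesting lives in `wordCost` and `connectionCost`, which is where the corpus-trained weights (or Ichiran's heuristics) come in.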
I think the main problem with MeCab is that the dictionary used is poor quality (incomplete words, inconsistent segmentation, too granular) and the tokenization is too fine-grained compared to what you'd expect (e.g. tabeteiru gets parsed into tabe + te + iru because its corpus was annotated at this fine-grained level), so people have to spend a lot of effort patching up the results (Cutlet does this).
The authors of the Sudachi tokenizer noted this issue: https://aclanthology.org/L18-1355.pdf. From what I can see it uses the same CRF training scheme as MeCab to derive weights, but with less granular tokenization, so it gives better-quality results.
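As a toy illustration of the kind of patching involved (this is not Cutlet's actual code, and the POS names are simplified stand-ins for a real tag set), a post-processing pass might glue the auxiliary pieces back onto the stem:

```kotlin
// Hypothetical post-processing of an over-segmented MeCab-style output:
// attach auxiliary verbs and conjunctive particles to the preceding token,
// so that ["tabe", "te", "iru"] comes out as ["tabeteiru"].
data class Token(val surface: String, val pos: String)

fun mergeAuxiliaries(tokens: List<Token>): List<String> {
    val glue = setOf("auxiliary-verb", "conjunctive-particle")
    val out = mutableListOf<String>()
    for (t in tokens) {
        if (t.pos in glue && out.isNotEmpty()) {
            out[out.size - 1] = out.last() + t.surface   // attach to previous token
        } else {
            out.add(t.surface)
        }
    }
    return out
}

// mergeAuxiliaries(listOf(Token("tabe", "verb"), Token("te", "conjunctive-particle"),
//                         Token("iru", "auxiliary-verb")))  // -> ["tabeteiru"]
```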
Based on your blog post, what Ichiran does, I believe, is skip the need for a corpus entirely and go with costs derived purely from the length of the dictionary match, along with some heuristics inspired by knowledge of the grammar (e.g. conjugation, particles).
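Something in that spirit might look like the toy score below (to be clear, this is not Ichiran's actual formula, which as noted above is really only described by the reference implementation): longer dictionary matches win, with small grammar-informed nudges.

```kotlin
// Toy scoring in the spirit of a pure-dictionary segmenter: no corpus, just the
// length of the matched dictionary entry plus grammar-informed bonuses.
// NOT Ichiran's real scoring, only an illustration of the idea.
data class Candidate(val surface: String, val isParticle: Boolean, val isConjugatedForm: Boolean)

fun score(c: Candidate): Int {
    var s = c.surface.length * c.surface.length   // strongly prefer longer matches
    if (c.isParticle) s += 1                      // small nudge for known particles
    if (c.isConjugatedForm) s += c.surface.length // reward matching a whole conjugated form
    return s
}
```

Plugged into the same best-path search as the earlier sketch (maximizing instead of minimizing), this would replace the corpus-trained costs entirely.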
I'm actually surprised no one has explored these pure-dictionary segmenters more, since they seem much simpler to implement, have zero training cost, produce much higher-quality output, and behave more predictably than the CRF-trained ones.
Edit: I do suspect one reason MeCab is preferred in industry is that in the context of named entity recognition (segmenting places, people, basically anything involving proper nouns), a CRF can handle this case better without degrading segmentation quality, because it can effectively "recognize" (thanks to the connection costs) when words are likely being used as proper nouns. Whereas with a pure-dictionary approach, incorporating proper nouns might start to degrade segmentation.
> I'm actually surprised no one has explored these pure-dictionary segmenters more, since they seem much simpler to implement, have zero training cost, produce much higher-quality output, and behave more predictably than the CRF-trained ones.
I guess the drawback is that you have to manually add a lot of scoring costs for edge cases or exceptions in grammar or otherwise. I do agree that I get higher-quality output from Ichiran than from any other system I have tried!
> manually add a lot of scoring costs for edge cases or exceptions in grammar or otherwise
You'd have to do the same with MeCab though, because, putting aside the granularity of tokenization, in some cases its segmentation is simply wrong. And it's actually easier to reason about a pure-dictionary-based system, since you can systematically change the heuristics themselves rather than trying to change the weights assigned to individual tokens in an ad-hoc manner.
Which is why I think the main reason pure-dictionary approaches aren't used that much (surely the smart people must have thought about this) is that they may not scale well when you add in lots of named entities (if you consider the contexts tokenizers are used in for NLP workloads in industry, most sentences will contain lots of named entities).
You could try to use more heuristics to integrate a named-entity corpus into Ichiran without affecting the quality of regular segmentations (e.g. recognizing that named entities will usually follow a particle), but at some point it feels like a tradeoff where going with the corpus-training approach becomes more appealing.
Edit: There might also be cases where the weights derived via corpus training can better handle segmenting long strings of kanji that should be broken up but whose naive dictionary segmentation is ambiguous. I don't know how many such cases there are in "real-world" text, but the implicit frequencies and context learned from a corpus could handle them better.
Would a hybrid approach work? Maybe create an NER model whose only task is to derive entities, and use that as some form of input hint to Ichiran.
Yeah, I think that would certainly work (in fact, a CRF is one of the widely used ways to do NER, but there are probably other algorithms I'm not aware of). You could do all sorts of feature engineering to try to create a better scoring function if you had access to an annotated corpus to train on (e.g. maybe take part-of-speech transition probabilities into account, like MeCab does).
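As a sketch of how the hint could be wired in (the names and score values here are hypothetical, not anything Ichiran actually exposes): run the NER pass first, then inject each detected span into the lattice as one more candidate segment with a favorable score, so the ordinary dictionary-driven search can pick it up when it fits.

```kotlin
// Hypothetical glue for the hybrid approach: an external NER model proposes
// entity spans, and each span is added to the lattice as an extra candidate
// segment with a score bonus, leaving ordinary segmentation untouched.
data class Span(val start: Int, val end: Int, val label: String)      // from the NER model
data class Segment(val start: Int, val end: Int, val surface: String, val score: Int)

fun injectEntities(
    text: String,
    candidates: MutableList<Segment>,   // dictionary-derived candidates already in the lattice
    entities: List<Span>,
    entityScore: Int = 50               // made-up bonus so entities compete with long matches
): MutableList<Segment> {
    for (e in entities) {
        candidates.add(Segment(e.start, e.end, text.substring(e.start, e.end), entityScore))
    }
    return candidates
}
```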
In terms of corpora available to the public, there are KWDLC and the Kyoto Corpus, but I don't know whether they're high quality or not.
Also, this is unrelated to the discussion, but I suspect Transformer models (GPT et al.) would do very well at segmentation, given that handling sequential data is their forte. But that's like using a nuke to kill a mosquito.
Also, for anyone following along who is interested in Japanese NLP, the best books here seem to all be in Japanese, e.g. 形態素解析の理論と実装 (Theory and Implementation of Morphological Analyzers) by the author of MeCab himself.
I'd like to integrate segmentation into an app I wrote. As far as I know, it's not possible to run Lisp on Android.
Is there some kind of documentation that explains how the segmentation works? Even if it takes more time, I'd like to port it over to Kotlin.