mikemccand / stargazers-migration-test

Testing Lucene's Jira -> GitHub issues migration

Decouple Kuromoji's morphological analyser and its dictionary [LUCENE-8816] #814

Open mikemccand opened 5 years ago

mikemccand commented 5 years ago

I was inspired by this mailing-list thread: http://mail-archives.apache.org/mod_mbox/lucene-java-user/201905.mbox/%3CCAGUSZHA3U_vWpRfxQb4jttT7sAOu%2BuaU8MfvXSYgNP9s9JNsXw%40mail.gmail.com%3E

As many Japanese users already know, the default built-in dictionary bundled with Kuromoji (MeCab IPADIC) is quite old and has not been maintained for many years. While it has slowly become obsolete, well-maintained and/or extended dictionaries have risen up in recent years (e.g. mecab-ipadic-neologd, UniDic). Various attempts/projects/efforts have been made in Japan to use them with Kuromoji.

However, the current architecture - a jar with the dictionary bundled in - is essentially incompatible with the idea of "switching the system dictionary", and developers have difficulty doing so.

Traditionally, the morphological analysis engine (Viterbi logic) and the encoded dictionary (language model) have been decoupled (as in MeCab, the origin of Kuromoji, or lucene-gosen). So decoupling them is a natural idea, and I feel it's a good time to re-think the current architecture.

This would also benefit advanced users who have customized/re-trained their own system dictionary.

Goals of this issue:

Non-goals:

I have not dived into the code yet, so I have no idea whether this is easy or difficult at the moment.


Legacy Jira details

LUCENE-8816 by Tomoko Uchida (@mocobeta) on May 28 2019, updated Jan 27 2022 Linked issues:

mikemccand commented 5 years ago

If the licensing challenge in LUCENE-4056 is overcome, then only technical hurdles remain :)

It is now possible to "support" different dictionaries (maybe by storing some data differently depending on the dictionary), but I think it's important to consider how tests (and the build system) would work. Some of the data structures etc. are complicated, so you really need the unit test suite to know things are working correctly. The current flow (changing build.xml) does support this for advanced users.

I am concerned about a command-line tool because, without testing, the output it generates may fail unexpectedly at runtime due to bugs or broken assumptions...

[Legacy Jira: Robert Muir (@rmuir) on May 28 2019]

mikemccand commented 5 years ago

What if we changed the various dictionary classes to load-on-demand from a configurable classpath-directory, rather than from a single built-in one? If we do that, then users can supply a separate jar containing only the model files, and reference it when initializing the JapaneseAnalyzer.

Also, about testing: I think we can test using the built-in dictionary. Do we need to unit-test dictionaries that we don't provide? Or are you anticipating providing multiple dictionaries as part of the Lucene distro itself? I think both have merit: exposing the ability to bring your own dictionary, and providing better dictionaries.

[Legacy Jira: Michael Sokolov (@msokolov) on May 28 2019]
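A minimal sketch of the classpath-based loading idea above, assuming a hypothetical loader class and resource layout (these names are illustrative, not Kuromoji's actual API). Any jar on the classpath that provides the expected resource path could then supply the model files:

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch: resolve dictionary data from whatever jar is on the
// classpath, so a separate dictionary-only jar can override the built-in one.
public class DictionaryResourceLoader {

    // The resource prefix is an assumption for illustration, not the real layout.
    private static final String RESOURCE_PREFIX = "kuromoji/dict/";

    // Look the resource up via the class loader; any classpath entry that
    // provides this path satisfies the lookup.
    public static InputStream open(String fileName) throws IOException {
        InputStream in = DictionaryResourceLoader.class.getClassLoader()
                .getResourceAsStream(RESOURCE_PREFIX + fileName);
        if (in == null) {
            // Fail fast with a clear message instead of a late NullPointerException.
            throw new IOException("Dictionary resource not found on classpath: "
                    + RESOURCE_PREFIX + fileName);
        }
        return in;
    }
}
```

With this shape, users would point the JapaneseAnalyzer at a model-only jar simply by putting it on the classpath.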

mikemccand commented 5 years ago

From the perspective of this code, the dictionaries are basically in different formats. Does that make sense? It is more complicated than just "different words and costs".

[Legacy Jira: Robert Muir (@rmuir) on May 28 2019]

mikemccand commented 5 years ago

@rmuir @msokolov: Thanks for the comments and advice!

What if we changed the various dictionary classes to load-on-demand from a configurable classpath-directory, rather than from a single built-in one?

On this point, I thought it would be better if each encoded dictionary had its own "name" and users specified the dictionary name to be used at some point during initialization (for safety reasons, since there could be multiple re-trained models), rather than relying on fully automatic discovery the way SLF4J does. Just an idea.

[Legacy Jira: Tomoko Uchida (@mocobeta) on May 28 2019]
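The name-based selection idea above might be sketched as a simple explicit registry, assuming hypothetical class and method names (the real dictionary abstraction would come from Kuromoji itself):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch of a name-keyed dictionary registry: each encoded
// dictionary registers under an explicit name, and analyzers request one
// by name at initialization time (no automatic discovery).
public class DictionaryRegistry {

    // Stand-in for an encoded system dictionary; the real type would be
    // Kuromoji's dictionary interface.
    public interface SystemDictionary {}

    private static final Map<String, SystemDictionary> REGISTRY = new ConcurrentHashMap<>();

    public static void register(String name, SystemDictionary dict) {
        if (REGISTRY.putIfAbsent(name, dict) != null) {
            throw new IllegalStateException("Dictionary already registered: " + name);
        }
    }

    public static SystemDictionary lookup(String name) {
        SystemDictionary dict = REGISTRY.get(name);
        if (dict == null) {
            // Unknown names fail loudly, which is the safety property the
            // comment above argues for.
            throw new IllegalArgumentException("Unknown dictionary: " + name);
        }
        return dict;
    }
}
```

An analyzer would then be constructed with, say, `lookup("mecab-ipadic")`, making the choice of model explicit rather than discovered.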

mikemccand commented 5 years ago

We discussed this when we added the Korean module and said that we could have a separate module to handle "mecab-like" tokenization and one module per dictionary (ipadic, mecab-ko-dic, ...). There are some assertions in the JapaneseTokenizer that check some invariants of the ipadic (leftId == rightId, for instance), but I guess we could move them into the dictionary module. This could be a nice cleanup if the goal is to handle multiple MeCab dictionaries (in different languages).


While it has slowly become obsolete, well-maintained and/or extended dictionaries have risen up in recent years (e.g. mecab-ipadic-neologd, UniDic). Various attempts/projects/efforts have been made in Japan to use them with Kuromoji.


While allowing more flexibility would be nice, I wonder if there are really that many different dictionaries. If the ipadic is obsolete, we could also adapt the main distribution (kuromoji) to use the UniDic instead. Even if we handle multiple dictionaries, we'll still need to provide a way for users to add custom entries. MeCab has an option to compute the leftId, rightId, and cost automatically from a partial user entry, so I wonder if this could help users avoid reimplementing a dictionary from scratch?


[Legacy Jira: Jim Ferenczi (@jimczi) on May 28 2019]

mikemccand commented 5 years ago

Hi @jimczi,

I just skimmed your comments and will try to answer your questions. Correct me if I missed your points.

This could be a nice cleanup if the goal is to handle multiple mecab dictionaries (in different languages).

I hadn't thought of that, but of course it seems like a good unification to me.

While allowing more flexibility would be nice I wonder if there are that many different dictionaries.

In my view, there are only two dictionary formats we should support: MeCab IPADIC and UniDic. There are some other old dictionaries - NAIST-jdic or ChaSen ipadic - but they are completely obsolete now and rarely used (as far as I know). There are several well-known extensions of mecab-ipadic and unidic, so by supporting those two formats we can cover almost all common variants (in Japan).

If the ipadic is obsolete we could also adapt the main distribution (kuromoji) to use the UniDic instead.

Yes, I think so. But I am not sure we should select UniDic as the default immediately. While users often complain about MeCab IPADIC, it is still a high-quality and widely accepted dictionary. And even when we do change the default dictionary to UniDic (I think that is definitely OK after giving users some time to migrate and test their applications), we have to keep the option to use the old MeCab IPADIC for users who trust it and do not need "contemporary words".

Even if we handle multiple dictionaries we'll still need to provide a way for users to add custom entries. Mecab has an option to compute the leftId, rightId and cost automatically from a partial user entry so I wonder if this could help to avoid users to reimplement a dictionary from scratch ?

Yes, custom entries (user dictionaries) are a necessary option for customizing tokenization behaviour, especially for users with little NLP skill :)

[Legacy Jira: Tomoko Uchida (@mocobeta) on May 28 2019]

mikemccand commented 5 years ago

Hi everybody, thank you for opening the issue, @mocobeta!

To be honest, when I first talked about a custom system dictionary, I did not see the big picture.

Anyway, the structure I have in mind is as follows.


  1. A developer-oriented dictionary build tool, as Tomoko said. The "ant regenerate" command inside the build.xml that I checked has the following steps:
    1) Compile the code (compile-tools)
    2) Download the jar file (download-dict)
    3) Save Noun.proper.csv diffs (patch-dict)
    4) Run DictionaryBuilder and make dat files (build-dict)

It does not matter if the user builds only the system dictionary. (Of course there is a problem with modifying the classpath.) For example: ant build-dict ipadic /home/my/path/customDicIn (custom-dic input) /home/my/path/customDicOutput (dat output) utf-8 false

However, if the user needs to get a dictionary from a server, they have to modify build.xml; as far as I know, the URL path is hard-coded. Of course, the user can do this by modifying ivy.xml and build.xml, but my personal opinion is that users should not have to touch Lucene's internal code (even the build script). Some users may be afraid to change it or feel reluctant to use it (especially those who have not used Apache Ant). However, this may differ from person to person.


  2. Version control. I actually think this is the biggest problem. As I mentioned in the email, when the Lucene version goes up, users have to unconditionally rebuild their system dictionary and put it in the jar, because the current process is:
    1) The same process as in step 1
    2) Move the user's system dictionary dat files to resources/org.apache.lucene.analysis.ja.dict
    3) ant jar

Because of step 3, the user always has to rebuild the kuromoji module or patch the kuromoji jar. Users can feel irritated when there is no kuromoji module change in the new version.

This problem could be solved easily if the system dictionary could simply be parameterized in JapaneseTokenizer. (Of course, expert javadoc would be required.)

[Legacy Jira: Namgyu Kim (@danmuzi) on May 28 2019]

mikemccand commented 5 years ago

Hi @danmuzi,

thanks for speaking up about this.

The dictionary builder tool should be included in the Lucene binary distribution tar/zip so that developers do not need to download the Lucene source code. I imagine it as an ordinary command-line tool that runs on the JDK, with options to specify the dictionary format ("mecab-ipadic" or "unidic") and the raw dictionary data location (something like -path or -url).

About version control, I believe the dictionary data encoding should be compatible at least within the same Lucene major version (like Codecs), so that users do not need to re-build the dictionary at every Lucene/Solr or Elasticsearch minor version bump. Ideally the dictionary version would be independent of the Lucene version, but the encoding could still change for some reason (e.g. a performance improvement) in the future. Anyway, I think that would not happen often.


[Legacy Jira: Tomoko Uchida (@mocobeta) on May 28 2019]

mikemccand commented 5 years ago

As a suggestion for the first concrete step: the various asserts in the current tooling should be changed to real if statements with exception messages. I think the current use of asserts was just laziness on my part, and it should be fixed if we want to make the tooling more usable and robust.

[Legacy Jira: Robert Muir (@rmuir) on May 29 2019]

mikemccand commented 5 years ago

Thanks for the answers :D

Oh, I think your idea is good, @mocobeta. It is important to ensure that users do not need to modify the source code. If we take that approach, we should be able to create a build tool that supports the two dictionary formats in a CLI environment. And it's very important to handle exceptions well, as @rmuir said; otherwise, users will think the tool's stability is poor.

About version control, I totally agree with you.

[Legacy Jira: Namgyu Kim (@danmuzi) on May 29 2019]

mikemccand commented 5 years ago

Thanks @rmuir,

I checked the current assertions in the build tool. It seems there are mecab-ipadic-specific constraints as well as more general ones that have to be checked for both mecab-ipadic and unidic. The former should be relaxed to support mecab-ipadic extensions or the UniDic format, and the latter would be better converted to exceptions, as suggested.

[Legacy Jira: Tomoko Uchida (@mocobeta) on May 30 2019]

mikemccand commented 5 years ago

I looked over the diffs between the current Kuromoji and Nori code to examine whether we can integrate them.

While there are many overlaps (copied lines), there are also a lot of differences that cannot be merged easily. (I've not studied the details, but there may have been Korean-specific dictionary/tokenizer modifications?)

I've heard that MeCab and MeCab IPADIC were designed to be language independent, but it seems things are not as easy as they look. :)

I think the generalization/integration (if it's possible) should be treated in separate issues. I'd like to propose the following:

[Legacy Jira: Tomoko Uchida (@mocobeta) on May 30 2019]

mikemccand commented 5 years ago

This sounds like a great plan, @mocobeta. Decoupling the system dictionary should help with the merge of the Korean tokenizer, but I agree that this merge is out of scope for this issue.

[Legacy Jira: Jim Ferenczi (@jimczi) on May 30 2019]

mikemccand commented 5 years ago

That's a good plan, @mocobeta!

About the Tokenizer... I'm not sure whether it is better to combine KoreanTokenizer and JapaneseTokenizer.

If we want to, there is a way. KoreanTokenizer and JapaneseTokenizer have parts with duplicate code. We could create an abstract class called MecabTokenizer (both are MeCab based). MecabTokenizer would hold the duplicated parts (like isPunctuation(), the WrappedPositionArray inner class, ...). What do you think about it?

Anyway, combining KoreanTokenizer and JapaneseTokenizer would be a big task, and I think it's right to create a new issue for it.

[Legacy Jira: Namgyu Kim (@danmuzi) on May 30 2019]
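The proposed shared base class could look roughly like this sketch. MecabTokenizer is the hypothetical name from the comment above, and this isPunctuation() is a simplified stand-in based on Unicode general categories, not the exact Lucene implementation:

```java
// Sketch of a shared base class for the MeCab-style tokenizers. Duplicated
// helpers such as isPunctuation() would move here, while the language-specific
// dictionary and Viterbi details stay in the Japanese/Korean subclasses.
public abstract class MecabTokenizer {

    // Simplified punctuation test by Unicode general category; the real
    // tokenizers' version also special-cases a few individual code points.
    public static boolean isPunctuation(char ch) {
        switch (Character.getType(ch)) {
            case Character.SPACE_SEPARATOR:
            case Character.LINE_SEPARATOR:
            case Character.PARAGRAPH_SEPARATOR:
            case Character.CONTROL:
            case Character.FORMAT:
            case Character.DASH_PUNCTUATION:
            case Character.START_PUNCTUATION:
            case Character.END_PUNCTUATION:
            case Character.CONNECTOR_PUNCTUATION:
            case Character.OTHER_PUNCTUATION:
            case Character.MATH_SYMBOL:
            case Character.CURRENCY_SYMBOL:
            case Character.MODIFIER_SYMBOL:
            case Character.OTHER_SYMBOL:
            case Character.INITIAL_QUOTE_PUNCTUATION:
            case Character.FINAL_QUOTE_PUNCTUATION:
                return true;
            default:
                return false;
        }
    }

    // Each language module supplies its own dictionary-driven segmentation.
    protected abstract void parse();
}
```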

mikemccand commented 5 years ago

Separating out the dictionaries is a great idea.

@rmuir made great efforts to make the original dictionary tiny, and some assumptions were made based on the value ranges of the original source data.

To me it sounds like a good idea to keep the Japanese and Korean dictionaries separate initially and consider combining them later, once the implications of such a combination are clear. I agree with @jimczi.

[Legacy Jira: Christian Moen on May 31 2019]

mikemccand commented 5 years ago

Yeah, there are some optimizations specific to Japanese writing systems in the way we store the data: for example, writing katakana with a single byte, hiragana<->katakana transformations, and so on. The kana really must be optimized, due to the way it is used in the language and the dictionary; otherwise it wastes tons of space.

But there are also some of these assertions about value ranges specific to ipadic: not so great. From my perspective, the problem with other dictionaries was always a licensing one. We want to at least be able to test any dictionary we want to support. This has changed, so I think it makes sense to look at how to really support other Japanese dictionaries compatible with the Apache license. In the worst case it might mean representing some data differently, because we need more bits.

I don't have any clear idea on how to test and package that. Maybe with the Gradle build it will become easier to build and test two jar files, making it easiest for users to consume? It may not have the proper tools and APIs to support the custom dictionary case (yet), but it would give users choices.

Just trying to help us think about how to bite off small pieces at a time...

[Legacy Jira: Robert Muir (@rmuir) on May 31 2019]

mikemccand commented 5 years ago

Thank you guys, for your comments and suggestions.

To me (having little knowledge of the code for now), to be honest, it looks quite difficult to merge the two tokenizers, because they seem to have evolved separately to a great extent. So I thought we could share only the DictionaryBuilder between the Kuromoji and Nori modules, as I wrote in the previous comment. However, if the tokenizers can be combined in the future, that would be a great unification.

For the time being, I will work on JapaneseTokenizer, and I welcome implementation advice from the viewpoint of possible future integration.

[Legacy Jira: Tomoko Uchida (@mocobeta) on May 31 2019]

mikemccand commented 5 years ago

Related to the comment from @rmuir,

This has changed, so I think it makes sense to look at how to really support other japanese dictionaries compatible with the apache license. It might mean representing some data differently in the worst case because we need more bits.

I will try to describe my current rough thoughts.

As for Japanese dictionaries, I think we should restrict supported dictionaries to only "MeCab IPADIC" and "UniDic"; both are distributed under OSS licenses that are compatible with ASF policy. Although there are a few well-known extensions of them, or user-customized ones, already used with Kuromoji in the wild, we need not (actually, cannot) give "official" support or a strict compatibility policy to the variants.

Nevertheless, it would be good for users to allow small adjustments at some points in the implementation so that well-known variants of mecab-ipadic or unidic can be built. More concretely, for example: "relaxing the maximum string length allowed as input from X to Y when option Z is given, with possible performance degradation or an unoptimized data representation".

Does that make sense to you? Currently I have no idea about the performance or bit-level data size tuning that has already been done, so please correct me if I have missed the point.

Anyway, I'd like to start by understanding the exact meaning of the current assertions/restrictions in the builder class, and then discuss with you how we should/can change this, a little later :)

[Legacy Jira: Tomoko Uchida (@mocobeta) on May 31 2019]

mikemccand commented 5 years ago

So I thought we can share only DictionaryBuilder between Kuromoji and Nori module, as I wrote in the previous comment.

I agree with your idea. @mocobeta

I don't think it would be difficult to merge the DictionaryBuilders (except for BinaryDictionaryWriter), but I think the BinaryDictionaryWriter case can be solved if we separate methods (and use DictionaryFormat). Can I try this while you are concentrating on JapaneseTokenizer?

[Legacy Jira: Namgyu Kim (@danmuzi) on May 31 2019]

mikemccand commented 5 years ago

I don't think it would be difficult to merge the DictionaryBuilders (except for BinaryDictionaryWriter).

But I think the BinaryDictionaryWriter case can be solved if we separate methods (and use DictionaryFormat). Can I try this while you are concentrating on JapaneseTokenizer?

Please open another issue for this. Any generalization of DictionaryBuilder and its auxiliary classes should be treated as a separate issue, if you agree with my plan:

  • First, decouple the encoded system dictionary (mecab-ipadic) into a jar separate from the kuromoji jar, and clean up the dictionary builder tool. This is the scope of this issue.
  • Then generalize the dictionary builder tool so it can handle the Korean dictionary (mecab-ko-dic), in a separate issue.
  • Lastly, decouple the Korean system dictionary into a jar separate from the nori jar, maybe in yet another issue.

Just to make things clear: I will focus on JapaneseTokenizer and DictionaryBuilder in the kuromoji module in this issue, so the DictionaryBuilder (including its auxiliary classes) may be significantly modified here. To avoid confusion, I'd personally like to proceed in the right order - cleaning up first, then generalizing. But if you are sure we can go in parallel, can you share your plan?

[Legacy Jira: Tomoko Uchida (@mocobeta) on Jun 01 2019]

mikemccand commented 5 years ago

Oh, you're right, @mocobeta :D I'll create a new JIRA issue after cleaning up the changes.

To avoid confusion, personally I'd like to proceed things in a right order - cleaning up first, then generalizing. But if you are sure that we can go in parallel, can you share your plan?

Sure. It's an important thing. I think we can proceed in parallel.

There are two possible cases. 1) You finish the JapaneseTokenizer and DictionaryBuilder job first. In that case, I can pull your new code and merge it with nori's DictionaryBuilder.

2) I finish merging DictionaryBuilder (nori) and DictionaryBuilder (kuromoji) first. In that case, you can pull and continue. The DictionaryBuilder logic of kuromoji does not change at all in my work.

But if you think that is a little inefficient, I'll do it later. What do you think?

[Legacy Jira: Namgyu Kim (@danmuzi) on Jun 01 2019]

mikemccand commented 5 years ago

OK, if you are able to merge the two dictionary builders without any effect on kuromoji, just open the issue (don't forget to add a link back to here) and work on it. Once your branch/patch has been approved and pushed to the upstream Lucene repo (ASF gitbox), I will merge it from upstream and continue my work.

Please keep in mind: I won't merge any branch/patch into my local branch that has not yet been merged to upstream master (in other words, I will merge or cherry-pick only from upstream master). And vice versa: never merge my WIP branch/patch into your branch/patch.

I cannot take the time to review your branch/patch (anyway, I am still new to kuromoji and nori); I hope someone with commit privileges takes care of @danmuzi's patch.

[Legacy Jira: Tomoko Uchida (@mocobeta) on Jun 01 2019]

mikemccand commented 5 years ago

Thanks for your reply! @mocobeta

I created the LUCENE-8817 issue and linked it to this issue with a "related to" link. I'll notify you if there is some progress.

[Legacy Jira: Namgyu Kim (@danmuzi) on Jun 01 2019]

mikemccand commented 5 years ago

Hi @danmuzi,

thanks for opening the issue. I hope both changes will be safely merged at some point.

I'll notify you if there is some progress.

I will keep watching your updates there, though I'd like to withhold approval or disapproval for now.

[Legacy Jira: Tomoko Uchida (@mocobeta) on Jun 01 2019]

mikemccand commented 5 years ago

Thanks! @mocobeta  :D

[Legacy Jira: Namgyu Kim (@danmuzi) on Jun 01 2019]

mikemccand commented 5 years ago

I see that in BinaryDictionaryWriter we restrict the incoming leftID (and rightID) to be < 4096 because we are going to pack it into a 16-bit short with 3 flag bits. However, it seems we have room for one more bit (since 2^(16-3) == 8192). Am I missing something? Do we use that other bit somewhere? I see, e.g., that in BinaryDictionary when we decode, we >>> 3 to get back the ids, so I think it should be OK to allow ids up to 8191. @rmuir, do you know why it is currently limited to 4096? Also, I think it would make sense to change the asserts there to IllegalArgumentException so they are raised whenever the tool is run, since we would get garbage if this limit is exceeded and (I think) nothing else will catch it.

[Legacy Jira: Michael Sokolov (@msokolov) on Jun 11 2019]
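The arithmetic above can be checked with a small sketch of the pack/unpack scheme. The field layout is simplified for illustration (the three flag bits are stand-ins; the real BinaryDictionaryWriter format packs more state), but the 13-bit headroom is the same:

```java
public class EntryIdPacking {

    // Three low bits are reserved for flags; the id lives in the high bits.
    static final int FLAG_BITS = 3;
    static final int MAX_ID = (1 << (16 - FLAG_BITS)) - 1; // 2^13 - 1 = 8191

    // Pack an id and three flags into one 16-bit value.
    static short pack(int id, boolean f1, boolean f2, boolean f3) {
        if (id < 0 || id > MAX_ID) {
            // Real check (not an assert), per the discussion in this thread.
            throw new IllegalArgumentException("id out of range: " + id);
        }
        int v = id << FLAG_BITS;
        if (f1) v |= 1;
        if (f2) v |= 2;
        if (f3) v |= 4;
        return (short) v;
    }

    // Decode mirrors BinaryDictionary's ">>> 3"; mask to 16 bits first so the
    // unsigned shift is not polluted by sign extension of the short.
    static int unpackId(short packed) {
        return (packed & 0xFFFF) >>> FLAG_BITS;
    }
}
```

Since 8191 round-trips cleanly here, the 4096 limit does indeed look one bit more conservative than the encoding requires, as the comment above suggests.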

mikemccand commented 5 years ago

Mike: yes, I agree with you. The use of assert was laziness on my part; we should treat it as technical debt and fix it. See my earlier comments on this JIRA issue for elaboration.

[Legacy Jira: Robert Muir (@rmuir) on Jun 11 2019]

mikemccand commented 5 years ago

Thanks Robert. Yeah, I understand this was built for a single dictionary, not as a general-purpose tool, and hardening is required to enable wider usage.

[Legacy Jira: Michael Sokolov (@msokolov) on Jun 12 2019]

mikemccand commented 5 years ago

I'd like to add some more information: the leftID (and rightID) is tied to the POS tags, and in practice there are not so many POS tag variations. I think the current constraint leftId < 4096 (or leftId < 8192, if it can easily be changed) is perfectly okay if the following conditions are met:

  1. The dictionary learner/re-trainer program included in the mecab-ipadic devtool does not generate leftID (or rightID) values larger than 4096 (or 8191).
  2. UniDic (which I'd like to support in this issue, as I wrote in the issue description) has no leftID (or rightID) values greater than 4096 (or 8191).
  3. The few well-known variants of mecab-ipadic or unidic do not have leftID (or rightID) values larger than 4096 (or 8191).

Give me some time to examine whether we need to reconsider the constraint. (It's just a guess, but the original MeCab itself is also performance-savvy software, so it could have similar restrictions in its dictionary format.) At least regarding point 3, I think I can talk with the dictionary developers about it before tackling the Lucene code, if needed.

There is another possibility: users could assign large leftID (and rightID) values by hand in their customized dictionary, but I don't think we should worry about that. I have no idea about Korean dictionaries.

I agree that it would be better to change all the assertions to exceptions so that users can figure out the problem with their customized dictionary.

[Legacy Jira: Tomoko Uchida (@mocobeta) on Jun 12 2019]

mikemccand commented 5 years ago

I opened LUCENE-8863 to cover some small, but blocking, issues I uncovered while loading a custom dictionary.

[Legacy Jira: Michael Sokolov (@msokolov) on Jun 15 2019]

mikemccand commented 5 years ago

I opened LUCENE-8869 to just decouple the system dictionary resource into a separate jar and use it as the default (as is).

[Legacy Jira: Tomoko Uchida (@mocobeta) on Jun 19 2019]

mikemccand commented 5 years ago

I opened LUCENE-8871 to cover moving the dictionary builder tools into the main kuromoji source tree, mostly so they get tested properly.

[Legacy Jira: Michael Sokolov (@msokolov) on Jun 19 2019]

mikemccand commented 2 years ago

I'm sorry for leaving this for such a long time. We now have stable Gradle build infrastructure; I think we are ready to restart work on this.

I thought only about Kuromoji when I opened this issue, but my mind has slightly changed since then. Before working on this issue, I wonder if we should explore the possibility of unifying the dictionary builder/loader of Kuromoji and Nori, so that both modules benefit from decoupling the data and the analysis engine simultaneously. It would also significantly reduce code duplication. The common or base dictionary builder could be placed in analysis-common.

As for the decoupling, JPMS support is also needed; we may have to open up some packages in the dictionary module to the analysis-common module.

I have to say I still don't have a fully detailed picture (and my progress will be slow for such extensive refactoring); I would welcome any feedback.

[Legacy Jira: Tomoko Uchida (@mocobeta) on Jan 27 2022]