Combine Nori and Kuromoji DictionaryBuilder [LUCENE-8817]

mikemccand / stargazers-migration-test

Testing Lucene's Jira -> GitHub issues migration

0 stars 0 forks source link

Combine Nori and Kuromoji DictionaryBuilder [LUCENE-8817] #815

Open mikemccand opened 5 years ago

mikemccand commented 5 years ago

This issue is related to LUCENE-8816.

Currently Nori and Kuromoji Analyzer use the same dictionary structure. (MeCab) If we make combine DictionaryBuilder, we can reduce the code size. But this task may have a dependency on the language. (like HEADER string in BinaryDictionary and CharacterDefinition, methods in BinaryDictionaryWriter, ...) On the other hand, there are many overlapped classes.

The purpose of this patch is to provide users of Nori and Kuromoji with the same system dictionary generator.

It may take some time because there is a little workload. The work will be based on the latest master, and if the LUCENE-8816 is finished first, I will pull the latest code and proceed.

Legacy Jira details

LUCENE-8817 by Namgyu Kim (@danmuzi) on Jun 01 2019, updated Jun 10 2019 Linked issues:

LUCENE-8816

mikemccand commented 5 years ago

I share the current status. The merging is almost over and I need some discussion.

I thought several structures.

Save in tools of analysis-common module. It is simple, but I think MeCab is difficult to see as a feature of analysis-common.
Create tools folder in analysis and set mecab-tools module in there. analysis/tools ─ analysis-common-tools (to-be) └ icu-tools (to-be) └ mecab-tools └ ... The problem with this is that the number of modules increases a lot because each tool is created as a module.
Create a module called mecab we can create a mecab module that is the starting point for merging nori and kuromoji. If we proceed in this direction, we will only have tools in src.

But this approach may not be easy to create the runnable jar. Because it will include the library. (ex: MecabAnalyzer, MecabTokenizer, ...)

Create a module called mecab-tools It's easy to develop, but there are other library modules in analysis. So something seems strange because it's only runnable-jar.

Number 2 seems to be the best, but I'm not sure yet. I would appreciate any comments.

I will go ahead if direction is set, but landing will be delayed a little. The reason is that the build system is going to change. (SOLR-13452) But if it does not matter, I will proceed.

[Legacy Jira: Namgyu Kim (@danmuzi) on Jun 07 2019]

mikemccand commented 5 years ago

For me, it looks like a good starting point to create a directory analysis/mecab and place mecab-tools module (the option 4) under that.

We are already considering further integration of kuromoij and nori (at LUCENE-8816 and LUCENE-8812), and I suppose it would happen sooner or later. So how about looking at some grand design from here. For example:

analysis
└── mecab
         ├── common (module: analyzers-mecab-common)
         │       ├── build.xml
         │       └── src
         ├── kuromoji (module: analyzers-mecab-kuromoji)
         │       ├── build.xml
         │       └── src
         ├── nori (module: analyzers-mecab-nori)
         │       ├── build.xml
         │       └── src
         └── tools  (module: analyzers-mecab-tools)
                 ├── build.xml
                 └── src

On this issue, only "mecab-tools" module will be added (and the dependency on that should be added to current kuromoji and nori). That's just an idea and I am not an expert about the shadow maven poms. @rmuir, @jimczi and @cm may have different thoughts.

About the option 2, I don't think that's a good idea to change other modules' current structure (analysis-common or icu), opinions?

I will go ahead if direction is set, but landing will be delayed a little. The reason is that the build system is going to change. (SOLR-13452) But if it does not matter, I will proceed.

We need to take care the ongoing change in build infrastructure, but I think it would not be a very big concern that stops the work here (and LUCENE-8816) :) After pushing the commits to the master, you would be able to backport the changes to the gradle branch (I think Mark Miller and others will give you help or advice for the work).

[Legacy Jira: Tomoko Uchida (@mocobeta) on Jun 09 2019]

mikemccand commented 5 years ago

Thanks, @tomoko. I don't think we should any "mecab" in the naming. Please let me elaborate a bit.

Kuromoji can read MeCab format models, but Kuromoji isn't a port of MeCab. Kuromoji has been developed independently without inspecting or reviewing any MeCab source code. This was an initial goal of the project to make sure we could use an Apache License.

The MeCab and Kuromoji feature sets are quite different and I think users will find it confusing if they expect MeCab and find that Kuromoji is much more limited.

I'm also unsure if Kudo-san will appreciate that we make an association by name like this. It certainly doesn't give due credit to MeCab, in my opinion, which is a much more extensive project.

In terms of naming, what about using "statistical" instead of "mecab" for this class of analyzers?

I'm thinking "Viterbi" could be good to refer to in shared tokenizer code.

This said, I think it could be a good to refer to "mecab" in the dictionary compiler code, documentation, etc. to make sure users understand that we can read this model format.

Any thoughts?

[Legacy Jira: Christian Moen on Jun 10 2019]

mikemccand commented 5 years ago

Hi @cm,

thanks for your comment! I used the term "mecab" without any deep thought in my previous comment. I respect your perspective and agree with that we should not use "mecab" in naming, except for the code which handles the dictionary "MeCab IPADIC".

I also like the idea to use "viterbi" for shared tokenizer code. Meanwhile, "statistical" sounds a little bit too general to me for describing the analyzers' functionality. I just thought about using "morphologic" or "morph" in the module name instead of "mecab", but there is already "morfologik" module so it would be confusing...

There is another idea: how about using "kuromoji" in the top level module name for both of Japanese and Korean analyzers, and changing current module names "kuromoji" and "nori" to "kuromoji-ja" and "kuromoij-ko"? They are just module names for internal use and not used in any exposed package or class or method names (as far as I know). And they are not used in user configuration files (as far as I know).

In order to clarify, my proposal would be changed like this. (I also changed "tools" to "dict-tools" for clarification.)

analysis
└── kuromoji
         ├── common (module: analyzers-kuromoji-common)
         │       ├── build.xml
         │       └── src
         ├── ja (module: analyzers-kuromoji-ja)
         │       ├── build.xml
         │       └── src
         ├── ko (module: analyzers-kuromoji-ko)
         │       ├── build.xml
         │       └── src
         └── dict-tools  (module: analyzers-kuromoji-dict-tools)
                 ├── build.xml
                 └── src

It looks natural to me, if we pursue the integration of the two analyzers. Does the change sound too aggressive (especially for Korean analyzer users)? I'd love to hear comments from others. :)

[Legacy Jira: Tomoko Uchida (@mocobeta) on Jun 10 2019]

mikemccand commented 5 years ago

Sorry, Elasticseach analysis plugins heavily use "kuromoji" or "nori" in their naming so the change I wrote affects Elasticsearch users. (I feel like that it should use "japanese" or "korean" instead of "kuromoji" or "nori"...)

[Legacy Jira: Tomoko Uchida (@mocobeta) on Jun 10 2019]

mikemccand commented 5 years ago

Thank you for your replies. @mocobeta and @cm :D

I was surprised at your deep thoughts.

analysis
└── ???
         ├── common (module: analyzers-???-common)
         │       ├── build.xml
         │       └── src
         ├── kuromoji (module: analyzers-???-kuromoji)
         │       ├── build.xml
         │       └── src
         ├── nori (module: analyzers-???-nori)
         │       ├── build.xml
         │       └── src
         └── tools  (module: analyzers-???-tools)
                 ├── build.xml
                 └── src

I agree with the module structure proposed by Tomoko. In my personal opinion, "analysis" is better than "analyzers".

In terms of naming, what about using "statistical" instead of "mecab" for this class of analyzers? I'm thinking "Viterbi" could be good to refer to in shared tokenizer code. This said, I think it could be a good to refer to "mecab" in the dictionary compiler code, documentation, etc. to make sure users understand that we can read this model format. Any thoughts?

About the name, the folder name "viterbi" looks much better than "statistical". But to be perfectly honest, I'm not sure that it's really right to use the algorithm name as the folder name. Most users probably don't know what viterbi is. It is also associated with the package name, and "org.apache.lucene.analysis.viterbi.ja" or "~.viterbi.ko" will confuse users. Or just use "org.apache.lucene.analysis.ja", it could be fine. It's because analysis-common is already doing like it. (not org.apache.lucene.common.cjk) It doesn't matter if we use it for administrative purposes, but I also want to hear some opinions from others.

how about using "kuromoji" in the top level module name for both of Japanese and Korean analyzers, and changing current module names "kuromoji" and "nori" to "kuromoji-ja" and "kuromoij-ko"?

I personally don't agree to use kuromoji-ko instead of nori. nori is already a familiar name to users. They may be confused about it.

[Legacy Jira: Namgyu Kim (@danmuzi) on Jun 10 2019]

mikemccand commented 5 years ago

Let me make a small correction:

About the name, the folder name "viterbi" looks much better than "statistical".

I don't think "viterbi" is suitable for directory or module or package name, it could be used in shared tokenizer code (as @cm said).

[Legacy Jira: Tomoko Uchida (@mocobeta) on Jun 10 2019]

mikemccand commented 5 years ago

Oh. I read it wrong. Please ignore that part. Thank you for checking. @tomoko

[Legacy Jira: Namgyu Kim (@danmuzi) on Jun 10 2019]