mikemccand / stargazers-migration-test

Testing Lucene's Jira -> GitHub issues migration

Build kuromoji system dictionary as a separated jar and load it from JapaneseTokenizer at runtime [LUCENE-8869] #866

Open mikemccand opened 5 years ago

mikemccand commented 5 years ago

This is a sub-task of LUCENE-8816. In this issue, I will try to make small but self-contained changes to the kuromoji system dictionary.

Also, some refactoring of the directory/source tree structure may be needed.


Legacy Jira details

LUCENE-8869 by Tomoko Uchida (@mocobeta) on Jun 19 2019, updated Jun 23 2019 Linked issues:

mikemccand commented 5 years ago

As a first step, I moved the dictionary data (dat files) to a separate jar on my local branch. https://github.com/mocobeta/lucene-solr-mirror/commit/9def2b22f4e7467bef72edfac84c9f74f67289aa

In order to build and ship two jars (one for kuromoji analyzer, one for the system dictionary), I slightly changed the directory structure:

analysis/kuromoji/
├── build.xml
├── ivy.xml
├── src
│     ├── java
│     │     ├── org
│     │     └── overview.html
│     ├── resources
│     │     ├── META-INF
│     │     └── org
│     ├── test
│     │     └── org
│     └── tools
│           ├── java
│           ├── patches
│           └── test
└── sysdic
        └── src
              └── resources

Here, a sysdic directory is added, and the build-dict task now places all dat files in sysdic/src/resources instead of src/resources.

On the JapaneseTokenizer side, all dictionary data is currently held in static singleton fields. We need to make it possible to load the dictionary data flexibly from a jar or a directory path (for testing purposes) when initializing a tokenizer, so that users can choose an arbitrary dictionary at runtime.
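
To illustrate the kind of loading described above, here is a minimal sketch of a resource loader that resolves a dictionary file either from the classpath (i.e. a dictionary jar) or from a directory path. The class name DictionaryResourceLoader and its methods are hypothetical and are not part of Lucene's API or of the linked commit; how the tokenizer would actually consume the streams is left open.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

/**
 * Hypothetical helper (not a Lucene class): opens dictionary resources
 * either from the classpath or from a plain directory.
 */
final class DictionaryResourceLoader {
  // Exactly one of these two fields is non-null.
  private final Path directory;
  private final ClassLoader classLoader;

  private DictionaryResourceLoader(Path directory, ClassLoader classLoader) {
    this.directory = directory;
    this.classLoader = classLoader;
  }

  /** Resolve resources from jars visible to the given class loader. */
  static DictionaryResourceLoader fromClasspath(ClassLoader classLoader) {
    if (classLoader == null) {
      throw new IllegalArgumentException("classLoader must not be null");
    }
    return new DictionaryResourceLoader(null, classLoader);
  }

  /** Resolve resources from a directory, e.g. a build output dir in tests. */
  static DictionaryResourceLoader fromDirectory(Path directory) {
    if (directory == null) {
      throw new IllegalArgumentException("directory must not be null");
    }
    return new DictionaryResourceLoader(directory, null);
  }

  /** Opens the named resource; the caller is responsible for closing it. */
  InputStream open(String name) throws IOException {
    if (directory != null) {
      return Files.newInputStream(directory.resolve(name));
    }
    InputStream in = classLoader.getResourceAsStream(name);
    if (in == null) {
      throw new IOException("Dictionary resource not found on classpath: " + name);
    }
    return in;
  }
}
```

With something along these lines, a tokenizer could accept a loader at construction time instead of reading from static singleton fields, and tests could point the loader at a directory such as sysdic/src/resources.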

[Legacy Jira: Tomoko Uchida (@mocobeta) on Jun 23 2019]

mikemccand commented 5 years ago

@tomoko there might be some minor conflicts with LUCENE-8871, since it also touches the code that reads the resources, but they should be easy to resolve, I think?

[Legacy Jira: Michael Sokolov (@msokolov) on Jun 23 2019]

mikemccand commented 5 years ago

@sokolov thanks for letting me know. There may be minor conflicts, but yes, they should be easy to resolve. (It seems you are almost done, so I will pick up the changes from master.)

[Legacy Jira: Tomoko Uchida (@mocobeta) on Jun 23 2019]