Open mikemccand opened 5 years ago
As a first step, I moved the dictionary data (dat files) to a separate jar on my local branch. https://github.com/mocobeta/lucene-solr-mirror/commit/9def2b22f4e7467bef72edfac84c9f74f67289aa
In order to build and ship two jars (one for kuromoji analyzer, one for the system dictionary), I slightly changed the directory structure:
analysis/kuromoji/
├── build.xml
├── ivy.xml
├── src
│   ├── java
│   │   ├── org
│   │   └── overview.html
│   ├── resources
│   │   ├── META-INF
│   │   └── org
│   ├── test
│   │   └── org
│   └── tools
│       ├── java
│       ├── patches
│       └── test
└── sysdic
    └── src
        └── resources
Here, a sysdic directory is added, and the build-dict task now places all dat files in sysdic/src/resources instead of src/resources.
On the JapaneseTokenizer side, all dictionary data is currently held in static singleton fields. We need to make it possible to load the dictionary data flexibly from a jar or from a directory path (for testing purposes) when initializing a tokenizer, so that users can choose an arbitrary dictionary at runtime.
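To illustrate the kind of loading described above, here is a minimal sketch of a resolver that opens a dictionary file either from the classpath (that is, from a jar) or from a directory on disk. The class name DictionarySource, its factory methods, and the file name used in the usage note are hypothetical and are not part of Lucene; the actual change may be structured quite differently.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper (not a Lucene class): opens dictionary files from one configured location. */
final class DictionarySource {
  private final Path directory;         // non-null when reading from a directory
  private final String classpathPrefix; // non-null when reading from the classpath (a jar)

  private DictionarySource(Path directory, String classpathPrefix) {
    this.directory = directory;
    this.classpathPrefix = classpathPrefix;
  }

  /** Reads dat files packaged under the given classpath prefix, e.g. inside a dictionary jar. */
  static DictionarySource fromClasspath(String prefix) {
    return new DictionarySource(null, prefix);
  }

  /** Reads dat files from a plain directory, which is convenient in tests. */
  static DictionarySource fromDirectory(Path dir) {
    return new DictionarySource(dir, null);
  }

  /** Opens one dictionary file; the caller is responsible for closing the stream. */
  InputStream open(String fileName) throws IOException {
    if (directory != null) {
      return Files.newInputStream(directory.resolve(fileName));
    }
    String resource = classpathPrefix.isEmpty() ? fileName : classpathPrefix + "/" + fileName;
    InputStream in = DictionarySource.class.getClassLoader().getResourceAsStream(resource);
    if (in == null) {
      throw new FileNotFoundException("dictionary resource not found on classpath: " + resource);
    }
    return in;
  }
}
```

With something like this, a tokenizer could take a DictionarySource at construction time instead of reading static singleton fields; a test could call DictionarySource.fromDirectory(tmpDir).open("sample.dat"), where sample.dat is just a placeholder file name.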
[Legacy Jira: Tomoko Uchida (@mocobeta) on Jun 23 2019]
@tomoko there might be some minor conflicts with LUCENE-8871, since it also touches the code that reads the resources, but they should be easy to resolve, I think?
[Legacy Jira: Michael Sokolov (@msokolov) on Jun 23 2019]
@sokolov thanks for notifying, there may be minor conflicts but yes, they would be easily resolved. (Seems you are almost done so I will pick the changes from the master.)
[Legacy Jira: Tomoko Uchida (@mocobeta) on Jun 23 2019]
This is a sub-task for LUCENE-8816. In this issue, I will try to make small but self-contained changes to the kuromoji system dictionary.
build-dict task. Also, some refactoring of the directory/source tree structure may be needed.
Legacy Jira details
LUCENE-8869 by Tomoko Uchida (@mocobeta) on Jun 19 2019, updated Jun 23 2019 Linked issues: