Closed mikemccand closed 6 years ago
Created a PR. It is based on Kuromoji's examples.
https://github.com/apache/lucene-solr/pull/434
Note: I've tested all parameters in these example schemas with CustomAnalyzer, but have not tested them with Solr yet. Please check the XML settings with Solr.
[Legacy Jira: Tomoko Uchida (@mocobeta) on Aug 10 2018]
Also, I think it would be better if native Korean speakers checked that the example values are good defaults :)
[Legacy Jira: Tomoko Uchida (@mocobeta) on Aug 10 2018]
I added the solr schema fragment to the solr issue. Works for me: SOLR-12655
Your example is missing lowercasing (which the analyzer applies); it is needed so that Western text is correctly normalized.
[Legacy Jira: Uwe Schindler (@uschindler) on Aug 10 2018]
The full schema snippet that is identical to the default KoreanAnalyzer as shipped in Lucene:
<fieldType name="text_ko" class="solr.TextField">
  <analyzer>
    <!-- decompoundMode: mixed (keeps the original term and adds all decompounded terms), discard (default; removes the compound form and keeps only the parts), none (no decompounding) -->
    <tokenizer class="solr.KoreanTokenizerFactory" decompoundMode="discard" outputUnknownUnigrams="false"/>
    <!-- removes some parts of speech, such as EOMI (Pos.E) -->
    <filter class="solr.KoreanPartOfSpeechStopFilterFactory"/>
    <!-- replaces term text with the Hangul transcription of Hanja characters, if applicable -->
    <filter class="solr.KoreanReadingFormFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
[Legacy Jira: Uwe Schindler (@uschindler) on Aug 10 2018]
KoreanAnalyzer discards some parameters (for example, KoreanTokenizerFactory has the additional parameters "userDictionary" and "userDictionaryEncoding"). I think Javadoc examples should include all available parameters, so my example settings include every parameter accepted by the TokenizerFactory/TokenFilterFactory classes.
About LowerCaseFilterFactory: of course it is needed in complete Analyzer settings, but I "feel" a Javadoc example should focus on the targeted component only (like the Kuromoji example settings below).
[Legacy Jira: Tomoko Uchida (@mocobeta) on Aug 10 2018]
So here is my proposal for the Javadoc example settings (my pull request) :)
For KoreanTokenizerFactory:
<fieldType name="text_ko" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KoreanTokenizerFactory"
               decompoundMode="discard"
               userDictionary="user.txt"
               userDictionaryEncoding="UTF-8"
               outputUnknownUnigrams="false"
    />
  </analyzer>
</fieldType>
For KoreanPartOfSpeechStopFilterFactory:
<fieldType name="text_ko" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KoreanTokenizerFactory"/>
    <filter class="solr.KoreanPartOfSpeechStopFilterFactory"
            tags="E,J"/>
  </analyzer>
</fieldType>
For KoreanReadingFormFilterFactory:
<fieldType name="text_ko" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KoreanTokenizerFactory"/>
    <filter class="solr.KoreanReadingFormFilterFactory"/>
  </analyzer>
</fieldType>
Update: Added brief descriptions for each parameter (please see the pull request), though unfortunately Kuromoji's documentation lacks those.
[Legacy Jira: Tomoko Uchida (@mocobeta) on Aug 10 2018 [updated: Aug 11 2018]]
Slightly off topic, feel free to ignore, but I think Solr example settings should be removed from the TokenizerFactory/TokenFilterFactory/CharFilterFactory documentation. I suppose there are historical reasons, so I followed the convention, but it is not reasonable to add Solr schema examples here. What each Factory's documentation needs is parameter descriptions, not XML schema examples.
[Legacy Jira: Tomoko Uchida (@mocobeta) on Aug 11 2018]
I've tested those three settings with Solr 7.4.0; they work for me. (I copied lucene-analyzers-nori-7.4.0.jar and the user dictionary file from the Lucene distribution package to the Solr lib directory.)
[Legacy Jira: Tomoko Uchida (@mocobeta) on Aug 11 2018]
I think this pull request is almost ready to merge. Could anyone take care of this? I believe documentation for analyzer components is very important and a good starting point for newbies. :)
[Legacy Jira: Tomoko Uchida (@mocobeta) on Aug 11 2018]
+1. I will merge it soon!
Slightly off topic, feel free to ignore, but I think Solr example settings should be removed from the TokenizerFactory/TokenFilterFactory/CharFilterFactory documentation. I suppose there are historical reasons, so I followed the convention, but it is not reasonable to add Solr schema examples here. What each Factory's documentation needs is parameter descriptions, not XML schema examples.
There is an issue open already (I think; I can't find it now). I agree, the XML snippets should go away. Instead we can add a Javadoc tag for this, like @factoryProp name description. This is much better. We should also document the SPI name of each factory.
[Legacy Jira: Uwe Schindler (@uschindler) on Aug 11 2018]
Thank you @thetaphi!
And thanks for your explanation.
There is an issue open already (I think; I can't find it now). I agree, the XML snippets should go away. Instead we can add a Javadoc tag for this, like @factoryProp name description. This is much better. We should also document the SPI name of each factory.
[Legacy Jira: Tomoko Uchida (@mocobeta) on Aug 11 2018]
No problem. I merged already. Just running document-linter to verify correctness of Javadocs.
[Legacy Jira: Uwe Schindler (@uschindler) on Aug 11 2018]
Another idea: To make the properties of all analyzers easily available for inspection by the APIs in Solr, we may add runtime annotations to those classes, describing the properties. Just an idea.
[Legacy Jira: Uwe Schindler (@uschindler) on Aug 11 2018]
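To make the runtime-annotation idea above more concrete, here is a minimal sketch in plain Java. Everything in it is hypothetical: the `@FactoryProp` annotation, its `FactoryProps` container, the `ExampleTokenizerFactory` stand-in class, and the `describe` helper are illustration-only names and are not part of Lucene or Solr. The property names are the ones accepted by KoreanTokenizerFactory as discussed in this issue; the descriptions are paraphrased.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Repeatable;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.LinkedHashMap;
import java.util.Map;

class FactoryPropsSketch {

  /** Hypothetical annotation describing one configuration property of a factory. */
  @Retention(RetentionPolicy.RUNTIME) // must be RUNTIME so reflection can see it
  @Target(ElementType.TYPE)
  @Repeatable(FactoryProps.class)
  @interface FactoryProp {
    String name();
    String description();
  }

  /** Container annotation required by @Repeatable (also hypothetical). */
  @Retention(RetentionPolicy.RUNTIME)
  @Target(ElementType.TYPE)
  @interface FactoryProps {
    FactoryProp[] value();
  }

  /** Stand-in class for illustration; NOT the real KoreanTokenizerFactory. */
  @FactoryProp(name = "decompoundMode", description = "mixed, discard (default) or none")
  @FactoryProp(name = "userDictionary", description = "user dictionary file")
  @FactoryProp(name = "userDictionaryEncoding", description = "encoding of the user dictionary")
  @FactoryProp(name = "outputUnknownUnigrams", description = "whether to output unigrams for unknown words")
  static class ExampleTokenizerFactory {}

  /** Collects property name -> description for an annotated class, in declaration order. */
  static Map<String, String> describe(Class<?> factoryClass) {
    Map<String, String> props = new LinkedHashMap<>();
    for (FactoryProp p : factoryClass.getAnnotationsByType(FactoryProp.class)) {
      props.put(p.name(), p.description());
    }
    return props;
  }

  public static void main(String[] args) {
    // Print every documented property of the stand-in factory.
    describe(ExampleTokenizerFactory.class)
        .forEach((name, description) -> System.out.println(name + ": " + description));
  }
}
```

With something along these lines, documentation tooling or a Solr API could discover the supported properties by reflection instead of relying on hand-written XML snippets. Whether annotations, a Javadoc tag, or some other mechanism fits best is left open in this issue.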
Commit e9addea0871a28517c5202e9d12969719d20c90e in lucene-solr's branch refs/heads/master from @thetaphi
https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=e9addea
Merge branch 'jira/lucene-8453' of https://github.com/mocobeta/lucene-solr-mirror LUCENE-8453: Add documentation to analysis factories of Korean (Nori) analyzer module This closes #434
[Legacy Jira: ASF subversion and git services on Aug 11 2018]
Commit d8ecf976124eb519e1f8c66e6749e246976a95d9 in lucene-solr's branch refs/heads/branch_7x from @thetaphi
https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=d8ecf97
Merge branch 'jira/lucene-8453' of https://github.com/mocobeta/lucene-solr-mirror LUCENE-8453: Add documentation to analysis factories of Korean (Nori) analyzer module This closes #434
[Legacy Jira: ASF subversion and git services on Aug 11 2018]
Thanks [~Tomoko Uchida]!
[Legacy Jira: Uwe Schindler (@uschindler) on Aug 11 2018]
It may not be good manners to add comments to a closed issue, but I'd like to leave a reminder for myself.
Another idea: To make the properties of all analyzers easily available for inspection by the APIs in Solr, we may add runtime annotations to those classes, describing the properties. Just an idea.
I like the idea. It would be nice if the {Tokenizer|CharFilter|TokenFilter}Factory classes were equipped with some kind of property management/discovery mechanism (I have no concrete implementation in mind, just a vague concept).
It would be handy for documentation and Solr, and also for CustomAnalyzer (I sometimes use it for my NLP projects).
I'll try it, though not soon; only after I have finished my current ongoing projects.
[Legacy Jira: Tomoko Uchida (@mocobeta) on Aug 12 2018]
Korean analyzer (nori) javadoc needs example schema settings.
I'll create a patch.
Legacy Jira details
LUCENE-8453 by Tomoko Uchida (@mocobeta) on Aug 10 2018, resolved Aug 11 2018 Linked issues: