
Nori, a Korean analyzer based on mecab-ko-dic [LUCENE-8231] #231

Closed mikemccand closed 6 years ago

mikemccand commented 6 years ago

There is a dictionary similar to IPADIC but for Korean, called mecab-ko-dic. It is available under an Apache license here: https://bitbucket.org/eunjeon/mecab-ko-dic

This dictionary was built with MeCab and defines a feature format adapted for the Korean language. Since the Kuromoji tokenizer uses the same format for the morphological analysis (left cost + right cost + word cost), I tried to adapt the module to handle Korean with the mecab-ko-dic. I've started with a POC that copies the Kuromoji module and adapts it for the mecab-ko-dic. I used the same classes to build and read the dictionary, but I had to make some modifications to handle the differences with the IPADIC and Japanese. The resulting binary dictionary takes 28MB on disk; it's bigger than the IPADIC, mainly because the source is bigger and there are a lot of compound and inflect terms that define a group of terms and the segmentation that can be applied.

I attached the patch that contains this new Korean module, called godori (later renamed to nori, see below). It is an adaptation of the Kuromoji module, so currently the two modules don't share any code. I wanted to validate the approach first and check the relevancy of the results. I don't speak Korean, so I used the relevancy tests that were added for another Korean tokenizer (https://issues.apache.org/jira/browse/LUCENE-4956) and tested the output against mecab-ko, which is the official fork of MeCab for use with the mecab-ko-dic.

I had to simplify the JapaneseTokenizer: my version removes the nBest output and the decomposition of too-long tokens. I also modified the handling of whitespace since it is important in Korean. Whitespace that appears before a term is attached to that term, and this information is used to compute a penalty based on the part of speech of the token. The penalty cost is a feature added to mecab-ko to handle morphemes that should not appear after a morpheme, and it is described on the mecab-ko page: https://bitbucket.org/eunjeon/mecab-ko. Ignoring whitespace is also more in line with the official MeCab library, which attaches whitespace to the term that follows.

I also added a decompound filter that expands the compounds and inflects defined in the dictionary, and a part-of-speech filter, similar to the Japanese one, that removes morphemes that are not useful for relevance (suffix, prefix, interjection, ...). These filters don't play well with the tokenizer if it can output multiple paths (nBest output for instance), so for simplicity I removed this ability and the Korean tokenizer only outputs the best path.

I compared the result with mecab-ko to confirm that the analyzer is working, and ran the relevancy test defined in HantecRel.java, included in the patch (written by Robert for another Korean analyzer).
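As a rough sketch of the MeCab-style cost model mentioned above (the class and method names here are hypothetical, not the patch's actual API), the tokenizer keeps the lattice path with the lowest total cost:

    // total cost of one candidate segmentation: each token contributes its word
    // cost plus a connection cost between the previous token's right id and the
    // current token's left id, looked up in the connection-cost matrix
    static int pathCost(List<Token> path, ConnectionCosts costs) {
      int total = 0;
      int prevRightId = 0;                                  // BOS context id
      for (Token token : path) {
        total += costs.get(prevRightId, token.getLeftId()); // transition cost
        total += token.getWordCost();                       // per-token cost
        prevRightId = token.getRightId();
      }
      return total + costs.get(prevRightId, 0);             // transition to EOS
    }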

Here are the results:

Analyzer   Index Time   Index Size   MAP(CLASSIC)   MAP(BM25)   MAP(GL2)
Standard   35s          131MB        .007           .1044      .1053
CJK        36s          164MB        .1418          .1924      .1916
Korean     212s         90MB         .1628          .2094      .2078

I find the results very promising, so I plan to continue to work on this project. I started to extract the parts of the code that could be shared with the Kuromoji module, but I wanted to share the status and this POC first to confirm that the approach is viable. The advantages of using the same model as the Japanese analyzer are multiple: we don't have a Korean analyzer at the moment ;), the resulting dictionary is small compared to other libraries that use the mecab-ko-dic (the FST takes only 5.4MB), and the Tokenizer prunes the lattice on the fly to select the best path efficiently.

The dictionary can be built directly from the godori module with ant regenerate (you need to create the resource directory first with mkdir lucene/analysis/godori/src/resources/org/apache/lucene/analysis/ko/dict, since the dictionary is not included in the patch). I've also added some minimal tests in the module to play with the analysis.


Legacy Jira details

LUCENE-8231 by Jim Ferenczi (@jimczi) on Mar 28 2018, resolved Apr 13 2018
Attachments: LUCENE-8231.patch (versions: 11), LUCENE-8231-remap-hangul.patch

mikemccand commented 6 years ago

I really like the approach here, thanks Jim!

In the kuromoji case there is a lot of japanese-specific compression that Uwe and I did. If you are worried about size/ram we can try to shrink this korean data in ways that make sense for it. That can really be a followup/polish/nice-to-have: how big is the built JAR now? Something semi-reasonable?

I'll try to at least build your proof of concept and poke around. Funny to see HantecRel.java in the patch, it looked somehow familiar :)

[Legacy Jira: Robert Muir (@rmuir) on Mar 29 2018]

mikemccand commented 6 years ago

I was able to build and run tests with your instructions.

Answering my own question, here is the built size:

    -rw-rw-r-- 1 rmuir rmuir 9063332 Mar 28 20:27 lucene-analyzers-godori-8.0.0-SNAPSHOT.jar

Here are uncompressed sizes and gzip ratios:

    -rw-rw-r-- 1 rmuir rmuir    65564 Mar 28 20:26 CharacterDefinition.dat
    -rw-rw-r-- 1 rmuir rmuir 11178837 Mar 28 20:26 ConnectionCosts.dat
    -rw-rw-r-- 1 rmuir rmuir 11408895 Mar 28 20:26 TokenInfoDictionary$buffer.dat
    -rw-rw-r-- 1 rmuir rmuir  5640925 Mar 28 20:26 TokenInfoDictionary$fst.dat
    -rw-rw-r-- 1 rmuir rmuir   811783 Mar 28 20:26 TokenInfoDictionary$targetMap.dat
    -rw-rw-r-- 1 rmuir rmuir      129 Mar 28 20:26 UnknownDictionary$buffer.dat
    -rw-rw-r-- 1 rmuir rmuir       36 Mar 28 20:26 UnknownDictionary$targetMap.dat

    adding: CharacterDefinition.dat (deflated 99%)
    adding: ConnectionCosts.dat (deflated 89%)
    adding: TokenInfoDictionary$buffer.dat (deflated 71%)
    adding: TokenInfoDictionary$fst.dat (deflated 25%)
    adding: TokenInfoDictionary$targetMap.dat (deflated 67%)
    adding: UnknownDictionary$buffer.dat (deflated 44%)
    adding: UnknownDictionary$targetMap.dat (deflated 25%)

I'll take a peek and try to get familiar with the data, just for kicks. But I think sizes are reasonable already, thanks again for bringing this all together.

[Legacy Jira: Robert Muir (@rmuir) on Mar 29 2018]

mikemccand commented 6 years ago

Should there be a ReadingFormFilter similar to the kuromoji case? e.g. this would yield hanja -> hangul if someone wants that. I remember seeing some of this in that hantec corpus, and I think the other analyzer did this conversion, so you could try it out for kicks.
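A minimal sketch of the shape I mean, assuming the token stream exposes a reading attribute like Kuromoji's (the ReadingAttribute here is an assumption, not the patch's API):

    import java.io.IOException;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    // replace each term with its reading (hanja -> hangul) when one is available
    public final class KoreanReadingFormFilter extends TokenFilter {
      private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
      private final ReadingAttribute readingAtt = addAttribute(ReadingAttribute.class); // assumed

      public KoreanReadingFormFilter(TokenStream input) {
        super(input);
      }

      @Override
      public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
          return false;
        }
        String reading = readingAtt.getReading();  // null when there is no reading
        if (reading != null) {
          termAtt.setEmpty().append(reading);
        }
        return true;
      }
    }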

[Legacy Jira: Robert Muir (@rmuir) on Mar 29 2018]

mikemccand commented 6 years ago

Thanks for looking, Robert!

Should there be a ReadingFormFilter similar to the kuromoji case?

I attached a new patch that adds this filter; the readings were already in the binary dictionary, so this does not change the size of the jar.

In the kuromoji case there is a lot of japanese-specific compression that Uwe and I did. If you are worried about size/ram we can try to shrink this korean data in ways that make sense for it. That can really be a followup/polish/nice-to-have: how big is the built JAR now? Something semi-reasonable?


I think the size is reasonable especially if you compare with other libraries that use the mecab-ko-dic ;).

There is still some room for improvement, though. I did not add the semantic class of the token, but we could do the same as for the Japanese dictionary, where the POS tags are added in a separate file. The semantic class + POS is unique per leftId, so this could also save 1 byte in the binary dictionary (we use 1 byte per POS per term in the main dictionary).

The entries that contain the decompounds can also be compressed. For compound nouns I serialize the segmentations with the term, but we could just use offsets into the surface form. It doesn't work for Inflects, which can add tokens or use a different form. To be honest I don't know how we can derive the offsets for the decompound of Inflects; I don't think there is an easy way to do that, but I could be completely wrong.
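To illustrate the offset idea with a compound that shows up later in this thread (가락지나물 = 가락지 + 나물), the segmentation of a compound noun could be stored as split points into the surface form instead of repeating the part strings (a hypothetical encoding):

    // parts recovered from split offsets rather than from stored strings
    String surface = "가락지나물";
    int[] splits = {3};                      // one split point => two parts
    String part0 = surface.substring(0, 3);  // 가락지
    String part1 = surface.substring(3);     // 나물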


In the patch I attached, the user dictionary is broken. I copied the one from Kuromoji, but we should probably change it to accept simple nouns (NNG or NNP) when there is no segmentation, and use the PREANALYSIS type to add custom segmentations (or COMPOUND for nouns only).


I talked to my Korean colleagues and they told me that godori has a negative meaning in Korea. It is linked with illegal gambling, and it also has an ancient meaning of "killed by king's order", which is bad :(. This is what happens when you pick a name without knowing the culture, so I apologize for this. I changed the name to "nori", which is the proposal they made; it means joy/play. This is a very generic name; in Japanese it means the seaweed used to wrap sushi and onigiri, which I find nice because it's also a reference to the Japanese analyzer that was used as a root for this.


[Legacy Jira: Jim Ferenczi (@jimczi) on Mar 29 2018]

mikemccand commented 6 years ago

The entries that contain the decompounds can also be compressed. For compound nouns I serialize the segmentations with the term, but we could just use offsets into the surface form. It doesn't work for Inflects, which can add tokens or use a different form. To be honest I don't know how we can derive the offsets for the decompound of Inflects; I don't think there is an easy way to do that, but I could be completely wrong.

I would really do this; I think it will help with the common case. I think you should just add a way to "escape" and encode the inflected verb/adjective forms "literally", just like today, as a corner case. They are like 5% of the total dictionary, don't bother optimizing them.

[Legacy Jira: Robert Muir (@rmuir) on Mar 29 2018]

mikemccand commented 6 years ago

How about sharing code between the 2 modules? I had no time to look into this closely.

It's a long time ago, but I remember that Robert and I did a lot to shrink the files. Yes, there was a lot of "compression" involved, but mostly it was thinking of better data structures. We should maybe review this here, too. I may have some time over Easter vacation, so we can maybe have a "hackathon".

[Legacy Jira: Uwe Schindler (@uschindler) on Mar 29 2018]

mikemccand commented 6 years ago

I think it's ok if they don't share code initially. We can try to maybe factor out a "kuromoji engine" or something into an analyzers-common class, but it seems like a lot of work.

Yes, for the compression it just means doing a lot of experimentation. For example, in this case I am not sure about the whole FST representation. Root arc caching of all syllables is heavy there, that's a big range, and very different from a small alphabet. What is the performance of this analyzer if this caching is disabled?

And it hints at another possibility: maybe syllables should not be encoded into the FST? You could try just encoding the decomposed jamo form instead. Then in general you are working with a small alphabet in the FST: it would make the traversal cost 3x, but would allow efficient root arc caching and prevent huge slow binary searches over many syllables. Basically a larger but "easier" FST.

[Legacy Jira: Robert Muir (@rmuir) on Mar 29 2018]

mikemccand commented 6 years ago

Just in case it's unclear in the above, I'm saying input of 한 could go through the FST as: ᄒ + ᅡ + ᆫ. But you can then swap the jamo range with the latin-1 range so that it's just 3 bytes instead of 6 with a UTF-8 encoding. Don't waste time with the hanja as it's presumably rare, which matches UTF-8 expectations. This way you'd be encoding actual "characters" instead of syllables and the FST is bigger but has a nicer structure. I feel like I did this on the other korean analyzer, but it's been a long time and I can't remember. There may even be code there in its branch.

[Legacy Jira: Robert Muir (@rmuir) on Mar 29 2018]

mikemccand commented 6 years ago

Root arc caching of all syllables is heavy there, that's a big range,

I think that root cache is already restricted to a max. of 0x80 elements.

https://github.com/apache/lucene-solr/blob/ca22f17662c9b79ada1f90fb200f76d9a58c0e75/lucene/core/src/java/org/apache/lucene/util/fst/FST.java#L384-L390

[Legacy Jira: Dawid Weiss (@dweiss) on Mar 29 2018]

mikemccand commented 6 years ago

And looking more, you'd need the full byte range to do that. So a BYTE1 FST with raw bytes (no UTF-8), but it's doable. The root cache would just be 256 entries. Maybe you just have a separate BYTE2 FST for the other stuff such as hanja forms, but I think overall the performance may be faster. The decomposition/recomposition is not so bad from what I remember, it's some simple math on the unicode codepoint numbers.
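That math is the standard Unicode Hangul decomposition; a sketch:

    // decompose a precomposed Hangul syllable (U+AC00..U+D7A3) into its jamo,
    // e.g. 한 (U+D55C) -> ᄒ (U+1112) + ᅡ (U+1161) + ᆫ (U+11AB)
    static final int S_BASE = 0xAC00, L_BASE = 0x1100, V_BASE = 0x1161, T_BASE = 0x11A7;
    static final int V_COUNT = 21, T_COUNT = 28, N_COUNT = V_COUNT * T_COUNT; // 588

    static int[] decompose(int syllable) {
      int sIndex = syllable - S_BASE;
      int lead   = L_BASE + sIndex / N_COUNT;              // choseong
      int vowel  = V_BASE + (sIndex % N_COUNT) / T_COUNT;  // jungseong
      int trail  = sIndex % T_COUNT;                       // jongseong index, 0 if absent
      return trail == 0
          ? new int[] { lead, vowel }
          : new int[] { lead, vowel, T_BASE + trail };
    }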

[Legacy Jira: Robert Muir (@rmuir) on Mar 29 2018]

mikemccand commented 6 years ago

I think that root cache is already restricted to a max. of 0x80 elements.

https://github.com/apache/lucene-solr/blob/ca22f17662c9b79ada1f90fb200f76d9a58c0e75/lucene/core/src/java/org/apache/lucene/util/fst/FST.java#L384-L390

Dawid, the code here has its "own" cache, but it's huge and caches whole precomposed syllables:

    // cache one root arc per precomposed Hangul syllable, starting at U+AC00
    for (int i = 0; i < rootCache.length; i++) {
      if (fst.findTargetArc(0xAC00 + i, firstArc, arc, fstReader) != null) {
        rootCache[i] = new FST.Arc<Long>().copyFrom(arc);
      }
    }

Kuromoji does a similar trick, but it just caches the kana, which are alphabets.

[Legacy Jira: Robert Muir (@rmuir) on Mar 29 2018]

mikemccand commented 6 years ago

Ah, sorry. I thought it was using the FST directly. When I experimented with large alphabets (full terms) on the first arc, I ended up using an associative array from a symbol to its target's offset – this was faster than binary lookup on the array-expanded root node (but we made huge amounts of those lookups alone, not as part of a full indexing pipeline – the difference may be negligible when embedded in a larger algorithm).

[Legacy Jira: Dawid Weiss (@dweiss) on Mar 29 2018]

mikemccand commented 6 years ago

Yeah, it's just the general case that if you only have 256 possible values for a node, because it's a "character", we have more options to make it fast. Today, because this FST represents syllables instead, the FST is going to have fewer nodes but huge numbers of transitions from each one, so I expect lookups are pretty slow due to that.

[Legacy Jira: Robert Muir (@rmuir) on Mar 29 2018]

mikemccand commented 6 years ago

And looking more, you'd need the full byte range to do that. So a BYTE1 FST with raw bytes (no UTF-8), but it's doable. The root cache would just be 256 entries. Maybe you just have a separate BYTE2 FST for the other stuff such as hanja forms, but I think overall the performance may be faster. The decomposition/recomposition is not so bad from what I remember, it's some simple math on the unicode codepoint numbers.


That's a good idea, I'll give it a try. FYI, I reindexed the HantecRel corpus without the root arc caching and it took 300s to build instead of 200s, so caching helps, but I agree that caching the full syllable range is too much.

[Legacy Jira: Jim Ferenczi (@jimczi) on Mar 29 2018]

mikemccand commented 6 years ago

Yeah, I think that's a good sign that the huge binary searches in the FST are slow? So it's promising, at least as far as a different representation being faster. There is some limit in FST somewhere to use "array arcs" but I don't remember what it is. Hopefully you'd generally get that optimization and we'd be looking at 3 O(1) lookups instead of 1 O(log n) lookup for each syllable. Unfortunately we won't know if it's the right tradeoff unless we experiment and find out.

[Legacy Jira: Robert Muir (@rmuir) on Mar 29 2018]

mikemccand commented 6 years ago

See the code from that other analyzer for doing these kinds of lookups: https://svn.apache.org/viewvc/lucene/dev/branches/lucene4956/lucene/analysis/arirang/src/java/org/apache/lucene/analysis/ko/dic/HangulDictionary.java?r1=1534141&r2=1534364

The whole diff for that change is here: https://svn.apache.org/viewvc?view=revision&revision=1534364

[Legacy Jira: Robert Muir (@rmuir) on Mar 29 2018]

mikemccand commented 6 years ago

I tried this approach and generated a new FST with the remapped chars. The size of the FST after conversion is 4MB, plus 1MB for the separate Hanja FST, which is roughly the same size as the FST with the hangul syllables and the Hanja together (5.4MB). I also ran the HantecRel indexation and it took approximately 235s to build (I tried multiple times and the times were pretty consistent) with root caching for the 255 first arcs. That's surprising because it's slower than the FST with hangul syllables and root caching (200s), so I wonder if this feature is worth the complexity? I checked the size of the root cache for the 11,171 Hangul syllables and it takes approximately 250k, so that's not bad considering that this version is faster.


I'll try the compression for compounds now.

[Legacy Jira: Jim Ferenczi (@jimczi) on Mar 29 2018]

mikemccand commented 6 years ago

Well, according to my commit years ago it was "smaller and much faster". But I don't remember exactly what the numbers were, only that it was worth the trouble. Maybe something wasn't quite right? I remember it being a big difference for lookup performance. Do you have a patch for your experiment somewhere? I wouldn't mind taking a look to see if it was something silly.

[Legacy Jira: Robert Muir (@rmuir) on Mar 29 2018]

mikemccand commented 6 years ago

Sure, I attached a new patch (LUCENE-8231-remap-hangul.patch) that applies the remap at build and analyze time. I skipped all entries that are not hangul or latin-1 chars to make it easier to test. I must have missed something, so thanks for testing!


[Legacy Jira: Jim Ferenczi (@jimczi) on Mar 29 2018]

mikemccand commented 6 years ago

Thanks for uploading the patch! I will dig into this some, hopefully it doesn't turn out to be a failed experiment.

[Legacy Jira: Robert Muir (@rmuir) on Mar 30 2018]

mikemccand commented 6 years ago

I attached a new patch that adds better compression for the compounds (saves 1MB on the total size) and fixes the handling of the user dictionary. It is now possible to add common nouns or compounds to the user dictionary. I also added more tests to make sure that the dictionary contains valid data. In terms of features I think it's ready; now I'll focus on cleanup and adding more tests.

@thetaphi, thanks for volunteering! I think it would be nice to share some code with the Kuromoji, but I agree with Robert: this is a lot of work and I didn't want to change the format and the processing of the Kuromoji too early. We can still reevaluate the feasibility of merging some code when we have a better idea of the final format for this analyzer.

[Legacy Jira: Jim Ferenczi (@jimczi) on Mar 30 2018]

mikemccand commented 6 years ago

Hi Jim, I dug into this a bit more to explain your results. Just some notes from what regenerate prints:

I think the syllable FST is "unhealthy" in shape, but all the strangeness is basically at the root, and the cache takes care of that; otherwise it goes sparse pretty quickly due to how the dictionary is organized. You don't need a cache at all for the character one.

I think the difference may be that this dictionary is structured differently: it contains full compounds as entries, and each compound lists its decomposition. This is very different from the LUCENE-4956 decompounding process, which didn't have full compounds as entries, making the dictionary 10x smaller.

[Legacy Jira: Robert Muir (@rmuir) on Mar 31 2018]

mikemccand commented 6 years ago

I looked at the recent patch; one thing we need to warn about is that this thing currently loads > 30MB of stuff into RAM.

20MB of that is the connection costs, which seems a bit excessive on the heap. Maybe this should be done with a direct ShortBuffer, similar to BinaryDictionary?
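Something along these lines, as a sketch (the sizes and the index layout here are assumptions, not the patch's actual format):

    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;
    import java.nio.ShortBuffer;

    // keep the decompressed (forwardSize x backwardSize) cost matrix off-heap
    static ShortBuffer loadCosts(int forwardSize, int backwardSize) {
      ByteBuffer raw = ByteBuffer.allocateDirect(forwardSize * backwardSize * Short.BYTES)
                                 .order(ByteOrder.nativeOrder());
      // ... decompress ConnectionCosts.dat into raw ...
      return raw.asShortBuffer();
    }

    static int cost(ShortBuffer costs, int backwardSize, int forwardId, int backwardId) {
      return costs.get(backwardId + forwardId * backwardSize);
    }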

[Legacy Jira: Robert Muir (@rmuir) on Mar 31 2018]

mikemccand commented 6 years ago

There is still quite a bit of redundancy in the compound storage. Can we optimize the case where all the parts have the same POS (and that's also the POS of the compound itself)? We could steal an unused sign bit or something like that to indicate this, and it would save writing all those POS bytes for ~300k items. This seems to be the 90% case, with the notable exception of places (which often seem to be proper + general noun).
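A sketch of the sign-bit idea (the bit position and the surrounding record format here are illustrative, not the actual layout):

    import java.io.DataOutput;
    import java.io.IOException;

    // writer side: one flag bit means "all parts share the compound's POS",
    // so the per-part POS bytes can be skipped for the ~90% case
    static void writeCompound(DataOutput out, int storedBits, byte[] partPOS,
                              boolean allSamePOS) throws IOException {
      int flags = storedBits;
      if (allSamePOS) {
        flags |= 0x8000;  // illustrative flag bit
      }
      out.writeShort((short) flags);
      if (!allSamePOS) {
        for (byte pos : partPOS) {
          out.writeByte(pos);
        }
      }
    }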

[Legacy Jira: Robert Muir (@rmuir) on Mar 31 2018]

mikemccand commented 6 years ago

Another thing to look at is whether we really need two bytes for type + tag. I have trouble following this from the CSV file, but how many unique combinations are there really?

[Legacy Jira: Robert Muir (@rmuir) on Mar 31 2018]

mikemccand commented 6 years ago

Hi Robert, thanks for your testing and suggestions!

I pushed another patch that applies your suggestions. The connection costs are now loaded in a direct byte buffer on init (the matrix is still compressed on disk). I also changed the format of the binary dictionary to use 2 bits of the left id to encode the type (compound, inflect, morpheme or preanalysis), added a POS dict that maps the left id to the part-of-speech tag, and introduced a new flag to indicate entries with a single POS. The new size on disk is 25MB, with 10MB on the heap at start since we don't load the connection costs into the heap anymore.
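A sketch of what the 2-bit encoding might look like (the constants and bit positions here are illustrative, not the actual format):

    // two low bits of the stored left id carry the entry type
    static final int TYPE_MORPHEME = 0, TYPE_COMPOUND = 1,
                     TYPE_INFLECT  = 2, TYPE_PREANALYSIS = 3;

    static int pack(int leftId, int type) { return (leftId << 2) | type; }
    static int type(int packed)           { return packed & 0x03; }
    static int leftId(int packed)         { return packed >>> 2; }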

[Legacy Jira: Jim Ferenczi (@jimczi) on Apr 02 2018]

mikemccand commented 6 years ago

I attached a new patch with lots of cleanups and fixes. I ran HantecRel again, here are the results:

Analyzer   Index Time   Index Size   MAP(CLASSIC)   MAP(BM25)   MAP(GL2)
Korean     178s         90MB         .1638          .2101      .2081

I am not sure why it got faster; it could be the new compressed format. I'll dig.

[Legacy Jira: Jim Ferenczi (@jimczi) on Apr 02 2018]

mikemccand commented 6 years ago

Hi Jim, the latest changes look great. Thanks for optimizing it more!

Do you think there is an easy way to preserve the original compound with the decompound filter? Going purely from memory, I think kuromoji may do this by default, and maybe even our german decompounders too. From what I remember from relevance experiments in other languages, it seems like a pretty practical investment, especially since lucene's terms dict is good.

Mainly the concern I have is to have some handling for errors in the decompounding process. For this analyzer the representation of the model with respect to compounds is a little awkward, and I worry about OOV cases. So at least preserving the original can assist with that a bit. Might be an easy win.

[Legacy Jira: Robert Muir (@rmuir) on Apr 03 2018]

mikemccand commented 6 years ago

Hi Robert, I pushed another iteration that moves the decompound process and the POS filtering into the tokenizer. I think it's simpler to perform the decompounding and the filtering directly in the tokenizer, and this also allows keeping the compound token. I added a decompound mode option that can disable decompounding (none), decompound and discard the original compound (discard), or decompound and keep the original token (mixed); see the sketch below. By default the compound token is discarded, but it can be kept using the mixed mode.

I also changed the normalization option when building the dictionary: instead of adding both the normalized form and the original form, the builder now replaces the original form with the normalized one. By default the normalization is not activated, but it can be useful for other Korean dictionaries that use a decomposed form for hanguls, like Handic for instance: https://ja.osdn.net/projects/handic/

I added more tests and javadocs. I think it's getting closer ;)
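The three modes, roughly as they might be exposed on the tokenizer (a sketch of the option described above, not necessarily the patch's exact API):

    // decompound mode option on the Korean tokenizer
    public enum DecompoundMode {
      NONE,     // no decomposition: keep the compound token only
      DISCARD,  // decompose and discard the original compound (the default)
      MIXED     // decompose and also keep the original compound token
    }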

[Legacy Jira: Jim Ferenczi (@jimczi) on Apr 04 2018]

mikemccand commented 6 years ago

I attached a new patch that fixes an issue with the offsets of compound nouns. Currently the patch outputs a single path, and it can also keep the original compound as well as the decompounded parts. I think we can add the N best paths in a follow-up.

[Legacy Jira: Jim Ferenczi (@jimczi) on Apr 10 2018]

mikemccand commented 6 years ago

We may want to tweak the attributes reflection, or look at this POS enum some. It makes things a bit hard for debugging any analysis issues, e.g.

    term=한국,bytes=[ed 95 9c ea b5 ad],startOffset=0,endOffset=2,positionIncrement=1,positionLength=1,type=word,termFrequency=1,posType=MORPHEME,leftPOS=NNP,rightPOS=NNP,morphemes=null,reading=null
    term=대단,bytes=[eb 8c 80 eb 8b a8],startOffset=4,endOffset=6,positionIncrement=1,positionLength=1,type=word,termFrequency=1,posType=MORPHEME,leftPOS=XR,rightPOS=XR,morphemes=null,reading=null
    term=나라,bytes=[eb 82 98 eb 9d bc],startOffset=8,endOffset=10,positionIncrement=1,positionLength=1,type=word,termFrequency=1,posType=MORPHEME,leftPOS=NNG,rightPOS=NNG,morphemes=null,reading=null
    term=이,bytes=[ec 9d b4],startOffset=10,endOffset=13,positionIncrement=1,positionLength=1,type=word,termFrequency=1,posType=MORPHEME,leftPOS=VCP,rightPOS=VCP,morphemes=null,reading=null

Can we make it so the user doesn't have to decode the tag abbreviations, at least when using the toString (reflector stuff)? If they have to look these up from comments in the source code it will be more difficult. For example, kuromoji:

    term=多く,bytes=[e5 a4 9a e3 81 8f],startOffset=0,endOffset=2,positionIncrement=1,positionLength=1,type=word,termFrequency=1,baseForm=null,partOfSpeech=名詞-副詞可能,partOfSpeech (en)=noun-adverbial,reading=オオク,reading (en)=ooku,pronunciation=オーク,pronunciation (en)=oku,inflectionType=null,inflectionType (en)=null,inflectionForm=null,inflectionForm (en)=null,keyword=false
    term=学生,bytes=[e5 ad a6 e7 94 9f],startOffset=3,endOffset=5,positionIncrement=2,positionLength=1,type=word,termFrequency=1,baseForm=null,partOfSpeech=名詞-一般,partOfSpeech (en)=noun-common,reading=ガクセイ,reading (en)=gakusei,pronunciation=ガクセイ,pronunciation (en)=gakusei,inflectionType=null,inflectionType (en)=null,inflectionForm=null,inflectionForm (en)=null,keyword=false
    term=試験,bytes=[e8 a9 a6 e9 a8 93],startOffset=6,endOffset=8,positionIncrement=2,positionLength=1,type=word,termFrequency=1,baseForm=null,partOfSpeech=名詞-サ変接続,partOfSpeech (en)=noun-verbal,reading=シケン,reading (en)=shiken,pronunciation=シケン,pronunciation (en)=shiken,inflectionType=null,inflectionType (en)=null,inflectionForm=null,inflectionForm (en)=null,keyword=false

[Legacy Jira: Robert Muir (@rmuir) on Apr 12 2018]

mikemccand commented 6 years ago

An easy win related to this is to make the enum values have real javadoc comments instead of source code ones.

e.g. today it looks like:

    public enum Tag {
      // Infix
      E(100),
      // Interjection
      IC(110),

But if we make "Infix" and "Interjection" real javadoc comments then the user has a way to see the whole POS tagset in the generated documentation.
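For instance, a sketch of the same entries with real javadoc:

    public enum Tag {
      /** Infix */
      E(100),
      /** Interjection */
      IC(110),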

[Legacy Jira: Robert Muir (@rmuir) on Apr 12 2018]

mikemccand commented 6 years ago

Thanks Robert. I attached a new patch that changes the enum to attach a description to each tag and reflects the description in the javadoc comments. The toString reflection now returns the description of the POS tag:

    KoreanTokenizer@22f9baea term=평창,bytes=[ed 8f 89 ec b0 bd],startOffset=0,endOffset=2,positionIncrement=1,positionLength=1,type=word,termFrequency=1,posType=MORPHEME,leftPOS=NNP(Proper Noun),rightPOS=NNP(Proper Noun),morphemes=null,reading=null

... and the compounds are correctly rendered:

    KoreanTokenizer@292528fd term=가락지나물,bytes=[ea b0 80 eb 9d bd ec a7 80 eb 82 98 eb ac bc],startOffset=0,endOffset=5,positionIncrement=1,positionLength=1,type=word,termFrequency=1,posType=COMPOUND,leftPOS=NNG(General Noun),rightPOS=NNG(General Noun),morphemes=가락지/NNG(General Noun)+나물/NNG(General Noun),reading=null

I also changed the format for the Preanalysis tokens; they are now compressed using the same technique as for Compounds, which gives another 2MB improvement over the last patch.

[Legacy Jira: Jim Ferenczi (@jimczi) on Apr 12 2018]

mikemccand commented 6 years ago

Very nice. We may want to look at the generated javadocs (the last patch I looked at had some precommit issues we may need to fix first) and take a stab at the overview.html? It is the first thing someone sees when they click the module's doc on the website, e.g.: https://lucene.apache.org/core/7_3_0/analyzers-icu/index.html

Also, I noticed DEFAULT_STOP_TAGS is in the TokenizerFactory; that makes the tokenizer very difficult to use from the lucene API. Can we push such defaults into the Tokenizer instead so that it is easier to instantiate, etc.?

[Legacy Jira: Robert Muir (@rmuir) on Apr 12 2018]

mikemccand commented 6 years ago

I think I've seen this root arc caching technique in at least a couple other places. Perhaps the FST ought to natively do this? Or perhaps with an optional wrapper to add configurable toggles (i.e. ranges to cache) like I see here in TokenInfoFST?

[Legacy Jira: David Smiley (@dsmiley) on Apr 12 2018]

mikemccand commented 6 years ago

I don't think it should; it is very specific to what kuromoji/nori are doing. I think the caching the FST does by default is already good.

[Legacy Jira: Robert Muir (@rmuir) on Apr 12 2018]

mikemccand commented 6 years ago

I attached a new patch that passes precommit checks. The javadocs look fine; all part-of-speech tags have a description attached. The DEFAULT_STOP_TAGS is now in the Tokenizer and is used by default when no tags are specified.

[Legacy Jira: Jim Ferenczi (@jimczi) on Apr 12 2018]

mikemccand commented 6 years ago

I still don't see a KoreanTokenizer ctor that uses these defaults? I only see ones with 4 or 5 required parameters.

[Legacy Jira: Robert Muir (@rmuir) on Apr 12 2018]

mikemccand commented 6 years ago

Right, I changed the Analyzer but not the Tokenizer. I attached a new patch that adds two more ctors that use the default parameters.

[Legacy Jira: Jim Ferenczi (@jimczi) on Apr 12 2018]

mikemccand commented 6 years ago

If the UserDictionary is optional can we just have a no-arg ctor? I think building a custom dictionary is pretty expert.

[Legacy Jira: Robert Muir (@rmuir) on Apr 12 2018]

mikemccand commented 6 years ago

Sure, I added two more ctors in the last patch: one with no args and one with only the AttributeFactory.

[Legacy Jira: Jim Ferenczi (@jimczi) on Apr 12 2018]

mikemccand commented 6 years ago

Do you think we should remove KoreanTokenizer(AttributeFactory, UserDictionary) and KoreanTokenizer(UserDictionary)?

This way we just have "defaults" and "everything" ctors.

It would also remove some compile-time ambiguity between KoreanTokenizer(AttributeFactory) and KoreanTokenizer(UserDictionary), which is not good if null values are allowed. And I don't see passing a custom UserDictionary as being any less expert than changing the stopword lists or preserving original compounds.

[Legacy Jira: Robert Muir (@rmuir) on Apr 12 2018]

mikemccand commented 6 years ago

And I haven't looked into why the tokenizer takes stoptags.

It may help to separate concerns better if this is moved to a tokenfilter like how kuromoji does it: https://github.com/apache/lucene-solr/blob/master/lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapanesePartOfSpeechStopFilter.java

[Legacy Jira: Robert Muir (@rmuir) on Apr 12 2018]

mikemccand commented 6 years ago

I agree; this will also simplify the understanding of these ctors. I'll remove them and keep only the "defaults" and "everything" ctors. Regarding why the KoreanTokenizer takes stoptags: it is done to simplify the removal of tokens when we keep compounds, since we need to set the position length of the compound token without counting the tokens that should be removed. Otherwise the stop tags filter would have to handle position length when it removes a token, and I find it simpler to do it directly in the Tokenizer, especially if we add support for keeping the N best paths in a follow-up.

[Legacy Jira: Jim Ferenczi (@jimczi) on Apr 12 2018]

mikemccand commented 6 years ago

Shouldn't FilteringTokenFilter be enough? It just requires you to return true/false.
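A sketch of that shape, assuming a part-of-speech attribute on the stream (the attribute and tag type names here are assumptions):

    import java.util.Set;
    import org.apache.lucene.analysis.FilteringTokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    // remove tokens whose POS tag is in the stop set; FilteringTokenFilter
    // takes care of position increments for the dropped tokens
    public final class KoreanPartOfSpeechStopFilter extends FilteringTokenFilter {
      private final Set<POS.Tag> stopTags;
      private final PartOfSpeechAttribute posAtt =
          addAttribute(PartOfSpeechAttribute.class);  // assumed attribute

      public KoreanPartOfSpeechStopFilter(TokenStream in, Set<POS.Tag> stopTags) {
        super(in);
        this.stopTags = stopTags;
      }

      @Override
      protected boolean accept() {
        return !stopTags.contains(posAtt.getLeftPOS());
      }
    }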

[Legacy Jira: Robert Muir (@rmuir) on Apr 12 2018]

mikemccand commented 6 years ago

No, because FilteringTokenFilter doesn't handle positionLength, so if it removes a token from a compound it needs to change the posLength of the original compound. I tried to write something to handle this case in the filtering token filter, but it's not trivial and requires a lot of code, so I chose the simple path of removing the tokens directly in the tokenizer.

[Legacy Jira: Jim Ferenczi (@jimczi) on Apr 12 2018]

mikemccand commented 6 years ago

I don't understand why it needs to change posLength; I think this is some other issue. Why is this analyzer any different from the japanese one for this case?

[Legacy Jira: Robert Muir (@rmuir) on Apr 12 2018]

mikemccand commented 6 years ago

I think the Japanese analyzer has the same issue, and it's even worse since it can output multiple paths. If we have a compound "AB" that is decomposed into "A" and "B", and we remove "B", we need to change the posLength of "AB" to 1 (instead of 2).
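Concretely, with the original compound kept, the token graph looks like this; dropping "B" forces the posLength of "AB" down to 1:

    before removing B:               after removing B:
      AB  posInc=1, posLen=2           AB  posInc=1, posLen=1
      A   posInc=0, posLen=1           A   posInc=0, posLen=1
      B   posInc=1, posLen=1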

[Legacy Jira: Jim Ferenczi (@jimczi) on Apr 12 2018]

mikemccand commented 6 years ago

I didn't really see consensus on this issue though (there was some discussion about it on LUCENE-4065, and it seemed FilteringTokenFilter may be doing the right thing). I definitely think it's a concern unrelated to korean, and we shouldn't put stopword filtering into our tokenizers yet, until it's understood and discussed.

[Legacy Jira: Robert Muir (@rmuir) on Apr 12 2018]

mikemccand commented 6 years ago

Ok, I'll restore the KoreanPartOfSpeechStopFilter then, and we can discuss LUCENE-4065 separately.

[Legacy Jira: Jim Ferenczi (@jimczi) on Apr 12 2018]