
HyphenationCompoundWordTokenFilter creates overlapping tokens with onlyLongestMatch enabled [LUCENE-8183] #184

Open mikemccand opened 6 years ago

mikemccand commented 6 years ago

The HyphenationCompoundWordTokenFilter creates overlapping tokens even if onlyLongestMatch is enabled. 

Example:

Dictionary: gesellschaft, schaft
Hyphenator: de_DR.xml (from Apache OFFO)
onlyLongestMatch: true

| attribute | token 1 | token 2 | token 3 |
|---|---|---|---|
| text | gesellschaft | gesellschaft | schaft |
| raw_bytes | [67 65 73 65 6c 6c 73 63 68 61 66 74] | [67 65 73 65 6c 6c 73 63 68 61 66 74] | [73 63 68 61 66 74] |
| start | 0 | 0 | 0 |
| end | 12 | 12 | 12 |
| positionLength | 1 | 1 | 1 |
| type | word | word | word |
| position | 1 | 1 | 1 |

IMHO this includes two unexpected tokens:

  1. the second 'gesellschaft', as it duplicates the original token
  2. 'schaft', as it is a sub-token of 'gesellschaft', which is itself present in the dictionary

Legacy Jira details

LUCENE-8183 by Rupert Westenthaler on Feb 23 2018, updated Jan 14 2021

Environment:

Configuration of the analyzer:

```xml
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.HyphenationCompoundWordTokenFilterFactory"
        hyphenator="lang/hyph_de_DR.xml" encoding="iso-8859-1"
        dictionary="lang/wordlist_de.txt"
        onlyLongestMatch="true"/>
```
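For reference, a roughly equivalent programmatic setup in plain Lucene (a minimal sketch, not part of the original report: it assumes a Lucene 7.x+ classpath with analyzers-common, a local copy of the OFFO hyphenation file at the path shown, and the default min/max subword sizes):

```java
import java.io.StringReader;
import java.util.Arrays;

import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.compound.HyphenationCompoundWordTokenFilter;
import org.apache.lucene.analysis.compound.hyphenation.HyphenationTree;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class Lucene8183Repro {
  public static void main(String[] args) throws Exception {
    // Path to the OFFO hyphenation grammar; adjust to your local copy.
    HyphenationTree hyphenator =
        HyphenationCompoundWordTokenFilter.getHyphenationTree("lang/hyph_de_DR.xml");
    CharArraySet dictionary =
        new CharArraySet(Arrays.asList("gesellschaft", "schaft"), true);

    WhitespaceTokenizer tokenizer = new WhitespaceTokenizer();
    tokenizer.setReader(new StringReader("gesellschaft"));
    TokenStream ts = new HyphenationCompoundWordTokenFilter(
        new LowerCaseFilter(tokenizer), hyphenator, dictionary,
        5, 2, 15,   // default minWordSize, minSubwordSize, maxSubwordSize
        true);      // onlyLongestMatch

    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      // With the behaviour described above this prints "gesellschaft" twice
      // and "schaft" once, all at the same position.
      System.out.println(term);
    }
    ts.end();
    ts.close();
  }
}
```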

 

Attachments: LUCENE-8183_20180223_rwesten.diff, LUCENE-8183_20180227_rwesten.diff, lucene-8183.zip

mikemccand commented 6 years ago

Added a patch that shows how the unexpected behaviour could be fixed.

[Legacy Jira: Rupert Westenthaler on Feb 23 2018]

mikemccand commented 6 years ago

Hi, I have not noticed this with my dictionaries; I will have to dig into that. Did you also check the German dictionary provided here: https://github.com/uschindler/german-decompounder

[Legacy Jira: Uwe Schindler (@uschindler) on Feb 23 2018]

mikemccand commented 6 years ago

@rwesten Quick question regarding your patch: What's the reasoning behind not decomposing terms that are part of the dictionary at all?

The onlyLongestMatch flag currently affects whether all matches or only the longest match should be returned per start character (in DictionaryCompoundWordTokenFilter) or per hyphenation start point (in HyphenationCompoundWordTokenFilter).

Example: For dictionary "Schaft", "Wirt", "Wirtschaft", "Wissen", "Wissenschaft", the input "Wirtschaftswissenschaft" will return the original input plus the tokens "Wirtschaft", "schaft", "wissenschaft", "schaft", but not "Wirt" or "Wissen". "schaft" is still returned (even twice) because it's the longest token starting at the respective position.

I like the idea of restricting this further to only the longest terms that touch a certain hyphenation point. This would exclude "schaft" in the example above (as "Wirtschaft" and "wissenschaft" are two longer terms encompassing the respective hyphenation point). On the other hand, there might be examples where you still want to include the "overlapping" tokens. For "Fußballpumpe" and dictionary "Ball", "Ballpumpe", "Pumpe", "Fuß", "Fußball" you would get the tokens "Fußball" and "pumpe" but not "Ballpumpe" as "Ball" has already been considered part of Fußball. Also, not sure if your change also improves the situation for languages other than German.

Regarding point 1: The current algorithm always returns the term itself again if it's part of the dictionary. I guess this could be changed by checking not against this.maxSubwordSize but against Math.min(this.maxSubwordSize, termAtt.length() - 1), as illustrated below.
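A minimal, plain-Java illustration of that bound change (illustrative only, not Lucene source; the variable names mirror those mentioned above):

```java
public class MaxSubwordBound {
  public static void main(String[] args) {
    int maxSubwordSize = 15;        // default maximum subword length
    String term = "gesellschaft";   // 12 chars, itself a dictionary word

    // Current bound: a "subword" may span the whole term, so the term is
    // re-emitted as its own subtoken when it appears in the dictionary.
    int oldBound = Math.min(maxSubwordSize, term.length());      // 12
    // Suggested bound: strictly shorter than the term, so the full term
    // can never be produced again as a subword.
    int newBound = Math.min(maxSubwordSize, term.length() - 1);  // 11

    System.out.println("old upper bound: " + oldBound);
    System.out.println("new upper bound: " + newBound);
  }
}
```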

Perhaps these kinds of adjustments should rather be done in a TokenFilter similar to RemoveDuplicatesTokenFilter instead of complicating the decompounding algorithm?

[Legacy Jira: Matthias Krueger (@mkr) on Feb 23 2018]

mikemccand commented 6 years ago

AFAIK I use exactly this dictionary and hyphenator config. I will provide a Solr core config that can be used to reproduce the described issue.

[Legacy Jira: Rupert Westenthaler on Feb 23 2018]

mikemccand commented 6 years ago

I was not aware that this was the intended behaviour.

For "Fußballpumpe" and dictionary "Ball", "Ballpumpe", "Pumpe", "Fuß", "Fußball" you would get the tokens "Fußball" and "pumpe" but not "Ballpumpe" as "Ball" has already been considered part of Fußball. Also, not sure if your change also improves the situation for languages other than German.

That's a good point. Maybe one should still consider parts that are not enclosed by a token that was already decomposed. So for Fußballpumpe: "ball" would be ignored as "Fußball" is already present, but "ballpumpe" would still be added as a token. Finally, "pumpe" is ignored as "ballpumpe" is present.

This reminds me of ALL, NO_SUB and LONGEST_DOMINANT_RIGHT as supported by the Solr Text Tagger.

> Perhaps these kinds of adjustments should rather be done in a TokenFilter similar to RemoveDuplicatesTokenFilter instead of complicating the decompounding algorithm?

I am aware of this possibility. In fact, I do use the RemoveDuplicatesTokenFilter to remove those tokens. My point was just to ask why they are added in the first place.
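For context, a sketch of that workaround stage (the helper wrapper is hypothetical; note that RemoveDuplicatesTokenFilter only drops tokens repeating the same term text at the same position, so it removes the duplicate "gesellschaft" from the example above but keeps "schaft"):

```java
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.miscellaneous.RemoveDuplicatesTokenFilter;

final class DedupStage {
  /**
   * Wraps a decompounding token stream so that exact term duplicates at
   * the same position are dropped.
   */
  static TokenStream dedup(TokenStream compounds) {
    return new RemoveDuplicatesTokenFilter(compounds);
  }
}
```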

[Legacy Jira: Rupert Westenthaler on Feb 23 2018]

mikemccand commented 6 years ago

@rwesten: I was not aware that this was my dictionary file! The names in your example (under "Environment" in your report) did not look like the example listed here: https://github.com/uschindler/german-decompounder

[Legacy Jira: Uwe Schindler (@uschindler) on Feb 23 2018 [updated: Feb 24 2018]]

mikemccand commented 6 years ago

> I am aware of this possibility. In fact, I do use the RemoveDuplicatesTokenFilter to remove those tokens. My point was just to ask why they are added in the first place.

I think it's good to not add them in the first place. The change is quite simple, so it can be done here. And it does not really complicate the algorithm, as it's done in one separate place.

[Legacy Jira: Uwe Schindler (@uschindler) on Feb 24 2018]

mikemccand commented 6 years ago

I have to check the algorithm, but to get this patch into Lucene, the test cases need to be adapted to check the new behaviour.

[Legacy Jira: Uwe Schindler (@uschindler) on Feb 24 2018]

mikemccand commented 6 years ago

FYI: I plan to spend some time implementing a version of the DictionaryCompoundWordTokenFilter that adds options for suppressing sub-matches and overlapping matches.

IMO the simplest way is to first emit all tokens and later filter them based on the active options (onlyLongestMatch, noSub, noOverlap).
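A minimal sketch of that emit-all-then-filter idea for the noSub case (one possible reading of the semantics discussed above; Span and all other names are illustrative, not the actual patch code):

```java
import java.util.ArrayList;
import java.util.List;

public class SubwordFilterSketch {

  /** Stands in for a candidate subword token emitted in the first pass. */
  static final class Span {
    final int start, end;
    final String text;
    Span(int start, int end, String text) { this.start = start; this.end = end; this.text = text; }
    int len() { return end - start; }
  }

  /** noSub: drop every candidate that is fully enclosed by a longer candidate. */
  static List<Span> noSub(List<Span> candidates) {
    List<Span> kept = new ArrayList<>();
    for (Span s : candidates) {
      boolean enclosed = false;
      for (Span o : candidates) {
        if (o != s && o.len() > s.len() && o.start <= s.start && s.end <= o.end) {
          enclosed = true;
          break;
        }
      }
      if (!enclosed) kept.add(s);
    }
    return kept;
  }

  public static void main(String[] args) {
    // Candidates for "fußballpumpe" with the dictionary from the discussion:
    List<Span> all = List.of(
        new Span(0, 3, "fuß"), new Span(0, 7, "fußball"), new Span(3, 7, "ball"),
        new Span(3, 12, "ballpumpe"), new Span(7, 12, "pumpe"));
    // Keeps fußball and ballpumpe; fuß, ball and pumpe are enclosed by longer matches.
    noSub(all).forEach(s -> System.out.println(s.text));
  }
}
```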

Regarding the test:

  1. Providing good test examples is hard, as the current test cases are based on Danish and I do not speak the language.
  2. Providing examples in German would be easy, but this would require a German hyphenator, and that file is licensed under the LaTeX Project Public License and can therefore not be included in the source.
  3. Given suitable examples, the implementation of the actual tests seems rather easy, as they can be implemented similarly to the existing test cases.

[Legacy Jira: Rupert Westenthaler on Feb 26 2018]

mikemccand commented 6 years ago

You might want to have a look at "mocking" the HyphenationTree (see my patch for LUCENE-8185) which simplifies writing a decompounding test.

[Legacy Jira: Matthias Krueger (@mkr) on Feb 26 2018]

mikemccand commented 6 years ago

That's helpful indeed! Thanks.

[Legacy Jira: Rupert Westenthaler on Feb 26 2018]

mikemccand commented 6 years ago

Patch: LUCENE-8183_20180227_rwesten.diff

New parameters: noSubMatches and noOverlappingMatches.

Together with the existing onlyLongestMatch, these can be used to define which subwords should be added as tokens. Functionality is as described above.

Typically users will only want to enable one of the three attributes, as noOverlappingMatches is the most restrictive and noSubMatches is more restrictive than onlyLongestMatch. When a more restrictive option is enabled, the state of the less restrictive ones has no effect.

Because of that, it would be an option to refactor this into a single attribute with different settings, but this would require thinking about backward compatibility for configurations that currently use onlyLongestMatch=true.

### Algorithm

If processing of subwords is deactivated (i.e. any of onlyLongestMatch, noSubMatches, or noOverlappingMatches is active), the algorithm first checks whether the token itself is part of the dictionary. If so, it returns immediately, as sketched below. This avoids adding tokens for subwords when the token itself is in the dictionary (see #testNoSubAndTokenInDictionary for more info).
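A sketch of that early exit as a standalone helper (illustrative, not the actual diff; CharArraySet.contains is the real Lucene API, the rest is assumed naming):

```java
import org.apache.lucene.analysis.CharArraySet;

final class EarlyExitSketch {
  /**
   * Returns true when decompounding should be skipped entirely: a
   * restricting option is active and the whole token is itself a
   * dictionary word, so no subword tokens are emitted.
   */
  static boolean skipSubwords(CharArraySet dictionary, char[] buffer, int length,
                              boolean onlyLongestMatch, boolean noSubMatches,
                              boolean noOverlappingMatches) {
    boolean restricted = onlyLongestMatch || noSubMatches || noOverlappingMatches;
    return restricted && dictionary.contains(buffer, 0, length);
  }
}
```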

I changed the iteration direction of the inner for loop to start with the longest possible subword as this simplified the code.

NOTE: this also changes the order of the tokens in the token stream, but as all tokens are at the same position, that should not make any difference. I did, however, have to modify some existing tests, as those were sensitive to the ordering.

### Tests

I added two test methods in TestCompoundWordTokenFilter

  1. #testNoSubAndNoOverlap() tests the expected behaviour of the noSubMatches and noOverlappingMatches options
  2. #testNoSubAndTokenInDictionary() tests that no tokens for subwords are added in the case that the token is part of the dictionary

In addition TestHyphenationCompoundWordTokenFilterFactory#testLucene8183() asserts that the new configuration options are parsed.

### Environment

This patch is based on master from git@github.com:apache/lucene-solr.git commit: d512cd7604741a2f55deb0c840188f0342f1ba57

[Legacy Jira: Rupert Westenthaler on Feb 27 2018]

mikemccand commented 4 years ago

Hi @uschindler,

I have created a pull request https://github.com/apache/lucene-solr/pull/2014 with the changes from @rwesten.

[Legacy Jira: Martin Demberger on Oct 21 2020]

mikemccand commented 4 years ago

Hi, thanks Martin! Let me get up to date with the patch and we'll see. To me the latest changes look fine, but I haven't yet been able to think them through.

[Legacy Jira: Uwe Schindler (@uschindler) on Oct 21 2020]

mikemccand commented 3 years ago

Hi Uwe,

is there any way I can speed up the merge? We have a few customers waiting for this fix, because our search very often falls into this pitfall.

[Legacy Jira: Martin Demberger on Nov 02 2020]

mikemccand commented 3 years ago

Is there any update on this? I have updated the PR so that it passes the new checks.

[Legacy Jira: Martin Demberger on Jan 14 2021]