mikemccand / stargazers-migration-test

Testing Lucene's Jira -> GitHub issues migration

GraphTokenStreamFiniteStrings.FiniteStringsTokenStream does not play well with subsequent TokenFilters [LUCENE-8916] #913

Closed mikemccand closed 5 years ago

mikemccand commented 5 years ago

GraphTokenStreamFiniteStrings provides a view over multiple paths through a token graph, which is useful when building queries over multi-term synonyms. This view is exposed as an iterator over simple TokenStreams. However, these TokenStreams do not work correctly when further wrapped in token filters, because they set terms through a BytesTermAttribute rather than a CharTermAttribute.

For an example of the issues this can cause, see https://github.com/elastic/elasticsearch/issues/43976, where Elasticsearch uses a special shingle field to speed up phrase searches. Queries with multiple terms are converted to shingles. However, if the query resolves into a graph due to synonyms, this conversion breaks, because the FixedShingleFilter is given a token stream built by GraphTokenStreamFiniteStrings (GTSFS). Terms are set using BytesTermAttribute but read using CharTermAttribute, and because these have different backing implementations, FixedShingleFilter ends up emitting null tokens.
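The mismatch can be illustrated with a small toy model. This is only a sketch under stated assumptions: `Attributes`, `CharTerm`, `BytesTerm`, and `BytesOnlyStream` are hypothetical stand-ins written for this example, not Lucene classes, and the toy consumer sees empty terms where the real filter reportedly emits null tokens.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Toy model only: none of these classes are Lucene types.
final class AttributeMismatchSketch {

    /** Hypothetical attribute registry: one shared instance per attribute type. */
    static final class Attributes {
        private final Map<Class<?>, Object> byType = new HashMap<>();

        <T> T getOrAdd(Class<T> type, Supplier<T> factory) {
            return type.cast(byType.computeIfAbsent(type, k -> factory.get()));
        }
    }

    /** Hypothetical char-backed and bytes-backed term attributes with separate storage. */
    static final class CharTerm { final StringBuilder chars = new StringBuilder(); }
    static final class BytesTerm { byte[] bytes = new byte[0]; }

    /** Producer that writes each term only into the bytes-backed attribute. */
    static final class BytesOnlyStream {
        final Attributes attrs = new Attributes();
        private final BytesTerm bytesTerm = attrs.getOrAdd(BytesTerm.class, BytesTerm::new);
        private final List<String> terms;
        private int next = 0;

        BytesOnlyStream(List<String> terms) { this.terms = terms; }

        boolean incrementToken() {
            if (next >= terms.size()) return false;
            bytesTerm.bytes = terms.get(next++).getBytes(StandardCharsets.UTF_8);
            return true;
        }
    }

    /** Consumer that looks only at the char-backed attribute, which the producer never fills. */
    static List<String> readThroughCharTerm(BytesOnlyStream in) {
        CharTerm charTerm = in.attrs.getOrAdd(CharTerm.class, CharTerm::new);
        List<String> out = new ArrayList<>();
        while (in.incrementToken()) out.add(charTerm.chars.toString());
        return out;
    }

    /** Consumer that reads the same attribute the producer writes, so terms survive. */
    static List<String> readThroughBytesTerm(BytesOnlyStream in) {
        BytesTerm bytesTerm = in.attrs.getOrAdd(BytesTerm.class, BytesTerm::new);
        List<String> out = new ArrayList<>();
        while (in.incrementToken()) out.add(new String(bytesTerm.bytes, StandardCharsets.UTF_8));
        return out;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("fast", "car");
        System.out.println(readThroughCharTerm(new BytesOnlyStream(terms)));
        System.out.println(readThroughBytesTerm(new BytesOnlyStream(terms)));
    }
}
```

In this model the stream advances correctly in both cases; only the consumer that reads the attribute nobody wrote to loses the term text.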


Legacy Jira details

LUCENE-8916 by Alan Woodward (@romseygeek) on Jul 15 2019, resolved Jul 19 2019

mikemccand commented 5 years ago

Interestingly, the patch attached to LUCENE-8644 will fix this, as it makes GTSFS clone all attributes, rather than just saving terms and playing them back again in a synthetic token stream.

[Legacy Jira: Alan Woodward (@romseygeek) on Jul 15 2019]
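The difference between saving only terms and cloning all attributes can be sketched with a toy model. This is an illustration under assumptions, not the code from the patch: a token's state is modelled as a plain map from hypothetical attribute names (`bytesTerm`, `charTerm`, `positionIncrement`) to values, and all method names are invented for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: attribute names and methods are hypothetical, not Lucene APIs.
final class PreserveAttributesSketch {

    /** Builds a token state carrying the term in both term attributes plus one extra attribute. */
    static Map<String, Object> token(String term) {
        Map<String, Object> state = new HashMap<>();
        state.put("bytesTerm", term);
        state.put("charTerm", term);
        state.put("positionIncrement", 1);
        return state;
    }

    /** Saves only the bytes term of each token; every other attribute is dropped. */
    static List<Map<String, Object>> saveTermsOnly(List<Map<String, Object>> tokens) {
        List<Map<String, Object>> saved = new ArrayList<>();
        for (Map<String, Object> state : tokens) {
            Map<String, Object> copy = new HashMap<>();
            copy.put("bytesTerm", state.get("bytesTerm"));
            saved.add(copy);
        }
        return saved;
    }

    /** Saves a copy of the full state of each token, so every attribute can be replayed. */
    static List<Map<String, Object>> saveAllAttributes(List<Map<String, Object>> tokens) {
        List<Map<String, Object>> saved = new ArrayList<>();
        for (Map<String, Object> state : tokens) {
            saved.add(new HashMap<>(state));
        }
        return saved;
    }

    /** Reads one named attribute from each replayed token; missing attributes come back as null. */
    static List<Object> readAttribute(List<Map<String, Object>> replayed, String name) {
        List<Object> out = new ArrayList<>();
        for (Map<String, Object> state : replayed) {
            out.add(state.get(name));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> tokens = Arrays.asList(token("fast"), token("car"));
        System.out.println(readAttribute(saveTermsOnly(tokens), "charTerm"));
        System.out.println(readAttribute(saveAllAttributes(tokens), "charTerm"));
    }
}
```

With the full state copied, a downstream reader can use whichever attribute it prefers, which is the property the commit message "preserves all attributes" describes.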

mikemccand commented 5 years ago

Commit 1eb2a26c6cc9346827a321c3f883f17ea94ea013 in lucene-solr's branch refs/heads/branch_8x from Alan Woodward https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=1eb2a26

LUCENE-8916: GraphTokenStreamFiniteStrings preserves all attributes

[Legacy Jira: ASF subversion and git services on Jul 19 2019]

mikemccand commented 5 years ago

Commit 1ccef967677d4eeab4c162b7c0d6eeb81ebd5281 in lucene-solr's branch refs/heads/master from Alan Woodward https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=1ccef96

LUCENE-8916: GraphTokenStreamFiniteStrings preserves all attributes

[Legacy Jira: ASF subversion and git services on Jul 19 2019]