mikemccand / stargazers-migration-test

Testing Lucene's Jira -> GitHub issues migration
0 stars 0 forks source link

SynonymGraphFilter cannot handle input stream with tokens filtered. [LUCENE-8985] #982

Open mikemccand opened 4 years ago

mikemccand commented 4 years ago

@janhoy find the bug.

In an analyzer with e.g. stopFilter where tokens are removed from the stream and replaced with a “hole”, synonymgraphfilter will not preserve these holes but remove them, resulting in certain phrase queries failing.


Legacy Jira details

LUCENE-8985 by Chongchen Chen on Sep 21 2019, updated Sep 02 2020 Attachments: SGF_SF_interaction.patch.txt

mikemccand commented 4 years ago

Thanks for the fix.

Can you please fill in the PR CHECKLIST template that appeared  opened the PR?

And also pleas explain what the fix does, I can see the code but how can we be sure it is the right fix? :)

[Legacy Jira: Jan Høydahl (@janhoy) on Sep 21 2019]

mikemccand commented 4 years ago

I have updated the pull request. @janhoy

[Legacy Jira: Chongchen Chen on Sep 22 2019]

mikemccand commented 4 years ago

Would be great if we could complete this in time for 8.3.0

[Legacy Jira: Jan Høydahl (@janhoy) on Sep 23 2019]

mikemccand commented 4 years ago

@mikemccand  can you answer? And perhaps review the PR?

[Legacy Jira: Jan Høydahl (@janhoy) on Sep 24 2019]

mikemccand commented 4 years ago

I'd love to see this go in 8.4 but I need help reviewing. [~msokolov@gmail.com]?

[Legacy Jira: Jan Høydahl (@janhoy) on Nov 23 2019]

mikemccand commented 4 years ago

I looked at the patch - it seems quite a significant change: maybe a good one?! I'm not entirely clear what the goal is though - the bug description in this issue is pretty light on details. Can we explain here what the current behavior of PhraseQuery and other positional queries is w.r.t holes more generally, ignoring synonyms, and then how it currently works (is broken) in the presence of synonyms? I don't have enough context to review this. I'm willing to help out, but I'd need more to go on, and I can't really commit to the 8.4 release schedule, sorry @janhoy

[Legacy Jira: Michael Sokolov (@msokolov) on Nov 23 2019]

mikemccand commented 4 years ago

I was hoping the new tests would describe the bug. The current sgf code does not handle holes in the token stream caused by removing tokens in e.g. stopfilter. When holes are remover then a phrase query may get wrong match. Simplified example:

Document: Please clean the screen After stopfilter: Please clean * screen

Query: “clean the monitor” After stopfilter: “clean * monitor” After sgf: “clean screen|monitor” No match

[Legacy Jira: Jan Høydahl (@janhoy) on Nov 23 2019]

mikemccand commented 4 years ago

Coming back to this. Is the bug well understood? Is there perhaps some simpler way to fix the bug than the current PR proposal?

[Legacy Jira: Jan Høydahl (@janhoy) on Dec 09 2019]

mikemccand commented 4 years ago

To my way of thinking this is more of a bug with how we do position queries in the presence of stop word "holes" rather than something specifically related to synonyms. For example, wouldn't we expect a phrase query "paris in spring" to match "paris in the spring"? And, conversely what would the harm be if "Mott Hoople" matched "Mott the Hoople."

But these are just some anecdotes, perhaps there is a history here of why we preserve and respect holes that I'm lacking.

Still, given that we cannot index position length, our multi-word synonyms and other overlapping multi-token runs produce inadvertent phrase queries already, so trying to be very precise about holes seems like a stretch.

I'm not sure I have an alternate solution to advocate - maybe removing holes altogether when indexing and querying? Maybe making PhraseQuery somehow treat holes as nonexistent (that doesn't seem workable though).

And in the end, none of that really militates against this patch; we have holes, we should handle them consistently, which I guess means that SGF should preserve them. I think we just need to make sure the implementation is otherwise sound. It seems as if it a minimum it passes our unit tests. It's hard to tell by inspection if it has bugs, though. I'd be more confident if we had some experience with query/document test suite where we can verify that known good results are not impacted in some way. Do we have any such test bed though?

[Legacy Jira: Michael Sokolov (@msokolov) on Dec 09 2019]

mikemccand commented 4 years ago

Thanks for chiming in Michael. Not sure where the "holes" approach came from but it's been that way as long as I can remember, so we should respect those holes. I also agree that this same behaviour of holes may potentially be an issue with other filters such as WDGF but in the isolated context of input/output expectations of SGF, the fix must be in that component.

The patch is in a too complex area for me to confidently OK it. So perhaps if we added a bunch of unit tests for various input/output expectaions of today's SGF and commit those first, then we could add even more tests to this PR to assert expected behaviours with "holes".

[Legacy Jira: Jan Høydahl (@janhoy) on Dec 09 2019]

mikemccand commented 3 years ago

Any thoughts on the way forward for this?

[Legacy Jira: Jan Høydahl (@janhoy) on Sep 02 2020]

mikemccand commented 3 years ago

Thank you @nppoly for the thorough PR – I will try to review again soon.

Holes are a challenge for graph token filters.  I have long felt that stop words (and other holey tokens) should not be removed from the token stream, but rather a new DeletedAttribute would mark the token as deleted (and not to be indexed) but still the token would remain (and be incrementToken'd) to record the metadata about that token for the sake of future token filter stages.  E.g. this would allow analyzers that mark stopwords for deletion but then e.g. a SynonymGraphFilter could still apply over the stopwords, e.g. lord of the rings could still match properly even if of and the were marked as deleted.  But, that is a much larger change, and no need to hold up this first approach for that!

[Legacy Jira: Michael McCandless (@mikemccand) on Sep 02 2020]