mikemccand / stargazers-migration-test

Testing Lucene's Jira -> GitHub issues migration
0 stars 0 forks source link

Make TestRandomChains check that filters preserve positions [LUCENE-8361] #361

Open mikemccand opened 6 years ago

mikemccand commented 6 years ago

Follow-up of LUCENE-8360: it is a bit disappointing that we only found this issue because of a newly introduced token filter. I'm wondering that we might be able to make TestRandomChains detect more bugs by verifying that the sum of position increments is preserved through the whole analysis chain.


Legacy Jira details

LUCENE-8361 by Adrien Grand (@jpountz) on Jun 18 2018, updated Jun 29 2018 Attachments: LUCENE-8361.patch

mikemccand commented 6 years ago

Here's a patch that adds a step to BaseTokenStreamTestCase.checkAnalysisConsistency.

I've already found a failure in MinHashTokenFilter, which raises the question of whether we still expect end() to report the total number of original tokens for filters that summarise the entire stream - the same will apply to FingerprintFilter and ConcatenateGraphFilter. I'm minded to just exclude filter chains that contain any of these from the test.

[Legacy Jira: Alan Woodward (@romseygeek) on Jun 18 2018]

mikemccand commented 6 years ago

TestRandomChains found another failure due to this, in LimitTokenOffsetFilter.

[Legacy Jira: Alan Woodward (@romseygeek) on Jun 21 2018]

mikemccand commented 6 years ago

Both those limit-filters are broken and buggy by default, they won't consume all the tokens unless you pass some special boolean.

[Legacy Jira: Robert Muir (@rmuir) on Jun 21 2018]

mikemccand commented 6 years ago

Both those limit-filters are broken and buggy by default

They're designed to short-cut tokenization (mainly for highlighting, I think) - do we have a non-buggy way of not consuming all tokens? Because I can see that it's a valid thing to do in some circumstances.

[Legacy Jira: Alan Woodward (@romseygeek) on Jun 29 2018]

mikemccand commented 6 years ago

We should disable some checks when the analysis chain includes one of these token filters that truncate the token stream. If we made them correct so that they consumed the wrapped token stream to eg. have access to the end offset, it would defeat their purpose a bit, which is to reduce the amount of work to do.

[Legacy Jira: Adrien Grand (@jpountz) on Jun 29 2018]

mikemccand commented 6 years ago

Strongly disagree with making our tests wimpy because of these buggy filters.

These filters have the ability to be correct or incorrect, the problem is that they choose to default to incorrect behavior. They should default to correct instead, and the ctor with the boolean option should be blacklisted in the test (since its known to be buggy). They will still achieve their purpose, but yeah it means the user should pay more attention to where they sit in the analysis chain and so on, that's all.

[Legacy Jira: Robert Muir (@rmuir) on Jun 29 2018]

mikemccand commented 6 years ago

It is also bad they behave this way by default as I imagine it only causes problems for highlighting fields with multiple values and so on if they are used.

[Legacy Jira: Robert Muir (@rmuir) on Jun 29 2018]

mikemccand commented 6 years ago

They're designed to short-cut tokenization (mainly for highlighting, I think) - do we have a non-buggy way of not consuming all tokens? Because I can see that it's a valid thing to do in some circumstances.

Yes, these filters have a boolean option to do this correctly. Its just not the default. This is really too bad, since somehow these bugs (which are like implementation details of how particular highlighters worked) made their way into the analysis module in such a way that its easy to put wrong offsets into your index.

[Legacy Jira: Robert Muir (@rmuir) on Jun 29 2018]

mikemccand commented 6 years ago

Only blacklisting the bad constructor works for me too.

[Legacy Jira: Adrien Grand (@jpountz) on Jun 29 2018]

mikemccand commented 6 years ago

Yes, the issue is that currently, all the constructors are bad :)

[Legacy Jira: Robert Muir (@rmuir) on Jun 29 2018]

mikemccand commented 6 years ago

Even if we fix the default ctor to behave as correct=true, I think these filters will still have problems because they don't do anything with position increments? So they can easily stop at the wrong place (eg middle of a graph), cause wrong posLengths, etc.

[Legacy Jira: Robert Muir (@rmuir) on Jun 29 2018]