org.apache.lucene.search.TestTopDocsMerge.testSort_1 failure [LUCENE-8819]

mikemccand commented 5 years ago

It can be reproduced with:

ant test  -Dtestcase=TestTopDocsMerge -Dtests.method=testSort_1 -Dtests.seed=E916688CE5BC9122 -Dtests.multiplier=3 -Dtests.slow=true -Dtests.locale=es-US -Dtests.timezone=Pacific/Johnston -Dtests.asserts=true -Dtests.file.encoding=ISO-8859-1

Test fails in master and branch 8.x but it does not fail in branch 8.1.

Legacy Jira details

LUCENE-8819 by Ignacio Vera (@iverase) on Jun 03 2019, updated Jun 10 2019 Attachments: LUCENE-8819.patch Linked issues:

LUCENE-8757
- LUCENE-8829

mikemccand commented 5 years ago

I think this test error was introduced by LUCENE-8757

[Legacy Jira: Ignacio Vera (@iverase) on Jun 03 2019]

mikemccand commented 5 years ago

I took a look at this and looks like the test failure is occuring from AssertingCollector's newly added check in LUCENE-8757 that asserts that LeafReaderContexts are ordered by docIDs i.e. a LeafReaderContext's docBase is greater than the predecessor's maxDoc.

I will dig deeper into this and update

[Legacy Jira: Atri Sharma (@atris) on Jun 03 2019]

mikemccand commented 5 years ago

Just for the record, I see other test failing and I believe it comes from the same issue, for example:

ant test  -Dtestcase=TestRegexpRandom2 -Dtests.method=testRegexps -Dtests.seed=215B9655D6767594 -Dtests.multiplier=2 -Dtests.nightly=true -Dtests.slow=true -Dtests.linedocsfile=/home/jenkins/jenkins-slave/workspace/Lucene-Solr-NightlyTests-8.x/test-data/enwiki.random.lines.txt -Dtests.locale=es-PY -Dtests.timezone=Asia/Kuala_Lumpur -Dtests.asserts=true -Dtests.file.encoding=UTF-8

[Legacy Jira: Ignacio Vera (@iverase) on Jun 03 2019]

mikemccand commented 5 years ago

Acked, same assert.

[Legacy Jira: Atri Sharma (@atris) on Jun 03 2019]

mikemccand commented 5 years ago

The problem is that the assert does not account for docIDs within a single segment i.e. when there are multiple documents within a segment collected by the same leaf collector. The assert should only be checking at segment boudaries to ensure that subsequent segments have the right sequence of DocIDs.

Attached is a patch to fix the same. I ran ant check and the guilty tests passed.

@iverase Could you check if this fix passes at your end as well?

LUCENE-8819.patch

[Legacy Jira: Atri Sharma (@atris) on Jun 03 2019]

mikemccand commented 5 years ago

Thanks @atris!

Unluckily the patch does not apply with the latest master. I believe Adrien might have made some minor edits to your patch before applying it and it is conflicting with this patch. Maybe you want to pull the latest master and then do the changes.

Sorry for the hassle.

[Legacy Jira: Ignacio Vera (@iverase) on Jun 03 2019]

mikemccand commented 5 years ago

@ivera Hmm, I was working on the original forked branch for 8757, now rebased to master. Interestingly, the problem I described above does not occur in master, thanks to Adrien's replacement of the said code block.

Running the tests in IntelliJ did not reproduce the problem for me.Running from the commandline triggers the following assert. I am not sure if I understand how 8757 can affect this since 8757 primarily introduces two changes, one which is triggered only during building LeafSlice contexts (which in turn is invoked only for parallel search, and the failing tests do not do parallel segment reads AFAIK), and tightening up asserts in AssertingCollector (which are not getting tripped). Is this the same stack that you see in test failure for TestRegexpRandom2?:

at junit.framework.Assert.fail(Assert.java:57)
   [junit4]    > at org.apache.lucene.search.CheckHits.checkEqual(CheckHits.java:205)
   [junit4]    > at org.apache.lucene.search.TestRegexpRandom2.assertSame(TestRegexpRandom2.java:178)
   [junit4]    > at org.apache.lucene.search.TestRegexpRandom2.testRegexps(TestRegexpRandom2.java:164)

[Legacy Jira: Atri Sharma (@atris) on Jun 03 2019]

mikemccand commented 5 years ago

Note that you need to add the seed and the multiplier in IntelliJ. In that case I am able to reproduce it in the debugger.

What I observe is that documents come in different order and naively my best guess is that is due to the order of execution that 8757 has introduced, but I might be totally wrong :).

[Legacy Jira: Ignacio Vera (@iverase) on Jun 03 2019]

mikemccand commented 5 years ago

@ivera I did set the seed in the test run (I am assuming you are also setting it in the VM configuration in test config setup?)

I do agree with you that 8757 is what changes the order of segments for parallel runs. I will dive deeper to see if I can find the root cause. Thanks for your investigation on this :)

[Legacy Jira: Atri Sharma (@atris) on Jun 03 2019]

mikemccand commented 5 years ago

I set these two arguments in the VM configuration:

-Dtests.seed=E916688CE5BC9122 -Dtests.multiplier=3

[Legacy Jira: Ignacio Vera (@iverase) on Jun 03 2019]

mikemccand commented 5 years ago

@ivera Thanks, that worked for TestTopDocsMerge (not for TestRandomRegExp2 though, are you using different seed and multiplier there?)

I investigated TestTopDocsMerge failure and listed the issue in https://issues.apache.org/jira/browse/LUCENE-8824

Let me know if it makes sense.

[Legacy Jira: Atri Sharma (@atris) on Jun 04 2019]

mikemccand commented 5 years ago

The second test worked for me adding the following VM configuration:

-Dtests.nightly=true -Dtests.seed=215B9655D6767594 -Dtests.multiplier=2

Note the nightly flag.

The new issue makes sense to me, thanks!

[Legacy Jira: Ignacio Vera (@iverase) on Jun 04 2019]

mikemccand commented 5 years ago

I had a look into the error in TestRandomRegExp2 and it looks more concerning.

The test executes two equivalent queries and expects the same results. The problem is that it is using two different random searchers and in this case one is using multiple threads (executor != null) and the other one is single threaded (executor == null). The effect is the same as above, documents gets different shard_index and results are different.

It sounds wrong to me that depending how you construct your searcher you might get different results.

[Legacy Jira: Ignacio Vera (@iverase) on Jun 04 2019]

mikemccand commented 5 years ago

@ivera Yes, you should not be getting different results. Note that the results themselves are not different in this case (I verified that by printing the results per searcher) but the same problem as TestTopDocsMerge – for a null sort order, shardIndex comes into play when tie breaking on equal scores. However, shardIndex is a tricky one – it is assigned based on the position of the originating collector where the hit occured in the overall array of Collectors. Hence, the order in which the results are returned is different.

I will look into this a bit more.

I think that the dependency on shardIndex as a tie breaker is a bit gray – why would you want to tie break on something that is viable to change and is volatile? Why not tie break on docID instead, which is the de facto standard (AFAIK, TopDocs assumes that documents are collected in docID order, so why not use it as a tie breaker?

[Legacy Jira: Atri Sharma (@atris) on Jun 04 2019]

mikemccand commented 5 years ago

TopDocs#merge has two use-cases, one is to merge results that come from multiple slices of the same IndexSearcher, another one is to merge results that come from different IndexSearchers (shards). In the latter case, tie-breaking by doc ID is not enough as you could have documents in multiple shards that share the same doc ID, which is why it tie-breaks by shard ID first.

We seem to get bitten by the fact that the merging of results from different slices treats each slice as a different shard, so this gives the expected results if for every X < Y, doc IDs of slice X are all less than doc IDs of slice Y.

I haven't looked deeply, but I guess my preferred option would be to have a way to signal to TopDocs#merge that all hits come from the same shard so that it would tie break directly by doc ID. But it looks like it requires quite some changes as it would break a number of assumptions.

[Legacy Jira: Adrien Grand (@jpountz) on Jun 04 2019]

mikemccand commented 5 years ago

@jpountz Yes, the problem is that there is no disambiguation between when the incoming Collectors belong to different IndexSearchers or to the same IndexSearcher.

I opened https://issues.apache.org/jira/browse/LUCENE-8829 to track this issue.

@jpountz Would it not be simpler if IndexSearcher#searchAfter disabled setting shardIndices (setShardIndex=false?). TopDocs#merge could use that as a signal and use docID to tie break iff sort order == null && setShardIndex=false. WDYT?

[Legacy Jira: Atri Sharma (@atris) on Jun 04 2019]

mikemccand commented 5 years ago

@jpountz I have added a patch to LUCENE-8829 to essentially switch to docID only tie breaking in case all hits come from the same shard. Please let me know if that seems fine.

[Legacy Jira: Atri Sharma (@atris) on Jun 05 2019]

mikemccand / stargazers-migration-test

org.apache.lucene.search.TestTopDocsMerge.testSort_1 failure [LUCENE-8819] #817

Legacy Jira details