Can we merge small segments during refresh, for faster searching? [LUCENE-8962]

mikemccand commented 5 years ago

Two improvements were added: 8.6 has merge-on-commit (by Froh et. all), 8.7 has merge-on-refresh (by Simon). See `MergePolicy.findFullFlushMerges`

The original description follows:

With near-real-time search we ask IndexWriter to write all in-memory segments to disk and open an IndexReader to search them, and this is typically a quick operation.

However, when you use many threads for concurrent indexing, IndexWriter will accumulate write many small segments during refresh and this then adds search-time cost as searching must visit all of these tiny segments.

The merge policy would normally quickly coalesce these small segments if given a little time ... so, could we somehow improve `IndexWriter'`s refresh to optionally kick off merge policy to merge segments below some threshold before opening the near-real-time reader? It'd be a bit tricky because while we are waiting for merges, indexing may continue, and new segments may be flushed, but those new segments shouldn't be included in the point-in-time segments returned by refresh ...

One could almost do this on top of Lucene today, with a custom merge policy, and some hackity logic to have the merge policy target small segments just written by refresh, but it's tricky to then open a near-real-time reader, excluding newly flushed but including newly merged segments since the refresh originally finished ...

I'm not yet sure how best to solve this, so I wanted to open an issue for discussion!

Legacy Jira details

LUCENE-8962 by Michael McCandless (@mikemccand) on Sep 03 2019, resolved Jun 27 2020 Attachments: failed-tests.patch, failure_log.txt, image-2021-03-22-10-36-32-201.png, LUCENE-8962_demo.png, test.diff Linked issues:

LUCENE-8331
- SOLR-14582
- LUCENE-8965

mikemccand commented 5 years ago

At Salesforce I worked on a custom merge policy to better address handling of small segments than TieredMergePolicy's choices. What's disappointing about TMP is that TMP insists on merging getSegmentsPerTier() (10) segments, even when they are small (below getFloorSegmentMB()). Instead we wanted some "cheap merges" of a smaller number of segments (even as few as 3 for us) that solely consist of the small segments. This cut our average segment count in half, although cost us more I/O – a trade-off we were happy with. I'd like to open-source this, perhaps as a direct change to TMP with defaults to do a similar amount of I/O but averaging fewer segments. The difficult part is doing simulations to prove out the theories.

Additionally, I worked on a custom MergeScheduler that executed those "cheap merges" synchronously (directly in the calling thread) while having the regular other merges pass through to the concurrent scheduler. The rationale wasn't tied to NRT but I could see NRT benefiting from this if getting an NRT searcher calls out to the merge code (I don't know if it does).

Perhaps your use-case could benefit from this as well. Unlike what you propose in the description, it doesn't involve changes/features to Lucene itself. WDYT?

[Legacy Jira: David Smiley (@dsmiley) on Sep 04 2019]

mikemccand commented 5 years ago

Thanks @dsmiley.

That sounds like a nice improvement to TMP, but I would want to match the more aggressive merging of those tiny segments w/ a refresh or commit, not run them in general since I think for pure indexing that'd hurt indexing throughput. I think the tricky part of this change is fixing IndexWriter refresh or commit to let the merge policy know it should now aggressively merge small segments, within a time or total size budget or something, while the refresh/commit operation waits, so that the returned segments have been merged, even while (concurrently) new segments are flushed.

Synchronous merges in merge scheduler sound interesting for this use case – maybe open a separate issue for that?

[Legacy Jira: Michael McCandless (@mikemccand) on Sep 04 2019]

mikemccand commented 5 years ago

The existing MergeTrigger mechanism is a way to conditionally activate this stuff. We already did that; I forgot to mention it. Thus it's only on a commit (not a flush) that activates this behavior. I'll create an issue for the merge scheduler and link here.

[Legacy Jira: David Smiley (@dsmiley) on Sep 04 2019]

mikemccand commented 4 years ago

I ended up needing something like this, not for NRT readers, but rather on commit.

I added a mechanism to compute cheap "commit merges" from within the prepareCommitInternal() call, and block until they complete (updating the "toCommit" SegmentInfos as they finish). I posted a PR for that here: https://github.com/apache/lucene-solr/pull/1155

I think we could do something similar from IndexWriter.getReader() to handle the NRT case, but I haven't tried working on that yet.

[Legacy Jira: Michael Froh on Jan 08 2020]

mikemccand commented 4 years ago

Here's a before and after comparison of the average number of segments searched per request since I applied this change (with a TieredMergePolicy subclass that tries to merge all segments smaller than 100MB into a single segment on commit, with floorSegmentMB of 500). It lowers the overall count, but especially significantly reduced the variance.

[Legacy Jira: Michael Froh on Jan 19 2020]

mikemccand commented 4 years ago

Thanks [~msokolov@gmail.com] for the feedback on the PR! I've updated it to incorporate your suggestions.

[Legacy Jira: Michael Froh on Jan 28 2020]

mikemccand commented 4 years ago

@msfroh as you can see above, I accomplished the effect here already in a different way without modifying Lucene. Not that I think we shouldn't modify Lucene altogether but I think the changes can be limited to implementations of MergePolicy & MergeScheduler without needing to modify the abstractions themselves or core Lucene, which are already sufficient. See LUCENE-8331 for a benchmark utility. I should resume this work.

[Legacy Jira: David Smiley (@dsmiley) on Jan 28 2020]

mikemccand commented 4 years ago

I finally got around to contributing the MergePolicy I developed for Salesforce and I linked it to this issue. I named it "EagerCheapMergePolicy" for the contribution. See the linked PR https://github.com/apache/lucene-solr/pull/1222 It makes no changes to Lucene internals. This component is one of two necessary pieces; the other is the merge scheduler that I already referred to.

Maybe we should debate here/now the merits of changing Lucene's APIs vs not. I'm showing here that it's not necessary; the APIs are sufficiently expressive to plug in and accomplish this feature, and it seems straight-forward to me (not a hack).

[Legacy Jira: David Smiley (@dsmiley) on Jan 29 2020]

mikemccand commented 4 years ago

@dsmiley IW normally kicks of merges after flushing and committing new segments, so I don't see how a merge policy can actually reduce the committed segment count?

Or, do you require the caller to call commit, waitForMerges, commit again, when using this policy? (And to not use any concurrent indexing threads)?

[Legacy Jira: Michael McCandless (@mikemccand) on Jan 30 2020]

mikemccand commented 4 years ago

Woah; I defer to your expertise @mikemccand. IndexWriter has a huge bus factor and I haven't delved into it. Still... I want to confirm what I think you are telling me. Based on my understanding of when merges are triggered and observed (by a reader/searcher), I wrote the following test on TestIndexWriterMergePolicy:

  public void testMergeOnCommitIsSearchable() throws IOException {
    try (
        Directory dir = newDirectory();
        IndexWriter writer = new IndexWriter(dir, newIndexWriterConfig(new MockAnalyzer(random()))
            .setMaxBufferedDocs(10)
            .setMergePolicy(new LogDocMergePolicy())
            .setMergeScheduler(new SerialMergeScheduler()))
    ) {
      for (int i = 0; i < 99; i++) {
        addDoc(writer);
        checkInvariants(writer);
      }
      assertEquals(9, writer.getSegmentCount());
      assertEquals(9, writer.getNumBufferedDocuments());
      writer.commit();
      try (DirectoryReader reader = DirectoryReader.open(writer)) {
        assertEquals(1, reader.getSequentialSubReaders().size());
      }
    }
  }

Generally speaking and scene here, after a commit, a search application will open a new reader to be able to search over the recently committed documents. In the scenario above, I index a bunch of documents, some of which have been flushed already, some pending. Also notice the SerialMergeScheduler so that the writing thread merges in-process / synchronously. Then see I open a NRT reader from the writer and count the segments. I get 1, because the flushed buffer will produce the 10th segment and the configured LogDocMergePolicy will merge altogether on 10.

The test passes. So; am I missing something?

[Legacy Jira: David Smiley (@dsmiley) on Jan 30 2020]

mikemccand commented 4 years ago

@dsmiley – in your test, the merge executes after the commit updates the IndexWriter's live SegmentInfos. When you call DirectoryReader.open, it takes another clone of that live SegmentInfos (which has 1 segment).

However, the clone of the SegmentInfos that was written in the commit is from before the merge. If you were to open a fresh DirectoryReader from the on-disk directory, I believe you would still see 9 segments.

With the approach I took, the cheap merge (or merges) asynchronously updates the commit's SegmentInfos clone before the commit happens.

[Legacy Jira: Michael Froh on Jan 30 2020]

mikemccand commented 4 years ago

IndexWriter has a huge bus factor

You mean small bus factor!

and I haven't delved into it.

You should dive into IW, to increase the bus factor.

The test passes. So; am I missing something?

Well, using SerialMergeScheduler allows the test to pass, since the merges kicked off due to new segments after the commit will run, synchronously (using the main thread in your test) to completion. And then you open a new NRT IndexReader directly from IndexWriter that sees only the one merged segment. If you make the test more realistic (use concurrent indexing threads, ConcurrentMergeScheduler), the assertion should fail. Or, if you opened the IndexReader from Directory instead, it should also fail.

I think in order to see the actual committed SegmentInfos reflect the "cheap" merges, we need to take an approach similar to @msfroh's PR.

[Legacy Jira: Michael McCandless (@mikemccand) on Jan 31 2020]

mikemccand commented 4 years ago

Indeed, I meant "small bus factor" :) I'm looking deeper at Michael Froh's PR this weekend, and by necessity the pertinent parts of IndexWriter.

Well, using SerialMergeScheduler allows the test to pass, since the merges kicked off due to new segments after the commit will run, synchronously (using the main thread in your test) to completion.

Yes, it's not "realistic", but my objective in this code snippet was merely to demonstrate that the combination of a merge policy and a merge scheduler have the ability to affect the searchable segments on commit when using the NRT Reader/Searcher. Apparently it doesn't work if a normal (non-NRT Reader/Searcher) is opened; I can see that. Maybe this is a shortcoming of IndexWriter; why shouldn't IW be consistent on this matter?

It's not apparent to me that we need a new method on the MergePolicy when the MergeTrigger parameter is able to differentiate the types of merges so that a MP is able to behave differently depending on the circumstance. Am I unclear on this?

[Legacy Jira: David Smiley (@dsmiley) on Feb 01 2020]

mikemccand commented 4 years ago

Yes, it's not "realistic", but my objective in this code snippet was merely to demonstrate that the combination of a merge policy and a merge scheduler have the ability to affect the searchable segments on commit when using the NRT Reader/Searcher.

OK indeed you are right – that particular combination will in fact make the "merged after commit" segments visible to the subsequent searchers.

Apparently it doesn't work if a normal (non-NRT Reader/Searcher) is opened; I can see that. Maybe this is a shortcoming of IndexWriter; why shouldn't IW be consistent on this matter?

I don't think this is a shortcoming of IW. Rather, this is the salient difference between non-NRT and NRT readers – the latter get to see the "latest" in-memory segments changes in IndexWriter while the former see only precisely what was last committed.

It's not apparent to me that we need a new method on the MergePolicy when the MergeTrigger parameter is able to differentiate the types of merges so that a MP is able to behave differently depending on the circumstance. Am I unclear on this?

Yeah I think you are right! That would be a nice simplification. Probably this can just be folded into the existing MergePolicy API as a different MergeTrigger. Though then I wonder why e.g. forceMerge or expungeDeletes are not also simply different triggers ... @msfroh what do you think?

[Legacy Jira: Michael McCandless (@mikemccand) on Feb 01 2020]

mikemccand commented 4 years ago

Yeah I think you are right! That would be a nice simplification. Probably this can just be folded into the existing MergePolicy API as a different MergeTrigger. Though then I wonder why e.g. forceMerge or expungeDeletes are not also simply different triggers ... Michael Froh what do you think?

As I was first writing this, I added a MergeTrigger.COMMIT value and used that, rather than adding a dedicated method.

Then I realized that any time I've ever written a custom implementation of MergePolicy.findMerges(), I've ignored the MergeTrigger value, because I didn't really care what triggered the merge – I just wanted to define the MergeSpecification. Even TieredMergePolicy.findMerges()} doesn't look at the MergeTrigger parameter.

If I had made IndexWriter call findMerges with a MergeTrigger.COMMIT trigger, anyone with a similar MergePolicy would have probably ended up running (and blocking on) some pretty expensive merges on commit. The best way I could think of to be backwards compatible with the "old" behavior by default was to add a no-op method to the base class.

Looking through the history, it looks like forceMerge and expungeDeletes predate MergeTrigger, so that could explain them.

I really like the idea of controlling this with a MergeTrigger, but I'm concerned about breaking existing MergePolicy implementations that ignore the MergeTrigger (which I suspect may be most of them).

[Legacy Jira: Michael Froh on Feb 03 2020]

mikemccand commented 4 years ago

Ahh that's a very good point @msfroh – +1 to add a new method for better back compat.

[Legacy Jira: Michael McCandless (@mikemccand) on Feb 04 2020]

mikemccand commented 4 years ago

+1 makes sense; I like the explicitness of the feature. It's a shame that in prepareCommitInternal, already a very complex method, suddenly is much more complex. This seems par for the course in IndexWriter that demands you look at it in a monitor in portrait mode (thankfully I have one).

I think IndexWriter would benefit from all merge-related implementation going off to a helper class (package accessible). It would help mere mortals digest the complexity of IndexWriter. Congrats Michael Froh for figuring this gnarly stuff out!

[Legacy Jira: David Smiley (@dsmiley) on Feb 05 2020]

mikemccand commented 4 years ago

Thanks @dsmiley and @msfroh! I think this is ready? I'll work on merging the PR soon ...

[Legacy Jira: Michael McCandless (@mikemccand) on Feb 18 2020]

mikemccand commented 4 years ago

Commit 043c5dff6f44c9bb2415005ac97db3c2c561ab45 in lucene-solr's branch refs/heads/master from msfroh https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=043c5df

LUCENE-8962: Add ability to selectively merge on commit (#1155)

LUCENE-8962: Add ability to selectively merge on commit

This adds a new "findCommitMerges" method to MergePolicy, which can specify merges to be executed before the IndexWriter.prepareCommitInternal method returns.

If we have many index writer threads, they will flush their DWPT buffers on commit, resulting in many small segments, which can be merged before the commit returns.

Add missing Javadoc
Fix incorrect comment
Refactoring and fix intermittent test failure

Made some changes to the callback to update toCommit, leveraging SegmentInfos.applyMergeChanges.
I realized that we'll never end up with 0 registered merges, because we throw an exception if we fail to register a merge.
Moved the IndexWriterEvents.beginMergeOnCommit notification to before we call MergeScheduler.merge, since we may not be merging on another thread.
There was an intermittent test failure due to randomness in the time it takes for merges to complete. Before doing the final commit, we wait for pending merges to finish. We may still end up abandoning the final merge, but we can detect that and assert that either the merge was abandoned (and we have > 1 segment) or we did merge down to 1 segment.

Fix typo
Fix/improve comments based on PR feedback
More comment improvements from PR feedback
Rename method and add new MergeTrigger

Renamed findCommitMerges -> findFullFlushMerges.
Added MergeTrigger.COMMIT, passed to findFullFlushMerges and to MergeScheduler when merging on commit.

Update renamed method name in strings and comments

[Legacy Jira: ASF subversion and git services on Mar 02 2020]

mikemccand commented 4 years ago

Commit 043c5dff6f44c9bb2415005ac97db3c2c561ab45 in lucene-solr's branch refs/heads/master from msfroh https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=043c5df

LUCENE-8962: Add ability to selectively merge on commit (#1155)

LUCENE-8962: Add ability to selectively merge on commit

This adds a new "findCommitMerges" method to MergePolicy, which can specify merges to be executed before the IndexWriter.prepareCommitInternal method returns.

If we have many index writer threads, they will flush their DWPT buffers on commit, resulting in many small segments, which can be merged before the commit returns.

Add missing Javadoc
Fix incorrect comment
Refactoring and fix intermittent test failure

Made some changes to the callback to update toCommit, leveraging SegmentInfos.applyMergeChanges.
I realized that we'll never end up with 0 registered merges, because we throw an exception if we fail to register a merge.
Moved the IndexWriterEvents.beginMergeOnCommit notification to before we call MergeScheduler.merge, since we may not be merging on another thread.
There was an intermittent test failure due to randomness in the time it takes for merges to complete. Before doing the final commit, we wait for pending merges to finish. We may still end up abandoning the final merge, but we can detect that and assert that either the merge was abandoned (and we have > 1 segment) or we did merge down to 1 segment.

Fix typo
Fix/improve comments based on PR feedback
More comment improvements from PR feedback
Rename method and add new MergeTrigger

Renamed findCommitMerges -> findFullFlushMerges.
Added MergeTrigger.COMMIT, passed to findFullFlushMerges and to MergeScheduler when merging on commit.

Update renamed method name in strings and comments

[Legacy Jira: ASF subversion and git services on Mar 02 2020]

mikemccand commented 4 years ago

Tests seemed to pass consistently on master, so I pushed there, but now I cherry-picked to 8x branch and see consistent failures there:

[junit4] Tests with failures [seed: 5036466712CCD5FE]: [junit4] - org.apache.lucene.index.TestIndexWriter.testThreadInterruptDeadlock [junit4] - org.apache.lucene.index.TestIndexWriterMergePolicy.testMergeOnCommit

[Legacy Jira: Michael Sokolov (@msokolov) on Mar 02 2020]

mikemccand commented 4 years ago

I'm looking into the branch_8x failures.

I'm able to reproduce on my machine and will step through to see what's different.

[Legacy Jira: Michael Froh on Mar 02 2020]

mikemccand commented 4 years ago

I think the failure in testMergeOnCommit occurs because of a difference in the random behavior of the test.

Specifically, sometimes the last writing thread happens to choose to commit() at the end, so there are no pending changes by the time we do the last commit() which should merge all segments (or abandon the merge, if it takes too long).

If we add one more doc before that last commit (ensuring that the anyChanges check in IndexWriter.prepareCommitInternal() is true), the test passes consistently.

I'm not sure why we don't see the same failure sometimes on master, though.

[Legacy Jira: Michael Froh on Mar 02 2020]

mikemccand commented 4 years ago

Regarding TestIndexWriter.testThreadInterruptDeadlock, I think that's a bug in the implementation.

When waiting for merges to complete, I added a catch for InterruptedException that sets the interrupt flag and throws an IOException. The documented behavior of IndexWriter is to clear the interrupt flag and throw ThreadInterruptedException.

Again, not sure why the tests on master didn't fail. Maybe we just got lucky with the branch_8x tests.

[Legacy Jira: Michael Froh on Mar 02 2020]

mikemccand commented 4 years ago

I was able to reproduce the testMergeOnCommit failure on master sometimes with the following options:

-Dtestcase=TestIndexWriterMergePolicy -Dtests.method=testMergeOnCommit -Dtests.seed=F8DD5AD20994FDDF -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=fi-FI -Dtests.timezone=America/Danmarkshavn -Dtests.asserts=true -Dtests.file.encoding=UTF-8

[Legacy Jira: Michael Froh on Mar 02 2020]

mikemccand commented 4 years ago

Posted a PR with fixes for the above test failures: https://github.com/apache/lucene-solr/pull/1307

[Legacy Jira: Michael Froh on Mar 02 2020]

mikemccand commented 4 years ago

Thanks, @msfroh I'll take a look

[Legacy Jira: Michael Sokolov (@msokolov) on Mar 02 2020]

mikemccand commented 4 years ago

Commit f017ae465ec416b9cf5ac91f9aa12ff71abd7de0 in lucene-solr's branch refs/heads/master from msfroh https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=f017ae4

LUCENE-8962: Fix intermittent test failures (#1307)

TestIndexWriterMergePolicy.testMergeOnCommit will fail if the last commit (the one that should trigger the full merge) doesn't have any pending changes (which could occur if the last indexing thread commits at the end). We can fix that by adding one more document before that commit.
The previous implementation was throwing IOException if the commit thread gets interrupted while waiting for merges to complete. This violates IndexWriter's documented behavior of throwing ThreadInterruptedException.

[Legacy Jira: ASF subversion and git services on Mar 02 2020]

mikemccand commented 4 years ago

Commit f017ae465ec416b9cf5ac91f9aa12ff71abd7de0 in lucene-solr's branch refs/heads/master from msfroh https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=f017ae4

LUCENE-8962: Fix intermittent test failures (#1307)

TestIndexWriterMergePolicy.testMergeOnCommit will fail if the last commit (the one that should trigger the full merge) doesn't have any pending changes (which could occur if the last indexing thread commits at the end). We can fix that by adding one more document before that commit.
The previous implementation was throwing IOException if the commit thread gets interrupted while waiting for merges to complete. This violates IndexWriter's documented behavior of throwing ThreadInterruptedException.

[Legacy Jira: ASF subversion and git services on Mar 02 2020]

mikemccand commented 4 years ago

Commit a1791e77143aa8087c0b5ee0e8eb57422e59a09a in lucene-solr's branch refs/heads/branch_8x from msfroh https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=a1791e7

LUCENE-8962: Add ability to selectively merge on commit (#1155)

LUCENE-8962: Add ability to selectively merge on commit

This adds a new "findCommitMerges" method to MergePolicy, which can specify merges to be executed before the IndexWriter.prepareCommitInternal method returns.

If we have many index writer threads, they will flush their DWPT buffers on commit, resulting in many small segments, which can be merged before the commit returns.

Add missing Javadoc
Fix incorrect comment
Refactoring and fix intermittent test failure

Made some changes to the callback to update toCommit, leveraging SegmentInfos.applyMergeChanges.
I realized that we'll never end up with 0 registered merges, because we throw an exception if we fail to register a merge.
Moved the IndexWriterEvents.beginMergeOnCommit notification to before we call MergeScheduler.merge, since we may not be merging on another thread.
There was an intermittent test failure due to randomness in the time it takes for merges to complete. Before doing the final commit, we wait for pending merges to finish. We may still end up abandoning the final merge, but we can detect that and assert that either the merge was abandoned (and we have > 1 segment) or we did merge down to 1 segment.

Fix typo
Fix/improve comments based on PR feedback
More comment improvements from PR feedback
Rename method and add new MergeTrigger

Renamed findCommitMerges -> findFullFlushMerges.
Added MergeTrigger.COMMIT, passed to findFullFlushMerges and to MergeScheduler when merging on commit.

Update renamed method name in strings and comments

[Legacy Jira: ASF subversion and git services on Mar 02 2020]

mikemccand commented 4 years ago

Commit a1791e77143aa8087c0b5ee0e8eb57422e59a09a in lucene-solr's branch refs/heads/branch_8x from msfroh https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=a1791e7

LUCENE-8962: Add ability to selectively merge on commit (#1155)

LUCENE-8962: Add ability to selectively merge on commit

This adds a new "findCommitMerges" method to MergePolicy, which can specify merges to be executed before the IndexWriter.prepareCommitInternal method returns.

If we have many index writer threads, they will flush their DWPT buffers on commit, resulting in many small segments, which can be merged before the commit returns.

Add missing Javadoc
Fix incorrect comment
Refactoring and fix intermittent test failure

Made some changes to the callback to update toCommit, leveraging SegmentInfos.applyMergeChanges.
I realized that we'll never end up with 0 registered merges, because we throw an exception if we fail to register a merge.
Moved the IndexWriterEvents.beginMergeOnCommit notification to before we call MergeScheduler.merge, since we may not be merging on another thread.
There was an intermittent test failure due to randomness in the time it takes for merges to complete. Before doing the final commit, we wait for pending merges to finish. We may still end up abandoning the final merge, but we can detect that and assert that either the merge was abandoned (and we have > 1 segment) or we did merge down to 1 segment.

Fix typo
Fix/improve comments based on PR feedback
More comment improvements from PR feedback
Rename method and add new MergeTrigger

Renamed findCommitMerges -> findFullFlushMerges.
Added MergeTrigger.COMMIT, passed to findFullFlushMerges and to MergeScheduler when merging on commit.

Update renamed method name in strings and comments

[Legacy Jira: ASF subversion and git services on Mar 02 2020]

mikemccand commented 4 years ago

Commit a5475de57fed6b339cd5565bd1bd2650f265a537 in lucene-solr's branch refs/heads/branch_8x from Michael Froh https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=a5475de

LUCENE-8962: Fix intermittent test failures

TestIndexWriterMergePolicy.testMergeOnCommit will fail if the last commit (the one that should trigger the full merge) doesn't have any pending changes (which could occur if the last indexing thread commits at the end). We can fix that by adding one more document before that commit.
The previous implementation was throwing IOException if the commit thread gets interrupted while waiting for merges to complete. This violates IndexWriter's documented behavior of throwing ThreadInterruptedException.

[Legacy Jira: ASF subversion and git services on Mar 02 2020]

mikemccand commented 4 years ago

Commit fdac6d866344611290c45c164112277581328bc9 in lucene-solr's branch refs/heads/branch_8x from Michael Sokolov https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=fdac6d8

Add CHANGES entry for LUCENE-8962

[Legacy Jira: ASF subversion and git services on Mar 02 2020]

mikemccand commented 4 years ago

Commit e308e538731f392eb81ba81cfb3ec5fc526fd383 in lucene-solr's branch refs/heads/master from Michael Sokolov https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=e308e53

Add CHANGES entry for LUCENE-8962

[Legacy Jira: ASF subversion and git services on Mar 02 2020]

mikemccand commented 4 years ago

pushed to master, and also backported to branch8x

[Legacy Jira: Michael Sokolov (@msokolov) on Mar 02 2020]

mikemccand commented 4 years ago

There are still test fails being reported by jenkins. Here's one that reproduced for me:

[repro] Repro line: ant test -Dtestcase=TestIndexWriterMergePolicy -Dtests.method=testMergeOnCommit -Dtests.seed=33907689C9809E73 -Dtests.multiplier=2 -Dtests.slow=true -Dtests.locale=th-TH-u-nu-thai-x-lvariant-TH -Dtests.timezone=Africa/Banjul -Dtests.asserts=true -Dtests.file.encoding=UTF-8

[Legacy Jira: Michael Sokolov (@msokolov) on Mar 03 2020]

mikemccand commented 4 years ago

I ended up splitting testMergeOnCommit into two test cases.

One runs through the basic invariants on a single thread and confirms that everything behaves as expected.

The other tries indexing and committing from multiple threads, but doesn't really make any assumptions about the segment topology in the end (since randomness and concurrency can lead to all kinds of possible valid segment counts). Instead it just verifies that it doesn't fail and doesn't lose any documents.

https://github.com/apache/lucene-solr/pull/1313

[Legacy Jira: Michael Froh on Mar 03 2020]

mikemccand commented 4 years ago

FYI this test seems to fail too because of this change:

NOTE: reproduce with: ant test  -Dtestcase=TestIndexWriterExceptions2 -Dtests.method=testBasics -Dtests.seed=1F5EDA9C353C018F -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=kkj-CM -Dtests.timezone=Etc/UCT -Dtests.asserts=true -Dtests.file.encoding=ISO-8859-1

[Legacy Jira: Adrien Grand (@jpountz) on Mar 04 2020]

mikemccand commented 4 years ago

I think I'll test and push @msfroh's test fix since it seems to fail frequently, while we investigate this new failure. It looks as if we are not properly handling the case when an IOException is thrown at some inopportune time?

[Legacy Jira: Michael Sokolov (@msokolov) on Mar 04 2020]

mikemccand commented 4 years ago

It's not immediately obvious to me how to fix the failure on TestIndexWriterExceptions2.

A merge on commit fails (because it's using CrankyCodec), closing the merge readers, which calls the custom mergeFinished override, which assumes the merge completed (since it wasn't aborted), and tries to reference the files for the merged segment (to increment their reference counts). That triggers an IllegalStateException because the files weren't set (because we didn't get that far in the merge).

Unfortunately, stepping through the debugger, I don't see a clear way of telling in mergeFinished that a merge failed. Obviously, I could wrap the call to SegmentCommitInfo.files() in a try-catch, and assume that the IllegalStateException means that the merge failed, but that would fail to properly handle the case where, say, an IOException occurred when committing the merge (after SegmentInfo.setFiles() was called, but before the files were actually written to disk).

I'm thinking of adding a boolean field to OneMerge that gets set once a merge is successfully committed (e.g. just before the call to closeMergeReaders in IndexWriter.commitMerge()), which the mergeFinished override can use to determine if the merge completed successfully or not.

[Legacy Jira: Michael Froh on Mar 04 2020]

mikemccand commented 4 years ago

I updated https://github.com/apache/lucene-solr/pull/1313 with that proposed fix (adding a boolean field to OneMerge that gets set once a merge is successfully committed).

[Legacy Jira: Michael Froh on Mar 04 2020]

mikemccand commented 4 years ago

Commit a030207a5e547a70db01d72fe4bd1627814ea94c in lucene-solr's branch refs/heads/master from Michael Sokolov https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=a030207

LUCENE-8962: Split test case (#1313)

LUCENE-8962: Simplify test case

The testMergeOnCommit test case was trying to verify too many things at once: basic semantics of merge on commit and proper behavior when a bunch of indexing threads are writing and committing all at once.

Now we just verify basic behavior, with strict assertions on invariants, while leaving it to MockRandomMergePolicy to enable merge on commit in existing test cases to verify that indexing generally works as expected and no new unexpected exceptions are thrown.

LUCENE-8962: Only update toCommit if merge was committed

The code was previously assuming that if mergeFinished() was called and isAborted() was false, then the merge must have completed successfully. Instead, we should know for sure if a given merge was committed, and only then update our pending commit SegmentInfos.

[Legacy Jira: ASF subversion and git services on Mar 05 2020]

mikemccand commented 4 years ago

Commit a030207a5e547a70db01d72fe4bd1627814ea94c in lucene-solr's branch refs/heads/master from Michael Sokolov https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=a030207

LUCENE-8962: Split test case (#1313)

LUCENE-8962: Simplify test case

The testMergeOnCommit test case was trying to verify too many things at once: basic semantics of merge on commit and proper behavior when a bunch of indexing threads are writing and committing all at once.

Now we just verify basic behavior, with strict assertions on invariants, while leaving it to MockRandomMergePolicy to enable merge on commit in existing test cases to verify that indexing generally works as expected and no new unexpected exceptions are thrown.

LUCENE-8962: Only update toCommit if merge was committed

The code was previously assuming that if mergeFinished() was called and isAborted() was false, then the merge must have completed successfully. Instead, we should know for sure if a given merge was committed, and only then update our pending commit SegmentInfos.

[Legacy Jira: ASF subversion and git services on Mar 05 2020]

mikemccand commented 4 years ago

Commit a030207a5e547a70db01d72fe4bd1627814ea94c in lucene-solr's branch refs/heads/master from Michael Sokolov https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=a030207

LUCENE-8962: Split test case (#1313)

LUCENE-8962: Simplify test case

The testMergeOnCommit test case was trying to verify too many things at once: basic semantics of merge on commit and proper behavior when a bunch of indexing threads are writing and committing all at once.

Now we just verify basic behavior, with strict assertions on invariants, while leaving it to MockRandomMergePolicy to enable merge on commit in existing test cases to verify that indexing generally works as expected and no new unexpected exceptions are thrown.

LUCENE-8962: Only update toCommit if merge was committed

The code was previously assuming that if mergeFinished() was called and isAborted() was false, then the merge must have completed successfully. Instead, we should know for sure if a given merge was committed, and only then update our pending commit SegmentInfos.

[Legacy Jira: ASF subversion and git services on Mar 05 2020]

mikemccand commented 4 years ago

Commit e5be034df2fc22f1b88e4d271b25c8fae1c3093f in lucene-solr's branch refs/heads/master from Michael McCandless https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=e5be034

LUCENE-8962: woops, remove leftover accidental copyright (darned IDEs)

[Legacy Jira: ASF subversion and git services on Mar 06 2020]

mikemccand commented 4 years ago

Commit 3dbfd102794419551f2ba4b43344cf9e6242a2b8 in lucene-solr's branch refs/heads/branch_8x from Michael McCandless https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=3dbfd10

LUCENE-8962: woops, remove leftover accidental copyright (darned IDEs)

[Legacy Jira: ASF subversion and git services on Mar 06 2020]

mikemccand commented 4 years ago

I read through this issue and I want to share some of my thoughts. First, I understand the need for this and the motivation, yet every time we add something like this to the IndexWriter to do something as part of another method it triggers an alarm on my end. I have spent hours and days thinking about how IW can be simpler and the biggest issues that I see is that the primitives on IW like commit or openReader are doing too much. Just look at openReader it's pretty involved and changing the bus factor or making it easier to understand is hard. Adding stuff like wait for merge with something like a timeout is not what I think we should do neither to openReader nor to commit.
That said, I think we can make the same things happen but we should think in primitives rather than changing method behavior with configuration. Let me explain what I mean:

Lets say we keep commit and openReader the way it is and would instead allow to use an existing reader NRT or not and allow itself to optimize itself (yeah I said that - it might be a good name after all). With a slightly refactored IW we can share the merge logic and let the reader re-write itself since we are talking about very small segments the overhead is very small. This would in turn mean that we are doing the work twice ie. the IW would do its normal work and might merge later etc. We might even merge this stuff into heap-space or so if we have enough I haven't thought too much about that. This way we can clean up IW potentially and add a very nice optimization that works for commit as well as NRT. We should strive for making IW simpler not do more. I hope I wasn't too discouraging.

[Legacy Jira: Simon Willnauer (@s1monw) on Mar 06 2020]

mikemccand commented 4 years ago

Commit 90aced5a51f92ffd6e97449eb7c44aacc643c8a3 in lucene-solr's branch refs/heads/branch_8x from Michael Sokolov https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=90aced5

LUCENE-8962: Split test case (#1313)

LUCENE-8962: Simplify test case

The testMergeOnCommit test case was trying to verify too many things at once: basic semantics of merge on commit and proper behavior when a bunch of indexing threads are writing and committing all at once.

Now we just verify basic behavior, with strict assertions on invariants, while leaving it to MockRandomMergePolicy to enable merge on commit in existing test cases to verify that indexing generally works as expected and no new unexpected exceptions are thrown.

LUCENE-8962: Only update toCommit if merge was committed

The code was previously assuming that if mergeFinished() was called and isAborted() was false, then the merge must have completed successfully. Instead, we should know for sure if a given merge was committed, and only then update our pending commit SegmentInfos.

[Legacy Jira: ASF subversion and git services on Mar 06 2020]

mikemccand commented 4 years ago

Commit 90aced5a51f92ffd6e97449eb7c44aacc643c8a3 in lucene-solr's branch refs/heads/branch_8x from Michael Sokolov https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=90aced5

LUCENE-8962: Split test case (#1313)

LUCENE-8962: Simplify test case

The testMergeOnCommit test case was trying to verify too many things at once: basic semantics of merge on commit and proper behavior when a bunch of indexing threads are writing and committing all at once.

Now we just verify basic behavior, with strict assertions on invariants, while leaving it to MockRandomMergePolicy to enable merge on commit in existing test cases to verify that indexing generally works as expected and no new unexpected exceptions are thrown.

LUCENE-8962: Only update toCommit if merge was committed

The code was previously assuming that if mergeFinished() was called and isAborted() was false, then the merge must have completed successfully. Instead, we should know for sure if a given merge was committed, and only then update our pending commit SegmentInfos.

[Legacy Jira: ASF subversion and git services on Mar 06 2020]

mikemccand commented 4 years ago

Commit 90aced5a51f92ffd6e97449eb7c44aacc643c8a3 in lucene-solr's branch refs/heads/branch_8x from Michael Sokolov https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=90aced5

LUCENE-8962: Split test case (#1313)

LUCENE-8962: Simplify test case

The testMergeOnCommit test case was trying to verify too many things at once: basic semantics of merge on commit and proper behavior when a bunch of indexing threads are writing and committing all at once.

Now we just verify basic behavior, with strict assertions on invariants, while leaving it to MockRandomMergePolicy to enable merge on commit in existing test cases to verify that indexing generally works as expected and no new unexpected exceptions are thrown.

LUCENE-8962: Only update toCommit if merge was committed

The code was previously assuming that if mergeFinished() was called and isAborted() was false, then the merge must have completed successfully. Instead, we should know for sure if a given merge was committed, and only then update our pending commit SegmentInfos.

[Legacy Jira: ASF subversion and git services on Mar 06 2020]

mikemccand / stargazers-migration-test

Can we merge small segments during refresh, for faster searching? [LUCENE-8962] #959

Legacy Jira details