Closed mikemccand closed 4 years ago
At Salesforce I worked on a custom merge policy to better address handling of small segments than TieredMergePolicy's choices. What's disappointing about TMP is that TMP insists on merging getSegmentsPerTier() (10) segments, even when they are small (below getFloorSegmentMB()). Instead we wanted some "cheap merges" of a smaller number of segments (even as few as 3 for us) that solely consist of the small segments. This cut our average segment count in half, although cost us more I/O – a trade-off we were happy with. I'd like to open-source this, perhaps as a direct change to TMP with defaults to do a similar amount of I/O but averaging fewer segments. The difficult part is doing simulations to prove out the theories.
Additionally, I worked on a custom MergeScheduler that executed those "cheap merges" synchronously (directly in the calling thread) while having the regular other merges pass through to the concurrent scheduler. The rationale wasn't tied to NRT but I could see NRT benefiting from this if getting an NRT searcher calls out to the merge code (I don't know if it does).
Perhaps your use-case could benefit from this as well. Unlike what you propose in the description, it doesn't involve changes/features to Lucene itself. WDYT?
[Legacy Jira: David Smiley (@dsmiley) on Sep 04 2019]
Thanks @dsmiley.
That sounds like a nice improvement to TMP, but I would want to match the more aggressive merging of those tiny segments w/ a refresh or commit, not run them in general since I think for pure indexing that'd hurt indexing throughput. I think the tricky part of this change is fixing IndexWriter
refresh or commit to let the merge policy know it should now aggressively merge small segments, within a time or total size budget or something, while the refresh/commit operation waits, so that the returned segments have been merged, even while (concurrently) new segments are flushed.
Synchronous merges in merge scheduler sound interesting for this use case – maybe open a separate issue for that?
[Legacy Jira: Michael McCandless (@mikemccand) on Sep 04 2019]
The existing MergeTrigger mechanism is a way to conditionally activate this stuff. We already did that; I forgot to mention it. Thus it's only on a commit (not a flush) that activates this behavior. I'll create an issue for the merge scheduler and link here.
[Legacy Jira: David Smiley (@dsmiley) on Sep 04 2019]
I ended up needing something like this, not for NRT readers, but rather on commit.
I added a mechanism to compute cheap "commit merges" from within the prepareCommitInternal() call, and block until they complete (updating the "toCommit" SegmentInfos as they finish). I posted a PR for that here: https://github.com/apache/lucene-solr/pull/1155
I think we could do something similar from IndexWriter.getReader() to handle the NRT case, but I haven't tried working on that yet.
[Legacy Jira: Michael Froh on Jan 08 2020]
Here's a before and after comparison of the average number of segments searched per request since I applied this change (with a TieredMergePolicy subclass that tries to merge all segments smaller than 100MB into a single segment on commit, with floorSegmentMB of 500). It lowers the overall count, but especially significantly reduced the variance.
[Legacy Jira: Michael Froh on Jan 19 2020]
Thanks [~msokolov@gmail.com] for the feedback on the PR! I've updated it to incorporate your suggestions.
[Legacy Jira: Michael Froh on Jan 28 2020]
@msfroh as you can see above, I accomplished the effect here already in a different way without modifying Lucene. Not that I think we shouldn't modify Lucene altogether but I think the changes can be limited to implementations of MergePolicy & MergeScheduler without needing to modify the abstractions themselves or core Lucene, which are already sufficient. See LUCENE-8331 for a benchmark utility. I should resume this work.
[Legacy Jira: David Smiley (@dsmiley) on Jan 28 2020]
I finally got around to contributing the MergePolicy I developed for Salesforce and I linked it to this issue. I named it "EagerCheapMergePolicy" for the contribution. See the linked PR https://github.com/apache/lucene-solr/pull/1222 It makes no changes to Lucene internals. This component is one of two necessary pieces; the other is the merge scheduler that I already referred to.
Maybe we should debate here/now the merits of changing Lucene's APIs vs not. I'm showing here that it's not necessary; the APIs are sufficiently expressive to plug in and accomplish this feature, and it seems straight-forward to me (not a hack).
[Legacy Jira: David Smiley (@dsmiley) on Jan 29 2020]
@dsmiley IW normally kicks of merges after flushing and committing new segments, so I don't see how a merge policy can actually reduce the committed segment count?
Or, do you require the caller to call commit, waitForMerges, commit again, when using this policy? (And to not use any concurrent indexing threads)?
[Legacy Jira: Michael McCandless (@mikemccand) on Jan 30 2020]
Woah; I defer to your expertise @mikemccand.
IndexWriter has a huge bus factor and I haven't delved into it. Still... I want to confirm what I think you are telling me. Based on my understanding of when merges are triggered and observed (by a reader/searcher), I wrote the following test on TestIndexWriterMergePolicy
:
public void testMergeOnCommitIsSearchable() throws IOException {
try (
Directory dir = newDirectory();
IndexWriter writer = new IndexWriter(dir, newIndexWriterConfig(new MockAnalyzer(random()))
.setMaxBufferedDocs(10)
.setMergePolicy(new LogDocMergePolicy())
.setMergeScheduler(new SerialMergeScheduler()))
) {
for (int i = 0; i < 99; i++) {
addDoc(writer);
checkInvariants(writer);
}
assertEquals(9, writer.getSegmentCount());
assertEquals(9, writer.getNumBufferedDocuments());
writer.commit();
try (DirectoryReader reader = DirectoryReader.open(writer)) {
assertEquals(1, reader.getSequentialSubReaders().size());
}
}
}
Generally speaking and scene here, after a commit, a search application will open a new reader to be able to search over the recently committed documents. In the scenario above, I index a bunch of documents, some of which have been flushed already, some pending. Also notice the SerialMergeScheduler so that the writing thread merges in-process / synchronously. Then see I open a NRT reader from the writer and count the segments. I get 1, because the flushed buffer will produce the 10th segment and the configured LogDocMergePolicy will merge altogether on 10.
The test passes. So; am I missing something?
[Legacy Jira: David Smiley (@dsmiley) on Jan 30 2020]
@dsmiley – in your test, the merge executes after the commit updates the IndexWriter
's live SegmentInfos
. When you call DirectoryReader.open
, it takes another clone of that live SegmentInfos (which has 1 segment).
However, the clone of the SegmentInfos
that was written in the commit is from before the merge. If you were to open a fresh DirectoryReader
from the on-disk directory, I believe you would still see 9 segments.
With the approach I took, the cheap merge (or merges) asynchronously updates the commit's SegmentInfos
clone before the commit happens.
[Legacy Jira: Michael Froh on Jan 30 2020]
IndexWriter has a huge bus factor
You mean small bus factor!
and I haven't delved into it.
You should dive into IW, to increase the bus factor.
The test passes. So; am I missing something?
Well, using SerialMergeScheduler
allows the test to pass, since the merges kicked off due to new segments after the commit will run, synchronously (using the main thread in your test) to completion. And then you open a new NRT IndexReader
directly from IndexWriter
that sees only the one merged segment. If you make the test more realistic (use concurrent indexing threads, ConcurrentMergeScheduler
), the assertion should fail. Or, if you opened the IndexReader
from Directory
instead, it should also fail.
I think in order to see the actual committed SegmentInfos
reflect the "cheap" merges, we need to take an approach similar to @msfroh
's PR.
[Legacy Jira: Michael McCandless (@mikemccand) on Jan 31 2020]
Indeed, I meant "small bus factor" :) I'm looking deeper at Michael Froh's PR this weekend, and by necessity the pertinent parts of IndexWriter.
Well, using
SerialMergeScheduler
allows the test to pass, since the merges kicked off due to new segments after the commit will run, synchronously (using the main thread in your test) to completion.
Yes, it's not "realistic", but my objective in this code snippet was merely to demonstrate that the combination of a merge policy and a merge scheduler have the ability to affect the searchable segments on commit when using the NRT Reader/Searcher. Apparently it doesn't work if a normal (non-NRT Reader/Searcher) is opened; I can see that. Maybe this is a shortcoming of IndexWriter; why shouldn't IW be consistent on this matter?
It's not apparent to me that we need a new method on the MergePolicy when the MergeTrigger parameter is able to differentiate the types of merges so that a MP is able to behave differently depending on the circumstance. Am I unclear on this?
[Legacy Jira: David Smiley (@dsmiley) on Feb 01 2020]
Yes, it's not "realistic", but my objective in this code snippet was merely to demonstrate that the combination of a merge policy and a merge scheduler have the ability to affect the searchable segments on commit when using the NRT Reader/Searcher.
OK indeed you are right – that particular combination will in fact make the "merged after commit" segments visible to the subsequent searchers.
Apparently it doesn't work if a normal (non-NRT Reader/Searcher) is opened; I can see that. Maybe this is a shortcoming of IndexWriter; why shouldn't IW be consistent on this matter?
I don't think this is a shortcoming of IW. Rather, this is the salient difference between non-NRT and NRT readers – the latter get to see the "latest" in-memory segments changes in IndexWriter
while the former see only precisely what was last committed.
It's not apparent to me that we need a new method on the MergePolicy when the MergeTrigger parameter is able to differentiate the types of merges so that a MP is able to behave differently depending on the circumstance. Am I unclear on this?
Yeah I think you are right! That would be a nice simplification. Probably this can just be folded into the existing MergePolicy
API as a different MergeTrigger
. Though then I wonder why e.g. forceMerge
or expungeDeletes
are not also simply different triggers ... @msfroh
what do you think?
[Legacy Jira: Michael McCandless (@mikemccand) on Feb 01 2020]
Yeah I think you are right! That would be a nice simplification. Probably this can just be folded into the existing MergePolicy API as a different MergeTrigger. Though then I wonder why e.g. forceMerge or expungeDeletes are not also simply different triggers ... Michael Froh what do you think?
As I was first writing this, I added a MergeTrigger.COMMIT
value and used that, rather than adding a dedicated method.
Then I realized that any time I've ever written a custom implementation of MergePolicy.findMerges()
, I've ignored the MergeTrigger
value, because I didn't really care what triggered the merge – I just wanted to define the MergeSpecification
. Even TieredMergePolicy.findMerges()
} doesn't look at the MergeTrigger
parameter.
If I had made IndexWriter
call findMerges
with a MergeTrigger.COMMIT
trigger, anyone with a similar MergePolicy
would have probably ended up running (and blocking on) some pretty expensive merges on commit. The best way I could think of to be backwards compatible with the "old" behavior by default was to add a no-op method to the base class.
Looking through the history, it looks like forceMerge
and expungeDeletes
predate MergeTrigger
, so that could explain them.
I really like the idea of controlling this with a MergeTrigger
, but I'm concerned about breaking existing MergePolicy
implementations that ignore the MergeTrigger
(which I suspect may be most of them).
[Legacy Jira: Michael Froh on Feb 03 2020]
Ahh that's a very good point @msfroh
– +1 to add a new method for better back compat.
[Legacy Jira: Michael McCandless (@mikemccand) on Feb 04 2020]
+1 makes sense; I like the explicitness of the feature. It's a shame that in prepareCommitInternal, already a very complex method, suddenly is much more complex. This seems par for the course in IndexWriter that demands you look at it in a monitor in portrait mode (thankfully I have one).
I think IndexWriter would benefit from all merge-related implementation going off to a helper class (package accessible). It would help mere mortals digest the complexity of IndexWriter. Congrats Michael Froh for figuring this gnarly stuff out!
[Legacy Jira: David Smiley (@dsmiley) on Feb 05 2020]
Thanks @dsmiley and @msfroh
! I think this is ready? I'll work on merging the PR soon ...
[Legacy Jira: Michael McCandless (@mikemccand) on Feb 18 2020]
Commit 043c5dff6f44c9bb2415005ac97db3c2c561ab45 in lucene-solr's branch refs/heads/master from msfroh https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=043c5df
LUCENE-8962: Add ability to selectively merge on commit (#1155)
This adds a new "findCommitMerges" method to MergePolicy, which can specify merges to be executed before the IndexWriter.prepareCommitInternal method returns.
If we have many index writer threads, they will flush their DWPT buffers on commit, resulting in many small segments, which can be merged before the commit returns.
Add missing Javadoc
Fix incorrect comment
Refactoring and fix intermittent test failure
Fix typo
Fix/improve comments based on PR feedback
More comment improvements from PR feedback
Rename method and add new MergeTrigger
[Legacy Jira: ASF subversion and git services on Mar 02 2020]
Commit 043c5dff6f44c9bb2415005ac97db3c2c561ab45 in lucene-solr's branch refs/heads/master from msfroh https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=043c5df
LUCENE-8962: Add ability to selectively merge on commit (#1155)
This adds a new "findCommitMerges" method to MergePolicy, which can specify merges to be executed before the IndexWriter.prepareCommitInternal method returns.
If we have many index writer threads, they will flush their DWPT buffers on commit, resulting in many small segments, which can be merged before the commit returns.
Add missing Javadoc
Fix incorrect comment
Refactoring and fix intermittent test failure
Fix typo
Fix/improve comments based on PR feedback
More comment improvements from PR feedback
Rename method and add new MergeTrigger
[Legacy Jira: ASF subversion and git services on Mar 02 2020]
Tests seemed to pass consistently on master, so I pushed there, but now I cherry-picked to 8x branch and see consistent failures there:
[junit4] Tests with failures [seed: 5036466712CCD5FE]: [junit4] - org.apache.lucene.index.TestIndexWriter.testThreadInterruptDeadlock [junit4] - org.apache.lucene.index.TestIndexWriterMergePolicy.testMergeOnCommit
[Legacy Jira: Michael Sokolov (@msokolov) on Mar 02 2020]
I'm looking into the branch_8x
failures.
I'm able to reproduce on my machine and will step through to see what's different.
[Legacy Jira: Michael Froh on Mar 02 2020]
I think the failure in testMergeOnCommit
occurs because of a difference in the random behavior of the test.
Specifically, sometimes the last writing thread happens to choose to commit()
at the end, so there are no pending changes by the time we do the last commit()
which should merge all segments (or abandon the merge, if it takes too long).
If we add one more doc before that last commit (ensuring that the anyChanges
check in IndexWriter.prepareCommitInternal()
is true
), the test passes consistently.
I'm not sure why we don't see the same failure sometimes on master, though.
[Legacy Jira: Michael Froh on Mar 02 2020]
Regarding TestIndexWriter.testThreadInterruptDeadlock
, I think that's a bug in the implementation.
When waiting for merges to complete, I added a catch
for InterruptedException
that sets the interrupt flag and throws an IOException
. The documented behavior of IndexWriter
is to clear the interrupt flag and throw ThreadInterruptedException
.
Again, not sure why the tests on master didn't fail. Maybe we just got lucky with the branch_8x tests.
[Legacy Jira: Michael Froh on Mar 02 2020]
I was able to reproduce the testMergeOnCommit
failure on master sometimes with the following options:
-Dtestcase=TestIndexWriterMergePolicy -Dtests.method=testMergeOnCommit -Dtests.seed=F8DD5AD20994FDDF -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=fi-FI -Dtests.timezone=America/Danmarkshavn -Dtests.asserts=true -Dtests.file.encoding=UTF-8
[Legacy Jira: Michael Froh on Mar 02 2020]
Posted a PR with fixes for the above test failures: https://github.com/apache/lucene-solr/pull/1307
[Legacy Jira: Michael Froh on Mar 02 2020]
Thanks, @msfroh
I'll take a look
[Legacy Jira: Michael Sokolov (@msokolov) on Mar 02 2020]
Commit f017ae465ec416b9cf5ac91f9aa12ff71abd7de0 in lucene-solr's branch refs/heads/master from msfroh https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=f017ae4
LUCENE-8962: Fix intermittent test failures (#1307)
[Legacy Jira: ASF subversion and git services on Mar 02 2020]
Commit f017ae465ec416b9cf5ac91f9aa12ff71abd7de0 in lucene-solr's branch refs/heads/master from msfroh https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=f017ae4
LUCENE-8962: Fix intermittent test failures (#1307)
[Legacy Jira: ASF subversion and git services on Mar 02 2020]
Commit a1791e77143aa8087c0b5ee0e8eb57422e59a09a in lucene-solr's branch refs/heads/branch_8x from msfroh https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=a1791e7
LUCENE-8962: Add ability to selectively merge on commit (#1155)
This adds a new "findCommitMerges" method to MergePolicy, which can specify merges to be executed before the IndexWriter.prepareCommitInternal method returns.
If we have many index writer threads, they will flush their DWPT buffers on commit, resulting in many small segments, which can be merged before the commit returns.
Add missing Javadoc
Fix incorrect comment
Refactoring and fix intermittent test failure
Fix typo
Fix/improve comments based on PR feedback
More comment improvements from PR feedback
Rename method and add new MergeTrigger
[Legacy Jira: ASF subversion and git services on Mar 02 2020]
Commit a1791e77143aa8087c0b5ee0e8eb57422e59a09a in lucene-solr's branch refs/heads/branch_8x from msfroh https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=a1791e7
LUCENE-8962: Add ability to selectively merge on commit (#1155)
This adds a new "findCommitMerges" method to MergePolicy, which can specify merges to be executed before the IndexWriter.prepareCommitInternal method returns.
If we have many index writer threads, they will flush their DWPT buffers on commit, resulting in many small segments, which can be merged before the commit returns.
Add missing Javadoc
Fix incorrect comment
Refactoring and fix intermittent test failure
Fix typo
Fix/improve comments based on PR feedback
More comment improvements from PR feedback
Rename method and add new MergeTrigger
[Legacy Jira: ASF subversion and git services on Mar 02 2020]
Commit a5475de57fed6b339cd5565bd1bd2650f265a537 in lucene-solr's branch refs/heads/branch_8x from Michael Froh https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=a5475de
LUCENE-8962: Fix intermittent test failures
[Legacy Jira: ASF subversion and git services on Mar 02 2020]
Commit fdac6d866344611290c45c164112277581328bc9 in lucene-solr's branch refs/heads/branch_8x from Michael Sokolov https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=fdac6d8
Add CHANGES entry for LUCENE-8962
[Legacy Jira: ASF subversion and git services on Mar 02 2020]
Commit e308e538731f392eb81ba81cfb3ec5fc526fd383 in lucene-solr's branch refs/heads/master from Michael Sokolov https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=e308e53
Add CHANGES entry for LUCENE-8962
[Legacy Jira: ASF subversion and git services on Mar 02 2020]
pushed to master, and also backported to branch8x
[Legacy Jira: Michael Sokolov (@msokolov) on Mar 02 2020]
There are still test fails being reported by jenkins. Here's one that reproduced for me:
[repro] Repro line: ant test -Dtestcase=TestIndexWriterMergePolicy -Dtests.method=testMergeOnCommit -Dtests.seed=33907689C9809E73 -Dtests.multiplier=2 -Dtests.slow=true -Dtests.locale=th-TH-u-nu-thai-x-lvariant-TH -Dtests.timezone=Africa/Banjul -Dtests.asserts=true -Dtests.file.encoding=UTF-8
[Legacy Jira: Michael Sokolov (@msokolov) on Mar 03 2020]
I ended up splitting testMergeOnCommit into two test cases.
One runs through the basic invariants on a single thread and confirms that everything behaves as expected.
The other tries indexing and committing from multiple threads, but doesn't really make any assumptions about the segment topology in the end (since randomness and concurrency can lead to all kinds of possible valid segment counts). Instead it just verifies that it doesn't fail and doesn't lose any documents.
https://github.com/apache/lucene-solr/pull/1313
[Legacy Jira: Michael Froh on Mar 03 2020]
FYI this test seems to fail too because of this change:
NOTE: reproduce with: ant test -Dtestcase=TestIndexWriterExceptions2 -Dtests.method=testBasics -Dtests.seed=1F5EDA9C353C018F -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=kkj-CM -Dtests.timezone=Etc/UCT -Dtests.asserts=true -Dtests.file.encoding=ISO-8859-1
[Legacy Jira: Adrien Grand (@jpountz) on Mar 04 2020]
I think I'll test and push @msfroh
's test fix since it seems to fail frequently, while we investigate this new failure. It looks as if we are not properly handling the case when an IOException is thrown at some inopportune time?
[Legacy Jira: Michael Sokolov (@msokolov) on Mar 04 2020]
It's not immediately obvious to me how to fix the failure on TestIndexWriterExceptions2
.
A merge on commit fails (because it's using CrankyCodec
), closing the merge readers, which calls the custom mergeFinished
override, which assumes the merge completed (since it wasn't aborted), and tries to reference the files for the merged segment (to increment their reference counts). That triggers an IllegalStateException
because the files weren't set (because we didn't get that far in the merge).
Unfortunately, stepping through the debugger, I don't see a clear way of telling in mergeFinished
that a merge failed. Obviously, I could wrap the call to SegmentCommitInfo.files()
in a try-catch, and assume that the IllegalStateException
means that the merge failed, but that would fail to properly handle the case where, say, an IOException occurred when committing the merge (after SegmentInfo.setFiles()
was called, but before the files were actually written to disk).
I'm thinking of adding a boolean
field to OneMerge
that gets set once a merge is successfully committed (e.g. just before the call to closeMergeReaders
in IndexWriter.commitMerge()
), which the mergeFinished
override can use to determine if the merge completed successfully or not.
[Legacy Jira: Michael Froh on Mar 04 2020]
I updated https://github.com/apache/lucene-solr/pull/1313 with that proposed fix (adding a boolean
field to OneMerge that gets set once a merge is successfully committed).
[Legacy Jira: Michael Froh on Mar 04 2020]
Commit a030207a5e547a70db01d72fe4bd1627814ea94c in lucene-solr's branch refs/heads/master from Michael Sokolov https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=a030207
LUCENE-8962: Split test case (#1313)
The testMergeOnCommit test case was trying to verify too many things at once: basic semantics of merge on commit and proper behavior when a bunch of indexing threads are writing and committing all at once.
Now we just verify basic behavior, with strict assertions on invariants, while leaving it to MockRandomMergePolicy to enable merge on commit in existing test cases to verify that indexing generally works as expected and no new unexpected exceptions are thrown.
The code was previously assuming that if mergeFinished() was called and isAborted() was false, then the merge must have completed successfully. Instead, we should know for sure if a given merge was committed, and only then update our pending commit SegmentInfos.
[Legacy Jira: ASF subversion and git services on Mar 05 2020]
Commit a030207a5e547a70db01d72fe4bd1627814ea94c in lucene-solr's branch refs/heads/master from Michael Sokolov https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=a030207
LUCENE-8962: Split test case (#1313)
The testMergeOnCommit test case was trying to verify too many things at once: basic semantics of merge on commit and proper behavior when a bunch of indexing threads are writing and committing all at once.
Now we just verify basic behavior, with strict assertions on invariants, while leaving it to MockRandomMergePolicy to enable merge on commit in existing test cases to verify that indexing generally works as expected and no new unexpected exceptions are thrown.
The code was previously assuming that if mergeFinished() was called and isAborted() was false, then the merge must have completed successfully. Instead, we should know for sure if a given merge was committed, and only then update our pending commit SegmentInfos.
[Legacy Jira: ASF subversion and git services on Mar 05 2020]
Commit a030207a5e547a70db01d72fe4bd1627814ea94c in lucene-solr's branch refs/heads/master from Michael Sokolov https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=a030207
LUCENE-8962: Split test case (#1313)
The testMergeOnCommit test case was trying to verify too many things at once: basic semantics of merge on commit and proper behavior when a bunch of indexing threads are writing and committing all at once.
Now we just verify basic behavior, with strict assertions on invariants, while leaving it to MockRandomMergePolicy to enable merge on commit in existing test cases to verify that indexing generally works as expected and no new unexpected exceptions are thrown.
The code was previously assuming that if mergeFinished() was called and isAborted() was false, then the merge must have completed successfully. Instead, we should know for sure if a given merge was committed, and only then update our pending commit SegmentInfos.
[Legacy Jira: ASF subversion and git services on Mar 05 2020]
Commit e5be034df2fc22f1b88e4d271b25c8fae1c3093f in lucene-solr's branch refs/heads/master from Michael McCandless https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=e5be034
LUCENE-8962: woops, remove leftover accidental copyright (darned IDEs)
[Legacy Jira: ASF subversion and git services on Mar 06 2020]
Commit 3dbfd102794419551f2ba4b43344cf9e6242a2b8 in lucene-solr's branch refs/heads/branch_8x from Michael McCandless https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=3dbfd10
LUCENE-8962: woops, remove leftover accidental copyright (darned IDEs)
[Legacy Jira: ASF subversion and git services on Mar 06 2020]
I read through this issue and I want to share some of my thoughts. First, I understand the need for this and the motivation, yet every time we add something like this to the IndexWriter to do something as part of another method it triggers an alarm on my end. I have spent hours and days thinking about how IW can be simpler and the biggest issues that I see is that the primitives on IW like commit or openReader are doing too much. Just look at openReader it's pretty involved and changing the bus factor or making it easier to understand is hard. Adding stuff like wait for merge with something like a timeout is not what I think we should do neither to openReader nor to commit.
That said, I think we can make the same things happen but we should think in primitives rather than changing method behavior with configuration. Let me explain what I mean:
Lets say we keep commit and openReader the way it is and would instead allow to use an existing reader NRT or not and allow itself to optimize itself (yeah I said that - it might be a good name after all). With a slightly refactored IW we can share the merge logic and let the reader re-write itself since we are talking about very small segments the overhead is very small. This would in turn mean that we are doing the work twice ie. the IW would do its normal work and might merge later etc. We might even merge this stuff into heap-space or so if we have enough I haven't thought too much about that. This way we can clean up IW potentially and add a very nice optimization that works for commit as well as NRT. We should strive for making IW simpler not do more. I hope I wasn't too discouraging.
[Legacy Jira: Simon Willnauer (@s1monw) on Mar 06 2020]
Commit 90aced5a51f92ffd6e97449eb7c44aacc643c8a3 in lucene-solr's branch refs/heads/branch_8x from Michael Sokolov https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=90aced5
LUCENE-8962: Split test case (#1313)
The testMergeOnCommit test case was trying to verify too many things at once: basic semantics of merge on commit and proper behavior when a bunch of indexing threads are writing and committing all at once.
Now we just verify basic behavior, with strict assertions on invariants, while leaving it to MockRandomMergePolicy to enable merge on commit in existing test cases to verify that indexing generally works as expected and no new unexpected exceptions are thrown.
The code was previously assuming that if mergeFinished() was called and isAborted() was false, then the merge must have completed successfully. Instead, we should know for sure if a given merge was committed, and only then update our pending commit SegmentInfos.
[Legacy Jira: ASF subversion and git services on Mar 06 2020]
Commit 90aced5a51f92ffd6e97449eb7c44aacc643c8a3 in lucene-solr's branch refs/heads/branch_8x from Michael Sokolov https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=90aced5
LUCENE-8962: Split test case (#1313)
The testMergeOnCommit test case was trying to verify too many things at once: basic semantics of merge on commit and proper behavior when a bunch of indexing threads are writing and committing all at once.
Now we just verify basic behavior, with strict assertions on invariants, while leaving it to MockRandomMergePolicy to enable merge on commit in existing test cases to verify that indexing generally works as expected and no new unexpected exceptions are thrown.
The code was previously assuming that if mergeFinished() was called and isAborted() was false, then the merge must have completed successfully. Instead, we should know for sure if a given merge was committed, and only then update our pending commit SegmentInfos.
[Legacy Jira: ASF subversion and git services on Mar 06 2020]
Commit 90aced5a51f92ffd6e97449eb7c44aacc643c8a3 in lucene-solr's branch refs/heads/branch_8x from Michael Sokolov https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=90aced5
LUCENE-8962: Split test case (#1313)
The testMergeOnCommit test case was trying to verify too many things at once: basic semantics of merge on commit and proper behavior when a bunch of indexing threads are writing and committing all at once.
Now we just verify basic behavior, with strict assertions on invariants, while leaving it to MockRandomMergePolicy to enable merge on commit in existing test cases to verify that indexing generally works as expected and no new unexpected exceptions are thrown.
The code was previously assuming that if mergeFinished() was called and isAborted() was false, then the merge must have completed successfully. Instead, we should know for sure if a given merge was committed, and only then update our pending commit SegmentInfos.
[Legacy Jira: ASF subversion and git services on Mar 06 2020]
Two improvements were added: 8.6 has merge-on-commit (by Froh et. all), 8.7 has merge-on-refresh (by Simon). See `MergePolicy.findFullFlushMerges`
The original description follows:
With near-real-time search we ask
IndexWriter
to write all in-memory segments to disk and open anIndexReader
to search them, and this is typically a quick operation.However, when you use many threads for concurrent indexing,
IndexWriter
will accumulate write many small segments duringrefresh
and this then adds search-time cost as searching must visit all of these tiny segments.The merge policy would normally quickly coalesce these small segments if given a little time ... so, could we somehow improve `IndexWriter'`s refresh to optionally kick off merge policy to merge segments below some threshold before opening the near-real-time reader? It'd be a bit tricky because while we are waiting for merges, indexing may continue, and new segments may be flushed, but those new segments shouldn't be included in the point-in-time segments returned by refresh ...
One could almost do this on top of Lucene today, with a custom merge policy, and some hackity logic to have the merge policy target small segments just written by refresh, but it's tricky to then open a near-real-time reader, excluding newly flushed but including newly merged segments since the refresh originally finished ...
I'm not yet sure how best to solve this, so I wanted to open an issue for discussion!
Legacy Jira details
LUCENE-8962 by Michael McCandless (@mikemccand) on Sep 03 2019, resolved Jun 27 2020 Attachments: failed-tests.patch, failure_log.txt, image-2021-03-22-10-36-32-201.png, LUCENE-8962_demo.png, test.diff Linked issues: