Open jainankitk opened 1 year ago
Hey @jainankitk, thanks for reporting this. I would like to know more how OpenSearch will handle deletes if you disabled soft-deletes. My understanding soft-delete is required because of the immutable nature of Lucene segment files.
@nknize any thoughts?
@nknize any thoughts?
Yes, It was a conscious decision to reduce storage cost and improve performance of peer recoveries and I have yet to see a detailed reproducible scenario w/ the open source OpenSearch distribution to substantiate the need. So I don't suggest we do this without strong justification. Even then I'd look at other mechanisms before unilaterally rolling back mandatory soft deletes.
Enabling soft deletes can leave segments with large number of deleted documents impacting search performance.
@jainankitk can you post a reproducible benchmark including segment sizes, geometries, merge policy, etc. for further investigation? Full details are needed to determine this isn't a red herring request.
Out of curiosity what is the behavior when you lower index.soft_deletes.retention_lease.period
. This controls how long deletes are retained before being merged away. In the scenario you're experiencing this behavior, did you tracelog the merge before / after adjusting the retention lease? What's the segment geometry look like? Can you provide specific tracelog / cat API output clearly showing the causality.
removing bug
label since this is related to a performance concern / discussion.
Reporting a slightly different side effect (regression) due to soft deletes on indexing latency.
Observed that p100 indexing latency goes as high as 10+ seconds with soft deletes enabled where as the p100 latency was approx. 300 ms with soft deletes disabled. Thread dump revealed that the write thread is blocked on the IndexWriter object which is held by the refresh thread. The thread dumps are from OpenSearch 1.3 but I validated the same issue exists in latest OpenSearch version (2.11)
"opensearch[027d7adff6d2636bf859b1546e6ca1f9][write][T#48]" #596 daemon prio=5 os_prio=0 cpu=644849.64ms elapsed=166954.02s tid=0x00007f29dc160830 nid=0x71f waiting for monitor entry [0x00007f23076f6000]
java.lang.Thread.State: BLOCKED (on object monitor)
at org.apache.lucene.index.IndexWriter.getNextMerge(IndexWriter.java:2365)
- waiting to lock <0x00000010ea798fb0> (a org.apache.lucene.index.IndexWriter)
at org.apache.lucene.index.IndexWriter.access$800(IndexWriter.java:219)
at org.apache.lucene.index.IndexWriter$IndexWriterMergeSource.getNextMerge(IndexWriter.java:5898)
at org.apache.lucene.index.ConcurrentMergeScheduler.merge(ConcurrentMergeScheduler.java:533)
- locked <0x0000001296256ed0> (a org.opensearch.index.engine.InternalEngine$EngineMergeScheduler)
at org.apache.lucene.index.IndexWriter.executeMerge(IndexWriter.java:2293)
"opensearch[027d7adff6d2636bf859b1546e6ca1f9][refresh][T#1]" #501 daemon prio=5 os_prio=0 cpu=123408204.42ms elapsed=167420.28s tid=0x00007f2bc00076d0 nid=0x94b7 runnable [0x00007f2891cde000]
java.lang.Thread.State: RUNNABLE
at org.apache.lucene.codecs.lucene80.Lucene80DocValuesConsumer.writeValues(Lucene80DocValuesConsumer.java:198)
at org.apache.lucene.codecs.lucene80.Lucene80DocValuesConsumer.addNumericField(Lucene80DocValuesConsumer.java:126)
at org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat$FieldsWriter.addNumericField(PerFieldDocValuesFormat.java:109)
at org.apache.lucene.index.ReadersAndUpdates.handleDVUpdates(ReadersAndUpdates.java:367)
- locked <0x00000012320bf4c0> (a org.apache.lucene.index.ReadersAndUpdates)
at org.apache.lucene.index.ReadersAndUpdates.writeFieldUpdates(ReadersAndUpdates.java:579)
- locked <0x00000012320bf4c0> (a org.apache.lucene.index.ReadersAndUpdates)
at org.apache.lucene.index.ReaderPool.writeAllDocValuesUpdates(ReaderPool.java:228)
at org.apache.lucene.index.IndexWriter.writeReaderPool(IndexWriter.java:3629)
at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:624)
- locked <0x00000010ea798fb0> (a org.apache.lucene.index.IndexWriter)
- locked <0x0000001260d37da8> (a java.lang.Object)
at org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:293)
Further inspection revealed that the refresh thread is writing doc values for soft delete field.
method=org.apache.lucene.codecs.lucene80.Lucene80DocValuesConsumer.writeValues location=AtExit
ts=2023-10-16 10:33:41; [cost=2398.32054ms] result=@ArrayList[
@Object[][
@FieldInfo[
name=@String[__soft_deletes],
number=@Integer[18],
docValuesType=@DocValuesType[
Digging further on why doc value writes taking time, it showed min, max and gcd computation for the soft delete field consuming the time. Though the soft delete field value is hard coded to 1 in https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/common/lucene/Lucene.java#L980 it still consumes time as it iterates over million of documents.
`---ts=2023-10-16 09:33:45;thread_name=opensearch[7cdd429cbfcbc2619f34618050b2f9cf][refresh][T#1];id=de;is_daemon=true;priority=5;TCCL=jdk.internal.loader.ClassLoaders$AppClassLoader@5c29bfd
`---[83073.891311ms] org.apache.lucene.codecs.lucene80.Lucene80DocValuesConsumer:writeValues()
+---[0.00% 0.023908ms ] org.apache.lucene.codecs.DocValuesProducer:getSortedNumeric() #170
+---[0.00% 5.92E-4ms ] org.apache.lucene.codecs.lucene80.Lucene80DocValuesConsumer$MinMaxTracker:<init>() #172
+---[0.00% 6.97E-4ms ] org.apache.lucene.codecs.lucene80.Lucene80DocValuesConsumer$MinMaxTracker:<init>() #173
+---[8.20% min=3.55E-4ms,max=16.545724ms,total=6809.57583ms,count=15417420] org.apache.lucene.index.SortedNumericDocValues:nextDoc() #176
+---[7.71% min=3.4E-4ms,max=17.220172ms,total=6406.568143ms,count=15417419] org.apache.lucene.index.SortedNumericDocValues:docValueCount() #177
+---[7.76% min=3.44E-4ms,max=16.878148ms,total=6446.381854ms,count=15417419] org.apache.lucene.index.SortedNumericDocValues:nextValue() #178
+---[8.39% min=3.69E-4ms,max=17.628196ms,total=6971.536636ms,count=15417419] org.apache.lucene.codecs.lucene80.Lucene80DocValuesConsumer$MinMaxTracker:update() #191
+---[8.34% min=3.69E-4ms,max=16.76535ms,total=6928.286565ms,count=15417419] org.apache.lucene.codecs.lucene80.Lucene80DocValuesConsumer$MinMaxTracker:update() #192
+---[7.28% min=3.2E-4ms,max=17.074409ms,total=6047.507578ms,count=15417418] org.apache.lucene.util.MathUtil:gcd() #187
+---[0.00% min=5.13E-4ms,max=0.00148ms,total=0.682019ms,count=941] org.apache.lucene.codecs.lucene80.Lucene80DocValuesConsumer$MinMaxTracker:nextBlock() #194
+---[0.00% 8.53E-4ms ] org.apache.lucene.codecs.lucene80.Lucene80DocValuesConsumer$MinMaxTracker:finish() #207
+---[0.00% 6.76E-4ms ] org.apache.lucene.codecs.lucene80.Lucene80DocValuesConsumer$MinMaxTracker:finish() #208
+---[0.00% 9.59E-4ms ] org.apache.lucene.store.IndexOutput:getFilePointer() #226
+---[0.00% 0.002238ms ] org.apache.lucene.store.IndexOutput:writeLong() #227
+---[0.00% 0.008232ms ] org.apache.lucene.codecs.DocValuesProducer:getSortedNumeric() #228
+---[0.41% 342.714972ms ] org.apache.lucene.codecs.lucene80.IndexedDISI:writeBitSet() #229
+---[0.00% 6.18E-4ms ] org.apache.lucene.store.IndexOutput:getFilePointer() #230
+---[0.00% 0.001522ms ] org.apache.lucene.store.IndexOutput:writeLong() #230
+---[0.00% 5.47E-4ms ] org.apache.lucene.store.IndexOutput:writeShort() #231
+---[0.00% 7.11E-4ms ] org.apache.lucene.store.IndexOutput:writeByte() #232
+---[0.00% 7.69E-4ms ] org.apache.lucene.store.IndexOutput:writeLong() #235
+---[0.00% 0.0011ms ] org.apache.lucene.store.IndexOutput:writeInt() #241
+---[0.00% 5.87E-4ms ] org.apache.lucene.store.IndexOutput:writeByte() #277
+---[0.00% 6.13E-4ms ] org.apache.lucene.store.IndexOutput:writeLong() #278
+---[0.00% 6.04E-4ms ] org.apache.lucene.store.IndexOutput:writeLong() #279
+---[0.00% 5.44E-4ms ] org.apache.lucene.store.IndexOutput:getFilePointer() #280
+---[0.00% 6.09E-4ms ] org.apache.lucene.store.IndexOutput:writeLong() #281
+---[0.00% 5.28E-4ms ] org.apache.lucene.store.IndexOutput:getFilePointer() #288
+---[0.00% 5.99E-4ms ] org.apache.lucene.store.IndexOutput:writeLong() #288
`---[0.00% 5.77E-4ms ] org.apache.lucene.store.IndexOutput:writeLong() #289
As soft delete field is NumericDocValue field (https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/common/lucene/Lucene.java#L980), each doc value write with soft delete field computes the min, max and gcd for all the soft delete documents in the segment (https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90DocValuesConsumer.java#L203-L230) and hence as the number of soft delete document grows in a segment the time taken to write a new soft delete document increases proportionately.
Mitigation
Setting index.merge.policy.deletes_pct_allowed
to 5% (the lowest allowed) did accelerate the merging of segments with high deleted documents and consequently helped to reduce the p100 latency to approx. 3 seconds (from 10+ seconds) but it is still 10x higher than the p100 (300ms) with soft deletes disabled.
Questions
@nknize - Thoughts?
Describe the bug Soft deletes have been enforced from 2.x without any option to disable it through #1903. Enabling soft deletes can leave segments with large number of deleted documents impacting search performance.
Workaround is to lower the default flush threshold from 512MB to lower values like 1MB to ensure the documents can be marked eligible for being expunged. Soft deletes retention policy is based on trying to retain all deleted documents above a sequence number which is based on the minimum of global checkpoint (dependent on the all follower shard copies) and local checkpoint for the safe (durable) commit of the shard. The latter does not get updated due to infrequent flushes, and causing the policy to retain the documents.
To Reproduce Steps to reproduce the behavior: The behavior can be reproduced using update heavy workload where the document size is not very high. OS flushes will not get triggered as the default flush threshold requires translog to reach 512MB in size before the local checkpoint of safe commit on shard is updated
Expected behavior Since this can impact search performance and updating the flush threshold can have other consequences, we should provide option to disable soft deletes similar to OS 1.x
Plugins Please list all plugins currently enabled.
Screenshots If applicable, add screenshots to help explain your problem.
Host/Environment (please complete the following information):
Additional context Add any other context about the problem here.