Benchmark results for checkPendingFlushUpdate=false (i.e. with the param disabled by default)
Dataset: nyc_taxis
Server configuration:
Server: 1 node r6g.2xlarge (32G heap, 64G RAM) with 3 master nodes r6g.2xlarge (32G heap)
Client: c6g.8xlarge, 16 client instances
Command:
opensearch-benchmark execute_test --pipeline=benchmark-only --target-hosts <HOST> --workload nyc_taxis --offline --results-format csv --results-file ~/nyc_default_false --client-options timeout:60,use_ssl:false,verify_certs:false
Glossary
Candidate -> checkPendingFlushUpdate=false
Baseline -> checkPendingFlushUpdate=true
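For context, the flag under test corresponds to Lucene's IndexWriterConfig#setCheckPendingFlushUpdate. A minimal sketch of what the candidate toggles at the Lucene level (the directory path, analyzer, and RAM buffer value here are illustrative, not taken from the benchmark setup):

```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class CheckPendingFlushExample {
    public static void main(String[] args) throws Exception {
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        // Baseline behaviour (true): indexing threads check for pending DWPT flushes
        // on every update and help write those segments to disk.
        // Candidate behaviour (false): only the flushing/refresh thread performs the
        // flush work, and indexing threads keep indexing.
        config.setCheckPendingFlushUpdate(false);
        // Flush thresholds (RAM buffer size / buffered docs) are unchanged; only
        // *who* performs the flush differs between baseline and candidate.
        config.setRAMBufferSizeMB(512);

        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("/tmp/example-index")), config)) {
            // index documents as usual; this thread no longer helps with flushes
        }
    }
}
```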
Baseline CPU Profile:
Candidate CPU Profile:
Baseline JVM:
Candidate JVM:
Baseline CPU:
Candidate CPU:
Although tail latency (p99) has decreased and there is a slight improvement in indexing throughput, other metrics such as p90 and merge throttling time suffered.
What are your thoughts on the numbers and the overall idea of disabling the flush param, @mgodwan / @backslasht / @shwetathareja?
Thanks @CaptainDredge for the benchmark and the suggested tuning. I'm still going through the overall optimization and measurements. In the meantime, I have a few questions if you can help answer them:
write threads being blocked? I guess async-profiler/JFR should help in collecting this, if not already present.
Sure, I think I need to track a few more metrics during the benchmark run to address the above questions. Will try to add those and do 1-2 runs.
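One way to collect this (a sketch, assuming JFR is available on the server JVM; the event threshold, sleep duration, and output path are illustrative) is to record monitor-contention events around the run and then summarize how long threads were blocked:

```java
import java.nio.file.Path;
import java.time.Duration;

import jdk.jfr.Recording;
import jdk.jfr.consumer.RecordedEvent;
import jdk.jfr.consumer.RecordingFile;

public class LockProfileCapture {
    public static void main(String[] args) throws Exception {
        try (Recording recording = new Recording()) {
            // Record lock contention on monitors; the threshold filters out short waits.
            recording.enable("jdk.JavaMonitorEnter").withThreshold(Duration.ofMillis(1));
            recording.start();

            // ... run or wait for the indexing workload here ...
            Thread.sleep(Duration.ofMinutes(5).toMillis());

            recording.stop();
            Path out = Path.of("/tmp/indexing-lock-profile.jfr");
            recording.dump(out);

            // Quick summary: which threads were blocked and for how long.
            for (RecordedEvent event : RecordingFile.readAllEvents(out)) {
                if ("jdk.JavaMonitorEnter".equals(event.getEventType().getName())) {
                    System.out.println(event.getThread().getJavaName()
                            + " blocked for " + event.getDuration().toMillis() + " ms");
                }
            }
        }
    }
}
```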
A few questions on top of what @mgodwan asked:
Redid the benchmarks with additional metric tracking and got much more consistent results.
Benchmark results for checkPendingFlushUpdate=false (i.e. with the param disabled by default)
Dataset: nyc_taxis
Server configuration:
Server: 1 node r6g.2xlarge (32G heap, 64G RAM) with 3 master nodes r6g.2xlarge (32G heap)
Client: c6g.8xlarge, 5 client instances
Command:
opensearch-benchmark execute_test --pipeline=benchmark-only --target-hosts <HOST> --workload nyc_taxis --offline --results-format csv --results-file ~/nyc_default_false --client-options timeout:60,use_ssl:false,verify_certs:false
Glossary
Candidate -> checkPendingFlushUpdate=false
Baseline -> checkPendingFlushUpdate=true
CPU Profile comparison [A -> Candidate, B -> Baseline]
The doFlush and preUpdate functions, where threads check for pending flushes, are called more frequently in the candidate.
Refreshes are higher in the baseline due to the higher number of segments in the baseline.
Macro CPU:
JVM Comparison:
Lock profiles:
Candidate:
Baseline:
Flush metrics: The number of flushes is higher in the baseline because writer threads check for flush on each update, so flushes happen more frequently; however, total flush time is higher in the candidate.
The number of segments is lower in the candidate, which is expected since flushes occur less frequently.
The index writer holds segments in memory a little longer, but the memory occupied is not too concerning (<= 1 GB).
There is a 20% improvement in throughput, a 10% improvement in average latency, and an 80% improvement in tail latency (p99).
What are your thoughts on the numbers and the overall idea of disabling the flush param now, @mgodwan / @backslasht / @shwetathareja?
Next steps: I'm currently benchmarking with a custom workload that tests parallel indexing and search performance. I expect search performance to be much better in this case while indexing is going on.
Thanks @CaptainDredge for the numbers. Looking forward to seeing how search is impacted as well.
Can you highlight how checkPendingFlushUpdate comes into the picture in creating segments, i.e. the thresholds and the frequency at which it operates?
Thanks for the benchmarks @CaptainDredge, the numbers definitely look encouraging. It's a good surprise to see an overall throughput improvement as well. We should understand better whether the overall throughput improvement is coming from fewer segments, and in turn fewer merges, combined with lower latencies.
A few more questions to understand the results better:
Also, on the trade-off: in instances where the disk is slow and flushes are impacted (and can get piled up), what will happen? Currently, write threads help to catch up. At some point will we need to throttle the indexing?
Also, on the trade-off: in instances where the disk is slow and flushes are impacted (and can get piled up), what will happen? Currently, write threads help to catch up. At some point will we need to throttle the indexing?
I believe IndexingMemoryController should kick in to throttle the writes in such a case. https://github.com/opensearch-project/OpenSearch/blob/b4306b624e5546dd102a4ed45470564b31d5f3a0/server/src/main/java/org/opensearch/indices/IndexingMemoryController.java#L434-L436
That said, it will be helpful to see the benchmark behavior this change brings in when the resource under contention is disk IOPS/throughput.
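For readers following along, the behaviour of the linked code can be summarized roughly as below. This is a simplified sketch, not the actual OpenSearch implementation; ShardState is a hypothetical stand-in for the per-shard accounting OpenSearch keeps internally, and the 1.5x factor follows the description given later in this issue.

```java
import java.util.Comparator;
import java.util.List;

class ThrottlingSketch {
    static class ShardState {
        long bytesUsed;     // current indexing-buffer bytes held by this shard
        boolean throttled;  // when true, the shard is limited to one indexing thread
    }

    void maybeThrottle(List<ShardState> shards, long indexingBudgetBytes) {
        long totalBytes = shards.stream().mapToLong(s -> s.bytesUsed).sum();

        // Roughly: once usage exceeds ~1.5x the indexing budget, throttle the
        // largest shards first until usage drops back under the budget.
        if (totalBytes > indexingBudgetBytes * 3 / 2) {
            shards.sort(Comparator.comparingLong((ShardState s) -> s.bytesUsed).reversed());
            long remaining = totalBytes;
            for (ShardState shard : shards) {
                if (remaining <= indexingBudgetBytes) {
                    break;
                }
                shard.throttled = true;
                remaining -= shard.bytesUsed;
            }
        }
    }
}
```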
I benchmarked with a custom workload with a total uncompressed size of around 600 GB distributed equally among 3 indices. It indexes 50% of the document corpus initially. Document IDs are unique, so all index operations are append-only. Afterwards, ingestion occurs again with the remaining 50% of the document corpus, along with a couple of filter queries and aggregations which are run in parallel. I used a generous warmup period for ingestion (~30 mins) and a good number of warmup iterations (~100) for the search queries. No explicit refresh interval was set.
Baseline: checkPendingFlushOnUpdate = true
Candidate: checkPendingFlushOnUpdate = false
This includes one optimization which I identified through lock profiling, where we short-circuited a condition to prevent write threads from getting blocked just to check whether the size of the pending flushes queue is > 0. I opened an issue in Lucene for the same. This overall brought a 7% increase in throughput and a 6-7% latency improvement. Hereafter the results contain this optimization as well.
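Conceptually the short-circuit is small: check a cheap counter before contending on the monitor that guards the pending-flush queue. A sketch of the idea (not the actual Lucene patch; class and field names are illustrative):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.atomic.AtomicInteger;

// Keep a lock-free count of pending flushes so indexing threads can bail out
// without contending on the queue's monitor when there is nothing to do.
class PendingFlushQueueSketch<T> {
    private final Deque<T> pending = new ArrayDeque<>();
    private final AtomicInteger pendingCount = new AtomicInteger();

    void add(T flushTicket) {
        synchronized (pending) {
            pending.addLast(flushTicket);
        }
        pendingCount.incrementAndGet();
    }

    T pollOrNull() {
        // Short-circuit: most of the time there is nothing pending, so indexing
        // threads return immediately instead of blocking on the monitor below.
        if (pendingCount.get() == 0) {
            return null;
        }
        synchronized (pending) {
            T ticket = pending.pollFirst();
            if (ticket != null) {
                pendingCount.decrementAndGet();
            }
            return ticket;
        }
    }
}
```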
Test Procedure: Initial index append, similar to nyc_taxis
Test Procedure: Concurrent index append, i.e. ingesting 5% of the doc corpus in parallel with search
Test Procedure: Filter queries running in parallel with ingestion
Test Procedure: Aggregation queries running in parallel with ingestion
Overall Flush metrics
Average Latency: left of separator -> Candidate, right of separator -> Baseline
Flush Sizes (MB): Flush sizes are higher in the candidate, which means segment sizes are larger
Baseline:
Candidate:
Flush Time (ms): Flush times are higher in the candidate since each flush now writes a larger segment to disk
Baseline:
Candidate:
The number of flushes completed over time is still the same in both baseline and candidate
Notable conclusions from the experiment
Answering a few important questions
Why are we seeing a significant improvement in throughput? Write threads are now free from doing flushes, so more CPU bandwidth is available for indexing, which explains the increase in indexing throughput. The search throughput increase is explained by a decrease in search latency due to larger segments, i.e. a single segment now contains more docs.
Why are the p50 latencies degraded in the candidate for indexing? The flush code is only triggered if flushing is lagging behind. I was not able to explain the p50 degradation through any of the tracked metrics and async profiles. I tried a couple of things, like reducing locking at certain places, but the latency improvements were not enough to beat the baseline. Currently, my hypothesis is that since larger segments are being built due to the higher indexing throughput, the buffer allocations for them are taking more time. I'm trying to validate this hypothesis.
Why should total flush time increase with the candidate? Larger segments are getting flushed to disk now, so it takes more time to write them. It's a classic latency vs. throughput trade-off, where we write larger chunks to disk for better throughput at the expense of increased latency.
Will this cause a flush backlog? This can definitely cause a pile-up of flushes because write threads are not helping with flushing now, but when the flush backlog exceeds certain memory thresholds, IndexingMemoryController or DWPT flush-control throttling will kick in (depending on whether the backlog is at the node or shard level), which will block writes so flushing can catch up.
Will this be detrimental to search freshness? Yes, since segments become available for searching after a comparatively longer time because flushing takes longer. Disabling write-thread participation entirely is a naive optimisation that we have to improve on. We should ideally start helping with flushing somewhere between when flushing starts and when throttling starts, to restore the old behaviour when flushing is slow.
Thanks @CaptainDredge for the detailed analysis.
I see that we're seeing consistent improvement in indexing throughput and tail latencies (as expected, since write threads can now work on actual indexing instead of helping the refresh operation with flushing segments), but at the same time, p50 latency seems to have degraded for indexing (as you've pointed out). Do we understand the cause behind it now? It may help to add some trace logging in the active indexing path (i.e. queuing + processing within Lucene up to IndexingChain#processDocuments) to understand this further. Please update when you're able to validate the hypothesis you've suggested as well.
The search throughput increase is explained by a decrease in search latency due to larger segments, i.e. a single segment now contains more docs
Is there any metric confirming this hypothesis? Did you track the segment sizes over time for your concurrent indexing-search workload?
Larger segments are getting flushed to disk now, so it takes more time to write them.
Doesn't it also reduce the number of segments written? Does the cumulative flush time reported by OSB include the latency for the flush operation to finish, which is now not helped by write threads, thereby reducing concurrency and leading to higher perceived latency?
This can definitely cause a pile-up of flushes because write threads are not helping with flushing
Will this cause an increase in the overall time it takes for documents to be visible for search as well? Is there a hybrid approach we can try where we let the refresh thread take care of the largest segments while write threads still continue to help with smaller segments, so that the additional work write threads are doing is reduced through some setting (compared to 0 in the current experiments)?
Also, since refreshes for shards of a single index on a node are serialized, if we plan to incorporate this change, we should look into better scheduling of refreshes across shards as well (while ensuring fairness). In case of multiple shards of the same index on a node, we may be able to reduce index-level lag by moving the concurrency to a higher level.
Do we understand the cause behind it now? It may help to add some trace logging in the active indexing path (i.e. queuing + processing within Lucene up to IndexingChain#processDocuments) to understand this further. Please update when you're able to validate the hypothesis you've suggested as well.
I did a per-request trace analysis to find the source of the p50 time differences we were seeing between candidate and baseline. I divided the indexing code path into 3 mutually exclusive parts: 1) IndexIntoLucene (the function which delegates indexing to the Lucene index writer), 2) the translog#Add operation, and 3) MarkSeqNoAsProcessed. Initially the suspects were the translog add and mark-seqno-as-processed operations, because higher locking time was being observed in the collected JFR profiles, but after adding log tracing on a per-request basis their contribution to the added p50 latency was minimal, although we still see them consuming a little more time in the candidate, i.e. the pendingFlushCheck-disabled case. Here are the per-request comparison graphs for these 3 parts:
ShardBulkAction time distribution:
ShardBulkAction p50 latency: baseline=243440810.0, candidate=259049053.0, % diff=6.41
IndexIntoLucene time distribution:
IndexIntoLucene p50 latency: baseline=243440810.0, candidate=259049053.0, % diff=10.58
Translog#Add time distribution:
Translog#Add p50 latency: baseline=9797628.0, candidate=10864336.5, % diff=10.89
MarkSeqNoAsProcessed time distribution:
MarkSeqNoAsProcessed p50 latency: baseline=2603495.0, candidate=2651995.0, % diff=1.86
From the above charts, it is mainly the Lucene writer that is taking more time (~10%) to complete a bulk indexing request, which explains the p50 latency differences. It's not currently feasible to add request-level tracing inside Lucene to further explore which operations at the index-writer level are taking more time.
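For reference, the per-request breakdown compared above can be collected with timing instrumentation roughly along these lines (a minimal sketch; the method names and logging are illustrative rather than OpenSearch's actual trace framework):

```java
import java.util.concurrent.TimeUnit;

// Hypothetical timing wrapper around the three phases of the shard bulk path that
// were compared above: Lucene indexing, translog add, and marking the seqNo processed.
class IndexingPhaseTimer {
    void onShardBulkRequest(Runnable indexIntoLucene, Runnable translogAdd, Runnable markSeqNoProcessed) {
        long t0 = System.nanoTime();
        indexIntoLucene.run();
        long t1 = System.nanoTime();
        translogAdd.run();
        long t2 = System.nanoTime();
        markSeqNoProcessed.run();
        long t3 = System.nanoTime();

        // Emit one trace line per request; p50/p90 can then be computed offline by
        // aggregating these logs per run (baseline vs candidate).
        System.out.printf("indexIntoLucene=%dus translogAdd=%dus markSeqNo=%dus total=%dus%n",
                TimeUnit.NANOSECONDS.toMicros(t1 - t0),
                TimeUnit.NANOSECONDS.toMicros(t2 - t1),
                TimeUnit.NANOSECONDS.toMicros(t3 - t2),
                TimeUnit.NANOSECONDS.toMicros(t3 - t0));
    }
}
```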
A few other interesting points to note:
The above findings are for the custom workload we had, which does concurrent indexing and search. I experimented further with a lower refresh interval (1s) and found that this gives us the expected throughput improvements with improved tail latency and similar p50 latencies.
This should be helpful in many cases, so I've raised a PR to expose this setting: https://github.com/opensearch-project/OpenSearch/pull/12710
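A dynamic index setting along these lines could back such a toggle (a sketch only; the real setting key, default, and wiring live in the linked PR, and the key used below is hypothetical):

```java
import org.opensearch.common.settings.Setting;
import org.opensearch.common.settings.Setting.Property;

// Hypothetical dynamic index setting for toggling the Lucene flag; the actual name
// and default are defined in the linked PR, not here.
public final class CheckPendingFlushSetting {
    public static final Setting<Boolean> INDEX_CHECK_PENDING_FLUSH_ENABLED =
            Setting.boolSetting(
                    "index.check_pending_flush.enabled",   // hypothetical key
                    true,                                   // keep Lucene's default behaviour
                    Property.IndexScope,
                    Property.Dynamic);                      // dynamic, so it can be flipped on a live index
}
```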
Addressing concerns around refresh lags:
Will this cause an increase in the overall time it takes for documents to be visible for search as well? Is there a hybrid approach we can try where we let the refresh thread take care of the largest segments while write threads still continue to help with smaller segments, so that the additional work write threads are doing is reduced through some setting (compared to 0 in the current experiments)?
I traced the async refresh action to find out if we see added lag in refreshes in the candidate case, and I found it to be similar in both cases.
Refresh p50 lag: baseline=31196.0, candidate=28931.0, % diff=-7.26
cc: @mgodwan
Is your feature request related to a problem? Please describe
OpenSearch triggers a full flush on Lucene based on the translog size, or sometimes when a refresh is triggered, or when the flush API is called.
Sometimes indexing threads end up flushing segments to disk when a full flush has been triggered. This incurs latency overhead and reduces throughput for the same number of clients.
Looking deeper, it appears that indexing threads check for pending flushes on update in order to help flush indexing buffers to disk. We can turn this off by disabling the checkPendingFlushUpdate parameter in the index writer config.
Also, IndexingMemoryController reduces the total number of indexing threads per shard to 1 in order to throttle indexing when memory usage breaches 1.5 times the indexing buffer. It starts limiting the indexing rate with the shard that has the largest memory accounting.
Describe the solution you'd like
We should ideally keep checkPendingFlushUpdate disabled, but we do want the indexing threads to start helping with flushing somewhere between when flushing starts and when throttling starts, to restore the old behaviour when flushing is slow. checkForPendingFlush is part of LiveIndexWriterConfig, which means we can modify it on a live index writer instance. We can follow the existing procedure used for throttling in IndexingMemoryController to enable this flag: start with the shard that has the largest memory accounting and disable it again once memory is below the threshold.
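A rough sketch of that idea, relying on checkPendingFlushUpdate being a live config property (the surrounding hook and threshold are hypothetical; only the Lucene accessors are taken from the library):

```java
import org.apache.lucene.index.IndexWriter;

// Sketch of the proposed hybrid behaviour: keep checkPendingFlushUpdate off by
// default, and flip it back on for the most memory-heavy shards once flushing
// falls behind, before IndexingMemoryController throttling would kick in.
class PendingFlushToggleSketch {

    // Hypothetical hook, invoked periodically with the shard's writer and the
    // memory accounting that IndexingMemoryController already maintains.
    void adjust(IndexWriter writer, long shardIndexingBytes, long helpThresholdBytes) {
        boolean helpWithFlush = shardIndexingBytes > helpThresholdBytes;

        // checkPendingFlushUpdate is part of LiveIndexWriterConfig, so it can be
        // changed on a live writer without reopening the index.
        if (writer.getConfig().isCheckPendingFlushOnUpdate() != helpWithFlush) {
            writer.getConfig().setCheckPendingFlushUpdate(helpWithFlush);
        }
    }
}
```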
Related component
Indexing:Performance
Describe alternatives you've considered
No response
Additional context
No response