mikemccand opened 3 months ago
I'm trying to find a simple/clean way to do this accounting and it does not look simple. The `Executor` that's passed to `IndexSearcher` gets very opaque work units (`Callable`s) that cannot easily be associated back to the original `Query` + coordinator thread. I thought if we could open up `TaskExecutor`, maybe we could make a wrapper for it which tracks which thread is calling `.invokeAll` (which would be the same thread that called `IndexSearcher.search`), and we could track the CPU accounting this way. But this is an invasive change to Lucene's sources ...
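As a rough illustration of the wrapper idea (my own sketch, not Lucene or luceneutil code; `CpuTrackingExecutor` is a hypothetical name), an `Executor` wrapper can at least sum the CPU time its work units burn, via `ThreadMXBean`:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: an Executor wrapper that sums the CPU time each
// work unit burns on its worker thread.  The hard part -- attributing a
// Callable back to its originating query -- is NOT solved here.
public class CpuTrackingExecutor implements Executor {
  private static final ThreadMXBean BEAN = ManagementFactory.getThreadMXBean();
  private final Executor delegate;
  private final AtomicLong totalCpuNanos = new AtomicLong();

  public CpuTrackingExecutor(Executor delegate) {
    this.delegate = delegate;
  }

  @Override
  public void execute(Runnable task) {
    delegate.execute(() -> {
      // CPU nanos consumed so far by this worker thread
      long start = BEAN.getCurrentThreadCpuTime();
      try {
        task.run();
      } finally {
        totalCpuNanos.addAndGet(BEAN.getCurrentThreadCpuTime() - start);
      }
    });
  }

  public long totalCpuNanos() {
    return totalCpuNanos.get();
  }
}
```

This measures total CPU across all work units, but as noted above, without hooking `.invokeAll` it still cannot split that time out per query.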
We might also make a `CPUTrackingQuery` wrapper that wraps any `Query`, reaches down through `Weight` and `ScorerSupplier`, and then tries to measure start/end times (hmm, not sure how "end" could be measured -- there is no `.finish` called per leaf `Scorer`?). And anyway this feels too hacky/heavyweight.
Maybe @jpountz has some simple idea :)
Would it really be so terrible to add some accounting to Lucene's executor implementation? I thought about using the `Collector` API (which does have a start, when the `LeafCollector` gets created, and then a `finish()` later), but this doesn't capture all the work we do.
I understand why you are trying to measure CPU time, but it would have other issues. E.g. it would not report issues if `IndexSearcher#slices` or the merge policy somehow create segments that are not good for search parallelism. I am tempted to keep things as they are and report a QPS which is essentially a QPS under a concurrency of 1 from the user's perspective. After all, it's not that uncommon for user-facing search systems to be overprovisioned enough that you don't have many concurrent searches running at the same time. So it would still measure something meaningful.
Maybe we should rather look into running search tasks twice: once with an executor of size 1, which should be an ok proxy for throughput, and once with an executor of size N>1 (e.g. N=8 like you suggested) to measure Lucene's ability to take advantage of "free" CPUs to improve latency.
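The two-pass idea can be sketched like so (toy CPU-bound tasks, nothing from luceneutil; `TwoPassBench` is a made-up name):

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.IntStream;

// Hypothetical sketch of the two-pass idea: time the same batch of tasks
// once on a single-threaded executor (a throughput proxy) and once on a
// pool of N threads (a proxy for using "free" CPUs to improve latency).
public class TwoPassBench {
  static long runAll(ExecutorService pool, List<Callable<Long>> tasks) throws Exception {
    long start = System.nanoTime();
    for (Future<Long> f : pool.invokeAll(tasks)) {
      f.get();  // propagate any failure
    }
    return System.nanoTime() - start;
  }

  public static void main(String[] args) throws Exception {
    // synthetic CPU-bound "search tasks"
    List<Callable<Long>> tasks = IntStream.range(0, 32)
        .mapToObj(i -> (Callable<Long>) () -> {
          long x = 0;
          for (int j = 0; j < 5_000_000; j++) x += j;
          return x;
        })
        .toList();

    ExecutorService single = Executors.newFixedThreadPool(1);
    ExecutorService pooled = Executors.newFixedThreadPool(8);  // N=8 as suggested
    long t1 = runAll(single, tasks);
    long t8 = runAll(pooled, tasks);
    System.out.printf("size-1: %.1f ms, size-8: %.1f ms, speedup %.2fx%n",
        t1 / 1e6, t8 / 1e6, (double) t1 / t8);
    single.shutdown();
    pooled.shutdown();
  }
}
```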
Yeah +1 to measure the speedup we see due to concurrency "working" well too.
But using a single thread as a proxy for throughput wouldn't measure cases where Lucene can save CPU through better cross-segment comparisons? Or perhaps those improvements would come through (ish?) by measuring effective QPS using intra-query concurrency? But not always: if the long-pole slice for a segment still takes X CPU, but thanks to an innovation the other slices were able to terminate sooner / skip better, the QPS would not change even though total CPU improved.
Maybe the best overall solution for measuring "Lucene CPU improvements" is to build a true red-line test... why try to build tricky proxy metrics heh :)
I've made some progress here ...
I added an "actual QPS" measure, recorded by `SearchPerfTest` and reported for each iteration of each competitor. It discards warmup time (currently hardwired to 5 seconds), and also discards wind-down time (from when the first thread notices there are no more tasks in the queue). Within that window it measures the actual QPS achieved, and it also taps into JMX metrics over the same period to report actual CPU utilization.
It looks like this:
```
Search:
iter 0
base2:
log: /l/logs/atoa.base2.0 + stdout
run: /usr/bin/perf stat -dd /usr/lib/jvm/java-21-openjdk/bin/java --add-modules jdk.incubator.vector -Xmx32g -Xms32g -server -XX:+UseParallelGC -Djava.io.tmpdir=/l/tmp -XX:StartFlightRecording=dumponexit=true,maxsize=250M,settings=/l/util/src/python/profi\
ling.jfc,filename=/l/logs/bench-search-atoa-base2-0.jfr -XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints -classpath /l/trunk/lucene/core/build/libs/lucene-core-10.0.0-SNAPSHOT.jar:/l/trunk/lucene/sandbox/build/classes/java/main:/l/trunk/lucene/misc/build/\
classes/java/main:/l/trunk/lucene/facet/build/classes/java/main:/l/trunk/lucene/analysis/common/build/classes/java/main:/l/trunk/lucene/analysis/icu/build/classes/java/main:/l/trunk/lucene/queryparser/build/classes/java/main:/l/trunk/lucene/grouping/build/class\
es/java/main:/l/trunk/lucene/suggest/build/classes/java/main:/l/trunk/lucene/highlighter/build/classes/java/main:/l/trunk/lucene/codecs/build/classes/java/main:/l/trunk/lucene/queries/build/classes/java/main:/l/trunk/lucene/join/build/classes/java/main:/l/util/\
lib/HdrHistogram.jar:/l/util/build perf.SearchPerfTest -dirImpl MMapDirectory -indexPath /l/indices/mikes-vector-test -analyzer StandardAnalyzerNoStopWords -taskSource /l/util/1876.tasks -numConcurrentQueries 12 -taskRepeatCount 500 -field body -tasksPerCat 5 -\
searchConcurrency 16 -staticSeed -596438 -seed -5161207 -similarity BM25Similarity -commit multi -hiliteImpl FastVectorHighlighter -log /l/logs/atoa.base2.0 -topN 100 -vectorFile /lucenedata/enwiki/cohere-wikipedia-queries-768.vec -vectorDimension 768
63.5 s
254.5 QPS; 15.8 CPU cores
base1:
log: /l/logs/atoa.base1.0 + stdout
run: /usr/bin/perf stat -dd /usr/lib/jvm/java-21-openjdk/bin/java --add-modules jdk.incubator.vector -Xmx32g -Xms32g -server -XX:+UseParallelGC -Djava.io.tmpdir=/l/tmp -XX:StartFlightRecording=dumponexit=true,maxsize=250M,settings=/l/util/src/python/profi\
ling.jfc,filename=/l/logs/bench-search-atoa-base1-0.jfr -XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints -classpath /l/trunk/lucene/core/build/libs/lucene-core-10.0.0-SNAPSHOT.jar:/l/trunk/lucene/sandbox/build/classes/java/main:/l/trunk/lucene/misc/build/\
classes/java/main:/l/trunk/lucene/facet/build/classes/java/main:/l/trunk/lucene/analysis/common/build/classes/java/main:/l/trunk/lucene/analysis/icu/build/classes/java/main:/l/trunk/lucene/queryparser/build/classes/java/main:/l/trunk/lucene/grouping/build/class\
es/java/main:/l/trunk/lucene/suggest/build/classes/java/main:/l/trunk/lucene/highlighter/build/classes/java/main:/l/trunk/lucene/codecs/build/classes/java/main:/l/trunk/lucene/queries/build/classes/java/main:/l/trunk/lucene/join/build/classes/java/main:/l/util/\
lib/HdrHistogram.jar:/l/util/build perf.SearchPerfTest -dirImpl MMapDirectory -indexPath /l/indices/mikes-vector-test -analyzer StandardAnalyzerNoStopWords -taskSource /l/util/1876.tasks -numConcurrentQueries 12 -taskRepeatCount 500 -field body -tasksPerCat 5 -\
searchConcurrency 32 -staticSeed -596438 -seed -5161207 -similarity BM25Similarity -commit multi -hiliteImpl FastVectorHighlighter -log /l/logs/atoa.base1.0 -topN 100 -vectorFile /lucenedata/enwiki/cohere-wikipedia-queries-768.vec -vectorDimension 768
57.7 s
286.6 QPS; 30.0 CPU cores
```
In this case, `base2` used `searchConcurrency=16` and `base1` used `searchConcurrency=32` (on a 32-core Raptor Lake box). It's curious that doubling the search concurrency yielded only a smallish improvement in QPS, and that we were not actually able to saturate all 32 cores (just 30). I think this is showing that hyperthreaded cores are not as good as "real" cores :)
Note that these numbers are not necessarily a red-line QPS measurement, since it is up to the test runner to configure enough concurrency to saturate CPU cores, IO, or some other bottleneck. It is simply the actual QPS the test achieved ... when testing for query latency (the per-task "effective QPS" we normally report), one should not run near red-line if they want an accurate approximation of the total CPU cost of each query.
Next step I'll try to add a red-line QPS to nightly benchmarks ...
I ran this same `perf.py` on beast3, which has many cores (64 real, 128 with hyperthreading). Curiously, QPS also doesn't improve much when I double `searchConcurrency` from 16 to 32:
```
base2:
log: /l/logs/atoa.base2.0 + stdout
run: /usr/bin/perf stat -dd /usr/lib/jvm/java-21-openjdk/bin/java --add-modules jdk.incubator.vector -Xmx32g -Xms32g -server -XX:+UseParallelGC -Djava.io.tmpdir=/l/tmp -XX:StartFlightRecording=dumponexit=true,maxsize=250M,settings=/l/util/src/python/profi\
ling.jfc,filename=/l/logs/bench-search-atoa-base2-0.jfr -XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints -classpath /l/trunk/lucene/core/build/libs/lucene-core-10.0.0-SNAPSHOT.jar:/l/trunk/lucene/sandbox/build/classes/java/main:/l/trunk/lucene/misc/build/\
classes/java/main:/l/trunk/lucene/facet/build/classes/java/main:/l/trunk/lucene/analysis/common/build/classes/java/main:/l/trunk/lucene/analysis/icu/build/classes/java/main:/l/trunk/lucene/queryparser/build/classes/java/main:/l/trunk/lucene/grouping/build/class\
es/java/main:/l/trunk/lucene/suggest/build/classes/java/main:/l/trunk/lucene/highlighter/build/classes/java/main:/l/trunk/lucene/codecs/build/classes/java/main:/l/trunk/lucene/queries/build/classes/java/main:/l/trunk/lucene/join/build/classes/java/main:/l/util/\
lib/HdrHistogram.jar:/l/util/build perf.SearchPerfTest -dirImpl MMapDirectory -indexPath /l/indices/mikes-vector-test -analyzer StandardAnalyzerNoStopWords -taskSource /l/util/408330.tasks -numConcurrentQueries 24 -taskRepeatCount 500 -field body -tasksPerCat 5\
-searchConcurrency 16 -staticSeed -9982474 -seed 882824 -similarity BM25Similarity -commit multi -hiliteImpl FastVectorHighlighter -log /l/logs/atoa.base2.0 -topN 100 -vectorFile /lucenedata/enwiki/cohere-wikipedia-queries-768.vec -vectorDimension 768
105.9 s
135.6 actual sustained QPS; 15.5 CPU cores used
base1:
log: /l/logs/atoa.base1.0 + stdout
run: /usr/bin/perf stat -dd /usr/lib/jvm/java-21-openjdk/bin/java --add-modules jdk.incubator.vector -Xmx32g -Xms32g -server -XX:+UseParallelGC -Djava.io.tmpdir=/l/tmp -XX:StartFlightRecording=dumponexit=true,maxsize=250M,settings=/l/util/src/python/profi\
ling.jfc,filename=/l/logs/bench-search-atoa-base1-0.jfr -XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints -classpath /l/trunk/lucene/core/build/libs/lucene-core-10.0.0-SNAPSHOT.jar:/l/trunk/lucene/sandbox/build/classes/java/main:/l/trunk/lucene/misc/build/\
classes/java/main:/l/trunk/lucene/facet/build/classes/java/main:/l/trunk/lucene/analysis/common/build/classes/java/main:/l/trunk/lucene/analysis/icu/build/classes/java/main:/l/trunk/lucene/queryparser/build/classes/java/main:/l/trunk/lucene/grouping/build/class\
es/java/main:/l/trunk/lucene/suggest/build/classes/java/main:/l/trunk/lucene/highlighter/build/classes/java/main:/l/trunk/lucene/codecs/build/classes/java/main:/l/trunk/lucene/queries/build/classes/java/main:/l/trunk/lucene/join/build/classes/java/main:/l/util/\
lib/HdrHistogram.jar:/l/util/build perf.SearchPerfTest -dirImpl MMapDirectory -indexPath /l/indices/mikes-vector-test -analyzer StandardAnalyzerNoStopWords -taskSource /l/util/408330.tasks -numConcurrentQueries 24 -taskRepeatCount 500 -field body -tasksPerCat 5\
-searchConcurrency 32 -staticSeed -9982474 -seed 882824 -similarity BM25Similarity -commit multi -hiliteImpl FastVectorHighlighter -log /l/logs/atoa.base1.0 -topN 100 -vectorFile /lucenedata/enwiki/cohere-wikipedia-queries-768.vec -vectorDimension 768
94.4 s
153.5 actual sustained QPS; 31.2 CPU cores used
```
Note that `numConcurrentQueries` (inter-query concurrency) is 24 for both, but `base1` uses intra-query concurrency `searchConcurrency=32` and `base2` uses `searchConcurrency=16`. CPU is close to saturated for both, relative to the configured search concurrency (15.5 and 31.2 cores).
The top-level `perf stat -dd` output is also different, but I'm not sure how to read these tea leaves:
```
1,483,709.92 msec task-clock:u # 14.025 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
2,105,049 page-faults:u # 1.419 K/sec
5,996,100,816,144 cycles:u # 4.041 GHz (46.15%)
244,262,036,245 stalled-cycles-frontend:u # 4.07% frontend cycles idle (46.15%)
4,786,032,219,720 instructions:u # 0.80 insn per cycle
# 0.05 stalled cycles per insn (46.16%)
901,187,832,360 branches:u # 607.388 M/sec (46.15%)
14,052,833,023 branch-misses:u # 1.56% of all branches (46.16%)
2,292,118,658,161 L1-dcache-loads:u # 1.545 G/sec (46.17%)
13,687,637,586 L1-dcache-load-misses:u # 0.60% of all L1-dcache accesses (46.19%)
<not supported> LLC-loads:u
<not supported> LLC-load-misses:u
446,194,914,186 L1-icache-loads:u # 300.729 M/sec (46.20%)
1,082,631,173 L1-icache-load-misses:u # 0.24% of all L1-icache accesses (46.20%)
991,011,381 dTLB-loads:u # 667.928 K/sec (46.21%)
224,286,061 dTLB-load-misses:u # 22.63% of all dTLB cache accesses (46.19%)
271,842,758 iTLB-loads:u # 183.218 K/sec (46.18%)
9,283,690 iTLB-load-misses:u # 3.42% of all iTLB cache accesses (46.17%)
105.792545228 seconds time elapsed
1450.819877000 seconds user
24.921433000 seconds sys
```
vs
```
2,611,052.96 msec task-clock:u # 27.686 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
903,627 page-faults:u # 346.078 /sec
10,635,315,205,809 cycles:u # 4.073 GHz (46.18%)
279,271,705,890 stalled-cycles-frontend:u # 2.63% frontend cycles idle (46.17%)
5,261,455,740,117 instructions:u # 0.49 insn per cycle
# 0.05 stalled cycles per insn (46.17%)
1,012,867,205,937 branches:u # 387.915 M/sec (46.18%)
16,174,443,565 branch-misses:u # 1.60% of all branches (46.17%)
2,493,286,941,321 L1-dcache-loads:u # 954.897 M/sec (46.16%)
16,163,950,854 L1-dcache-load-misses:u # 0.65% of all L1-dcache accesses (46.16%)
<not supported> LLC-loads:u
<not supported> LLC-load-misses:u
484,332,516,968 L1-icache-loads:u # 185.493 M/sec (46.15%)
1,142,637,648 L1-icache-load-misses:u # 0.24% of all L1-icache accesses (46.16%)
974,936,275 dTLB-loads:u # 373.388 K/sec (46.15%)
237,337,214 dTLB-load-misses:u # 24.34% of all dTLB cache accesses (46.16%)
264,616,306 iTLB-loads:u # 101.345 K/sec (46.17%)
30,662,925 iTLB-load-misses:u # 11.59% of all iTLB cache accesses (46.18%)
94.308019991 seconds time elapsed
2568.112300000 seconds user
29.953207000 seconds sys
```
This is on a somewhat degenerate index that has many slices (the reader has 62 slices, from 74 segments) ... I think this is telling us that Lucene indeed spends more total CPU when there are more concurrent slices to work on, because we collect top-K hits for every slice and then merge them at the end. So net/net, individual queries can indeed go faster (lower long-pole latencies), but Lucene spends more total CPU to accomplish that. It means that in a cluster, each node should use intra-query concurrency during peacetime, but as the node approaches red-line capacity, it should (gradually?) turn it off to maximize throughput, maybe.
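To make that cost concrete, here is a toy model (plain Java, not Lucene's actual collector/`TopDocs` merge code) of per-slice top-K collection plus a final merge -- the per-slice heap work is what grows with the slice count:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Toy illustration of why more slices cost more total CPU: each slice
// maintains its own top-K heap, and the per-slice heaps are merged at the
// end, so total heap work grows with slice count even though the final
// answer is identical.
public class SliceTopK {
  // Collect the top-k largest scores from one slice with a min-heap.
  static PriorityQueue<Double> topK(double[] sliceScores, int k) {
    PriorityQueue<Double> heap = new PriorityQueue<>();  // min-heap
    for (double s : sliceScores) {
      if (heap.size() < k) {
        heap.add(s);
      } else if (s > heap.peek()) {
        heap.poll();
        heap.add(s);
      }
    }
    return heap;
  }

  // Merge the per-slice top-k heaps into a global top-k, best first.
  static List<Double> mergeTopK(List<PriorityQueue<Double>> perSlice, int k) {
    PriorityQueue<Double> merged = new PriorityQueue<>();
    for (PriorityQueue<Double> heap : perSlice) {
      for (double s : heap) {
        if (merged.size() < k) {
          merged.add(s);
        } else if (s > merged.peek()) {
          merged.poll();
          merged.add(s);
        }
      }
    }
    List<Double> out = new ArrayList<>(merged);
    out.sort(Collections.reverseOrder());
    return out;
  }
}
```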
Today `luceneutil` reports the effective QPS as 1.0 / wall-clock-time. But when using intra-query concurrency, that's a lie (it will be too high) since multiple cores are running at once. Let's change it to use the JMX bean to measure actual per-thread CPU cycles, carefully aggregate across all threads that run for the query to compute the aggregated CPU time spent on that query, and translate that to an effective QPS of 1.0 / aggregated-CPU-time.
This should be more accurate, and remove the false results we see when using intra-query concurrency, assuming the JMX bean is accurate. And it'd mean we could go back to running both inter- and intra-query concurrency at the same time, so our benchmarking runs would finish more quickly.
To implement this, I think we must make a wrapped `ThreadPoolExecutor` to pass to `IndexSearcher` that tracks which query each work unit (slice + query) corresponds to, and aggregates CPU time across the N threads per query accordingly...
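A minimal sketch of that wrapper, assuming each work unit can be tagged with a query id before submission (`PerQueryCpuThreadPool` and `TaggedTask` are hypothetical names, not luceneutil code):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: a ThreadPoolExecutor whose tasks carry a query id,
// so CPU time measured around each work unit can be charged back to the
// query that spawned it.
public class PerQueryCpuThreadPool extends ThreadPoolExecutor {
  private static final ThreadMXBean BEAN = ManagementFactory.getThreadMXBean();
  private final ConcurrentHashMap<String, Long> cpuNanosByQuery = new ConcurrentHashMap<>();
  private final ThreadLocal<Long> startNanos = new ThreadLocal<>();

  public PerQueryCpuThreadPool(int threads) {
    super(threads, threads, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
  }

  // Work units must be wrapped so we know which query they belong to.
  public record TaggedTask(String queryId, Runnable body) implements Runnable {
    public void run() {
      body.run();
    }
  }

  @Override
  protected void beforeExecute(Thread t, Runnable r) {
    // runs on the worker thread, just before r.run()
    startNanos.set(BEAN.getCurrentThreadCpuTime());
  }

  @Override
  protected void afterExecute(Runnable r, Throwable t) {
    long delta = BEAN.getCurrentThreadCpuTime() - startNanos.get();
    if (r instanceof TaggedTask task) {
      cpuNanosByQuery.merge(task.queryId(), delta, Long::sum);
    }
  }

  public long cpuNanos(String queryId) {
    return cpuNanosByQuery.getOrDefault(queryId, 0L);
  }
}
```

Note the tasks must be handed in via `execute()` with the `TaggedTask` intact; `submit()` would wrap them in a `FutureTask` and hide the tag. And the real problem discussed above remains: `IndexSearcher` creates its own `Callable`s, so tagging them would still require cooperation from `TaskExecutor`.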