mikemccand opened 3 months ago
I'm trying to find a simple/clean way to do this accounting and it does not look simple. The `Executor` that's passed to `IndexSearcher` gets very opaque work units (`Callable`s) that cannot easily be associated back to the original `Query` + coordinator thread. I thought if we could open up `TaskExecutor`, maybe we could make a wrapper for it which tracks which thread is calling `.invokeAll` (which would be the same thread that called `IndexSearcher.search`), and we could track the CPU accounting this way. But this is an invasive change to Lucene's sources ...
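As a rough illustration of the wrapper idea (my own sketch, not Lucene or luceneutil code; `CpuTrackingExecutor` is a hypothetical name), an `Executor` wrapper can at least sum the CPU time its work units burn, via `ThreadMXBean`:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: an Executor wrapper that sums the CPU time each
// work unit burns on its worker thread.  The hard part -- attributing a
// Callable back to its originating query -- is NOT solved here.
public class CpuTrackingExecutor implements Executor {
  private static final ThreadMXBean BEAN = ManagementFactory.getThreadMXBean();
  private final Executor delegate;
  private final AtomicLong totalCpuNanos = new AtomicLong();

  public CpuTrackingExecutor(Executor delegate) {
    this.delegate = delegate;
  }

  @Override
  public void execute(Runnable task) {
    delegate.execute(() -> {
      // CPU nanos consumed so far by this worker thread
      long start = BEAN.getCurrentThreadCpuTime();
      try {
        task.run();
      } finally {
        totalCpuNanos.addAndGet(BEAN.getCurrentThreadCpuTime() - start);
      }
    });
  }

  public long totalCpuNanos() {
    return totalCpuNanos.get();
  }
}
```

This measures total CPU across all work units, but as noted above, without hooking `.invokeAll` it still cannot split that time out per query.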
We might also make a `CPUTrackingQuery` wrapper that wraps any `Query`, reaches down through `Weight` and `ScorerSupplier`, and then tries to measure start/end times (hmm, not sure how "end" could be measured -- there is no `.finish` called per leaf `Scorer`?). And anyway this feels too hacky/heavyweight.
Maybe @jpountz has some simple idea :)
Would it really be so terrible to add some accounting to Lucene's executor implementation? I thought about using the `Collector` API (which does have a start, when the `LeafCollector` gets created, and then a `finish()` later), but this doesn't capture all the work we do.
I understand why you are trying to measure CPU time, but it would have other issues. E.g. it would not report issues if `IndexSearcher#slices` or the merge policy somehow create segments that are not good for search parallelism. I am tempted to keep things as they are and report a QPS which is essentially a QPS under a concurrency of 1 from the user's perspective. After all, it's not that uncommon for user-facing search systems to be overprovisioned enough that you don't have many concurrent searches running at the same time. So it would still measure something meaningful.
Maybe we should rather look into running search tasks twice: once with an executor of size 1, which should be an ok proxy for throughput, and once with an executor of size N>1 (e.g. N=8 like you suggested) to measure Lucene's ability to take advantage of "free" CPUs to improve latency.
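The two-pass idea can be sketched like so (toy CPU-bound tasks, nothing from luceneutil; `TwoPassBench` is a made-up name):

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.IntStream;

// Hypothetical sketch of the two-pass idea: time the same batch of tasks
// once on a single-threaded executor (a throughput proxy) and once on a
// pool of N threads (a proxy for using "free" CPUs to improve latency).
public class TwoPassBench {
  static long runAll(ExecutorService pool, List<Callable<Long>> tasks) throws Exception {
    long start = System.nanoTime();
    for (Future<Long> f : pool.invokeAll(tasks)) {
      f.get();  // propagate any failure
    }
    return System.nanoTime() - start;
  }

  public static void main(String[] args) throws Exception {
    // synthetic CPU-bound "search tasks"
    List<Callable<Long>> tasks = IntStream.range(0, 32)
        .mapToObj(i -> (Callable<Long>) () -> {
          long x = 0;
          for (int j = 0; j < 5_000_000; j++) x += j;
          return x;
        })
        .toList();

    ExecutorService single = Executors.newFixedThreadPool(1);
    ExecutorService pooled = Executors.newFixedThreadPool(8);  // N=8 as suggested
    long t1 = runAll(single, tasks);
    long t8 = runAll(pooled, tasks);
    System.out.printf("size-1: %.1f ms, size-8: %.1f ms, speedup %.2fx%n",
        t1 / 1e6, t8 / 1e6, (double) t1 / t8);
    single.shutdown();
    pooled.shutdown();
  }
}
```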
Yeah +1 to measure the speedup we see due to concurrency "working" well too.
But using a single thread as a proxy for throughput wouldn't measure cases where Lucene can save CPU through better cross-segment comparisons? Or perhaps those improvements would come through (ish?) by measuring effective QPS using intra-query concurrency? But not always: if the long-pole slice for a segment still takes X CPU, but thanks to an innovation the other slices were able to terminate sooner / skip better, the QPS would not change even though total CPU improved.
Maybe the best overall solution for measuring "Lucene CPU improvements" is to build a true red-line test... why try to build tricky proxy metrics heh :)
I've made some progress here ...
I added an "actual QPS" measure, recorded by `SearchPerfTest` and reported for each iteration of each competitor. It discards warmup time (currently hardwired to 5 seconds), and also discards wind-down time (from when the first thread notices there are no more tasks in the queue). Within that window it measures the actual QPS achieved, and it also taps into JMX metrics over the same period to report actual CPU utilization.
It looks like this:
```
Search:
iter 0
base2:
log: /l/logs/atoa.base2.0 + stdout
run: /usr/bin/perf stat -dd /usr/lib/jvm/java-21-openjdk/bin/java --add-modules jdk.incubator.vector -Xmx32g -Xms32g -server -XX:+UseParallelGC -Djava.io.tmpdir=/l/tmp -XX:StartFlightRecording=dumponexit=true,maxsize=250M,settings=/l/util/src/python/profi\
ling.jfc,filename=/l/logs/bench-search-atoa-base2-0.jfr -XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints -classpath /l/trunk/lucene/core/build/libs/lucene-core-10.0.0-SNAPSHOT.jar:/l/trunk/lucene/sandbox/build/classes/java/main:/l/trunk/lucene/misc/build/\
classes/java/main:/l/trunk/lucene/facet/build/classes/java/main:/l/trunk/lucene/analysis/common/build/classes/java/main:/l/trunk/lucene/analysis/icu/build/classes/java/main:/l/trunk/lucene/queryparser/build/classes/java/main:/l/trunk/lucene/grouping/build/class\
es/java/main:/l/trunk/lucene/suggest/build/classes/java/main:/l/trunk/lucene/highlighter/build/classes/java/main:/l/trunk/lucene/codecs/build/classes/java/main:/l/trunk/lucene/queries/build/classes/java/main:/l/trunk/lucene/join/build/classes/java/main:/l/util/\
lib/HdrHistogram.jar:/l/util/build perf.SearchPerfTest -dirImpl MMapDirectory -indexPath /l/indices/mikes-vector-test -analyzer StandardAnalyzerNoStopWords -taskSource /l/util/1876.tasks -numConcurrentQueries 12 -taskRepeatCount 500 -field body -tasksPerCat 5 -\
searchConcurrency 16 -staticSeed -596438 -seed -5161207 -similarity BM25Similarity -commit multi -hiliteImpl FastVectorHighlighter -log /l/logs/atoa.base2.0 -topN 100 -vectorFile /lucenedata/enwiki/cohere-wikipedia-queries-768.vec -vectorDimension 768
63.5 s
254.5 QPS; 15.8 CPU cores
base1:
log: /l/logs/atoa.base1.0 + stdout
run: /usr/bin/perf stat -dd /usr/lib/jvm/java-21-openjdk/bin/java --add-modules jdk.incubator.vector -Xmx32g -Xms32g -server -XX:+UseParallelGC -Djava.io.tmpdir=/l/tmp -XX:StartFlightRecording=dumponexit=true,maxsize=250M,settings=/l/util/src/python/profi\
ling.jfc,filename=/l/logs/bench-search-atoa-base1-0.jfr -XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints -classpath /l/trunk/lucene/core/build/libs/lucene-core-10.0.0-SNAPSHOT.jar:/l/trunk/lucene/sandbox/build/classes/java/main:/l/trunk/lucene/misc/build/\
classes/java/main:/l/trunk/lucene/facet/build/classes/java/main:/l/trunk/lucene/analysis/common/build/classes/java/main:/l/trunk/lucene/analysis/icu/build/classes/java/main:/l/trunk/lucene/queryparser/build/classes/java/main:/l/trunk/lucene/grouping/build/class\
es/java/main:/l/trunk/lucene/suggest/build/classes/java/main:/l/trunk/lucene/highlighter/build/classes/java/main:/l/trunk/lucene/codecs/build/classes/java/main:/l/trunk/lucene/queries/build/classes/java/main:/l/trunk/lucene/join/build/classes/java/main:/l/util/\
lib/HdrHistogram.jar:/l/util/build perf.SearchPerfTest -dirImpl MMapDirectory -indexPath /l/indices/mikes-vector-test -analyzer StandardAnalyzerNoStopWords -taskSource /l/util/1876.tasks -numConcurrentQueries 12 -taskRepeatCount 500 -field body -tasksPerCat 5 -\
searchConcurrency 32 -staticSeed -596438 -seed -5161207 -similarity BM25Similarity -commit multi -hiliteImpl FastVectorHighlighter -log /l/logs/atoa.base1.0 -topN 100 -vectorFile /lucenedata/enwiki/cohere-wikipedia-queries-768.vec -vectorDimension 768
57.7 s
286.6 QPS; 30.0 CPU cores
```
In this case, `base2` used `searchConcurrency=16` and `base1` used `searchConcurrency=32` (on a 32-core Raptor Lake box). It's curious that doubling the search concurrency yielded only a smallish improvement in QPS, and that we were not actually able to saturate all 32 cores (just 30). I think this is showing that hyperthreaded cores are not as good as "real" cores :)
Note that these numbers are not necessarily a red-line QPS measurement, since it is up to the test runner to configure enough concurrency to saturate CPU cores, IO, or some other bottleneck. It is simply the actual QPS the test achieved ... when testing for query latency (the per-task "effective QPS" we normally report), one should not run near red-line if they want an accurate approximation of the total CPU cost of each query.
Next step I'll try to add a red-line QPS to nightly benchmarks ...
I ran this same `perf.py` on beast3, which has many cores (64 real, 128 with hyperthreading). Curiously, QPS also doesn't improve much when I double `searchConcurrency` from 16 to 32:
```
base2:
log: /l/logs/atoa.base2.0 + stdout
run: /usr/bin/perf stat -dd /usr/lib/jvm/java-21-openjdk/bin/java --add-modules jdk.incubator.vector -Xmx32g -Xms32g -server -XX:+UseParallelGC -Djava.io.tmpdir=/l/tmp -XX:StartFlightRecording=dumponexit=true,maxsize=250M,settings=/l/util/src/python/profi\
ling.jfc,filename=/l/logs/bench-search-atoa-base2-0.jfr -XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints -classpath /l/trunk/lucene/core/build/libs/lucene-core-10.0.0-SNAPSHOT.jar:/l/trunk/lucene/sandbox/build/classes/java/main:/l/trunk/lucene/misc/build/\
classes/java/main:/l/trunk/lucene/facet/build/classes/java/main:/l/trunk/lucene/analysis/common/build/classes/java/main:/l/trunk/lucene/analysis/icu/build/classes/java/main:/l/trunk/lucene/queryparser/build/classes/java/main:/l/trunk/lucene/grouping/build/class\
es/java/main:/l/trunk/lucene/suggest/build/classes/java/main:/l/trunk/lucene/highlighter/build/classes/java/main:/l/trunk/lucene/codecs/build/classes/java/main:/l/trunk/lucene/queries/build/classes/java/main:/l/trunk/lucene/join/build/classes/java/main:/l/util/\
lib/HdrHistogram.jar:/l/util/build perf.SearchPerfTest -dirImpl MMapDirectory -indexPath /l/indices/mikes-vector-test -analyzer StandardAnalyzerNoStopWords -taskSource /l/util/408330.tasks -numConcurrentQueries 24 -taskRepeatCount 500 -field body -tasksPerCat 5\
-searchConcurrency 16 -staticSeed -9982474 -seed 882824 -similarity BM25Similarity -commit multi -hiliteImpl FastVectorHighlighter -log /l/logs/atoa.base2.0 -topN 100 -vectorFile /lucenedata/enwiki/cohere-wikipedia-queries-768.vec -vectorDimension 768
105.9 s
135.6 actual sustained QPS; 15.5 CPU cores used
base1:
log: /l/logs/atoa.base1.0 + stdout
run: /usr/bin/perf stat -dd /usr/lib/jvm/java-21-openjdk/bin/java --add-modules jdk.incubator.vector -Xmx32g -Xms32g -server -XX:+UseParallelGC -Djava.io.tmpdir=/l/tmp -XX:StartFlightRecording=dumponexit=true,maxsize=250M,settings=/l/util/src/python/profi\
ling.jfc,filename=/l/logs/bench-search-atoa-base1-0.jfr -XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints -classpath /l/trunk/lucene/core/build/libs/lucene-core-10.0.0-SNAPSHOT.jar:/l/trunk/lucene/sandbox/build/classes/java/main:/l/trunk/lucene/misc/build/\
classes/java/main:/l/trunk/lucene/facet/build/classes/java/main:/l/trunk/lucene/analysis/common/build/classes/java/main:/l/trunk/lucene/analysis/icu/build/classes/java/main:/l/trunk/lucene/queryparser/build/classes/java/main:/l/trunk/lucene/grouping/build/class\
es/java/main:/l/trunk/lucene/suggest/build/classes/java/main:/l/trunk/lucene/highlighter/build/classes/java/main:/l/trunk/lucene/codecs/build/classes/java/main:/l/trunk/lucene/queries/build/classes/java/main:/l/trunk/lucene/join/build/classes/java/main:/l/util/\
lib/HdrHistogram.jar:/l/util/build perf.SearchPerfTest -dirImpl MMapDirectory -indexPath /l/indices/mikes-vector-test -analyzer StandardAnalyzerNoStopWords -taskSource /l/util/408330.tasks -numConcurrentQueries 24 -taskRepeatCount 500 -field body -tasksPerCat 5\
-searchConcurrency 32 -staticSeed -9982474 -seed 882824 -similarity BM25Similarity -commit multi -hiliteImpl FastVectorHighlighter -log /l/logs/atoa.base1.0 -topN 100 -vectorFile /lucenedata/enwiki/cohere-wikipedia-queries-768.vec -vectorDimension 768
94.4 s
153.5 actual sustained QPS; 31.2 CPU cores used
```
Note that `numConcurrentQueries` (inter-query concurrency) is 24 for both, but `base1` uses intra-query concurrency `searchConcurrency=32` and `base2` uses `searchConcurrency=16`. CPU is close to saturated for both, relative to the configured search concurrency (15.5 and 31.2 cores).
The top-level `perf stat -dd` output is also different, but I'm not sure how to read these tea leaves:
```
1,483,709.92 msec task-clock:u # 14.025 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
2,105,049 page-faults:u # 1.419 K/sec
5,996,100,816,144 cycles:u # 4.041 GHz (46.15%)
244,262,036,245 stalled-cycles-frontend:u # 4.07% frontend cycles idle (46.15%)
4,786,032,219,720 instructions:u # 0.80 insn per cycle
# 0.05 stalled cycles per insn (46.16%)
901,187,832,360 branches:u # 607.388 M/sec (46.15%)
14,052,833,023 branch-misses:u # 1.56% of all branches (46.16%)
2,292,118,658,161 L1-dcache-loads:u # 1.545 G/sec (46.17%)
13,687,637,586 L1-dcache-load-misses:u # 0.60% of all L1-dcache accesses (46.19%)
<not supported> LLC-loads:u
<not supported> LLC-load-misses:u
446,194,914,186 L1-icache-loads:u # 300.729 M/sec (46.20%)
1,082,631,173 L1-icache-load-misses:u # 0.24% of all L1-icache accesses (46.20%)
991,011,381 dTLB-loads:u # 667.928 K/sec (46.21%)
224,286,061 dTLB-load-misses:u # 22.63% of all dTLB cache accesses (46.19%)
271,842,758 iTLB-loads:u # 183.218 K/sec (46.18%)
9,283,690 iTLB-load-misses:u # 3.42% of all iTLB cache accesses (46.17%)
105.792545228 seconds time elapsed
1450.819877000 seconds user
24.921433000 seconds sys
```
vs
```
2,611,052.96 msec task-clock:u # 27.686 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
903,627 page-faults:u # 346.078 /sec
10,635,315,205,809 cycles:u # 4.073 GHz (46.18%)
279,271,705,890 stalled-cycles-frontend:u # 2.63% frontend cycles idle (46.17%)
5,261,455,740,117 instructions:u # 0.49 insn per cycle
# 0.05 stalled cycles per insn (46.17%)
1,012,867,205,937 branches:u # 387.915 M/sec (46.18%)
16,174,443,565 branch-misses:u # 1.60% of all branches (46.17%)
2,493,286,941,321 L1-dcache-loads:u # 954.897 M/sec (46.16%)
16,163,950,854 L1-dcache-load-misses:u # 0.65% of all L1-dcache accesses (46.16%)
<not supported> LLC-loads:u
<not supported> LLC-load-misses:u
484,332,516,968 L1-icache-loads:u # 185.493 M/sec (46.15%)
1,142,637,648 L1-icache-load-misses:u # 0.24% of all L1-icache accesses (46.16%)
974,936,275 dTLB-loads:u # 373.388 K/sec (46.15%)
237,337,214 dTLB-load-misses:u # 24.34% of all dTLB cache accesses (46.16%)
264,616,306 iTLB-loads:u # 101.345 K/sec (46.17%)
30,662,925 iTLB-load-misses:u # 11.59% of all iTLB cache accesses (46.18%)
94.308019991 seconds time elapsed
2568.112300000 seconds user
29.953207000 seconds sys
```
This is on a somewhat degenerate index that has many slices (the reader has 62 slices, from 74 segments) ... I think this is telling us that Lucene indeed spends more total CPU when there are more concurrent slices to work on, because we collect top-K hits for every slice and then merge them at the end. So net/net, individual queries can indeed go faster (lower long-pole latencies), but Lucene spends more total CPU to accomplish that. It means that in a cluster, each node should use intra-query concurrency during peacetime, but as the node approaches red-line capacity, it should (gradually?) turn it off to maximize throughput, maybe.
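To make that cost concrete, here is a toy model (plain Java, not Lucene's actual collector/`TopDocs` merge code) of per-slice top-K collection plus a final merge -- the per-slice heap work is what grows with the slice count:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Toy illustration of why more slices cost more total CPU: each slice
// maintains its own top-K heap, and the per-slice heaps are merged at the
// end, so total heap work grows with slice count even though the final
// answer is identical.
public class SliceTopK {
  // Collect the top-k largest scores from one slice with a min-heap.
  static PriorityQueue<Double> topK(double[] sliceScores, int k) {
    PriorityQueue<Double> heap = new PriorityQueue<>();  // min-heap
    for (double s : sliceScores) {
      if (heap.size() < k) {
        heap.add(s);
      } else if (s > heap.peek()) {
        heap.poll();
        heap.add(s);
      }
    }
    return heap;
  }

  // Merge the per-slice top-k heaps into a global top-k, best first.
  static List<Double> mergeTopK(List<PriorityQueue<Double>> perSlice, int k) {
    PriorityQueue<Double> merged = new PriorityQueue<>();
    for (PriorityQueue<Double> heap : perSlice) {
      for (double s : heap) {
        if (merged.size() < k) {
          merged.add(s);
        } else if (s > merged.peek()) {
          merged.poll();
          merged.add(s);
        }
      }
    }
    List<Double> out = new ArrayList<>(merged);
    out.sort(Collections.reverseOrder());
    return out;
  }
}
```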
Today `luceneutil` reports the effective QPS as 1.0 / wall-clock-time. But when using intra-query concurrency, that's a lie (it will be too high) since multiple cores are running at once. Let's change it to use the JMX bean to measure actual per-thread CPU cycles, carefully aggregate across all threads that run for the query to compute the aggregated CPU time spent on that query, and translate that to an effective QPS of 1.0 / aggregated-CPU-time.
This should be more accurate, and remove the false results we see when using intra-query concurrency, assuming the JMX bean is accurate. And it'd mean we could go back to running both inter- and intra-query concurrency at the same time, so our benchmarking runs would finish more quickly.
To implement this, I think we must make a wrapped `ThreadPoolExecutor` to pass to `IndexSearcher` that tracks which query each work unit (slice + query) corresponds to, and aggregates CPU time across the N threads per query accordingly...
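A minimal sketch of that wrapper, assuming each work unit can be tagged with a query id before submission (`PerQueryCpuThreadPool` and `TaggedTask` are hypothetical names, not luceneutil code):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: a ThreadPoolExecutor whose tasks carry a query id,
// so CPU time measured around each work unit can be charged back to the
// query that spawned it.
public class PerQueryCpuThreadPool extends ThreadPoolExecutor {
  private static final ThreadMXBean BEAN = ManagementFactory.getThreadMXBean();
  private final ConcurrentHashMap<String, Long> cpuNanosByQuery = new ConcurrentHashMap<>();
  private final ThreadLocal<Long> startNanos = new ThreadLocal<>();

  public PerQueryCpuThreadPool(int threads) {
    super(threads, threads, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
  }

  // Work units must be wrapped so we know which query they belong to.
  public record TaggedTask(String queryId, Runnable body) implements Runnable {
    public void run() {
      body.run();
    }
  }

  @Override
  protected void beforeExecute(Thread t, Runnable r) {
    // runs on the worker thread, just before r.run()
    startNanos.set(BEAN.getCurrentThreadCpuTime());
  }

  @Override
  protected void afterExecute(Runnable r, Throwable t) {
    long delta = BEAN.getCurrentThreadCpuTime() - startNanos.get();
    if (r instanceof TaggedTask task) {
      cpuNanosByQuery.merge(task.queryId(), delta, Long::sum);
    }
  }

  public long cpuNanos(String queryId) {
    return cpuNanosByQuery.getOrDefault(queryId, 0L);
  }
}
```

Note the tasks must be handed in via `execute()` with the `TaggedTask` intact; `submit()` would wrap them in a `FutureTask` and hide the tag. And the real problem discussed above remains: `IndexSearcher` creates its own `Callable`s, so tagging them would still require cooperation from `TaskExecutor`.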