mikemccand opened 3 months ago
It's the last boolean argument to the `LocalTaskSource` ctor, which groups all tasks by category together (running them all sequentially within each thread) when concurrency is enabled: https://github.com/mikemccand/luceneutil/commit/87a806341b008e959376ab0f1c8cfc0997a07d7a

Grouping was added here: https://github.com/mikemccand/luceneutil/pull/56 because without it the QPS measurement per task was meaningless
> Grouping was added here: #56 because without it the QPS measurement per task was meaningless
OH! Thanks for digging @msokolov ... hmm I think this is why we have the `staticSeed`: it would force the same subset of tasks to be selected for each run (if I'm understanding the original issue correctly).

QPS is such a tricky metric ...
> OH! Thanks for digging @msokolov ... hmm I think this is why we have the staticSeed: it would force same subset of tasks to be selected for each run (if I'm understanding the original issue correctly).
It's been a while, but I think the issue was that we want to measure wall clock time per task. This works OK when we run single-threaded, since the times in the log represent time spent working on that single task. However, when we run multiple tasks concurrently, the wall clock times represent work done on all the tasks we are running, which can be of mixed types. So this grouping change ensures that we run only tasks of the same type at the same time, enabling us to attribute the wall clock times in the logs to that task unambiguously.
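To make the grouping idea concrete, here's a minimal standalone sketch (not luceneutil's actual code; `Task` and `category` are hypothetical stand-ins) of reordering a task list so each category runs in isolation:

```java
import java.util.*;

// Hypothetical sketch: reorder tasks so all tasks of one category run
// back-to-back. When worker threads consume this list in order, at any
// moment they are (mostly) all working on the same category, so wall clock
// times in the logs can be attributed to that category unambiguously.
public class GroupTasks {
  record Task(String category, String query) {}

  static List<Task> groupByCategory(List<Task> tasks) {
    Map<String, List<Task>> byCat = new LinkedHashMap<>();
    for (Task t : tasks) {
      byCat.computeIfAbsent(t.category(), k -> new ArrayList<>()).add(t);
    }
    List<Task> grouped = new ArrayList<>();
    byCat.values().forEach(grouped::addAll);
    return grouped;
  }

  public static void main(String[] args) {
    List<Task> mixed = List.of(
        new Task("HighTerm", "q1"), new Task("HighPhrase", "q2"),
        new Task("HighTerm", "q3"), new Task("HighPhrase", "q4"));
    System.out.println(groupByCategory(mixed));  // HighTerm x2, then HighPhrase x2
  }
}
```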
I guess we could switch from measuring wall clock time to measuring CPU time using Java's JMX API. If we did that we wouldn't need to group tasks this way, and would possibly get a more realistic task mix in order to simulate cross-task effects in a mixed query load.
> I guess we could switch from measuring wall clock time to measuring CPU time using Java's JMX API. If we did that we wouldn't need to group tasks this way, and would possibly get a more realistic task mix in order to simulate cross-task effects in a mixed query load.
+1 to explore this! This would be better than elapsed wall clock time ... e.g. it would've prevented the whole confusion around lower reported QPS when multiple queries run concurrently (intra- and inter-query concurrency) ...
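For reference, a minimal self-contained sketch of measuring per-task CPU time on the current thread via the JMX `ThreadMXBean` (the counting loop is just a stand-in for one task's work):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Minimal sketch: time a task by CPU time rather than wall clock time, so
// concurrent work on other tasks in the same process doesn't pollute the
// measurement.
public class CpuTimeDemo {
  public static void main(String[] args) {
    ThreadMXBean bean = ManagementFactory.getThreadMXBean();
    if (bean.isThreadCpuTimeSupported()) {
      bean.setThreadCpuTimeEnabled(true);
    }
    long startNs = bean.getCurrentThreadCpuTime();
    long sum = 0;
    for (int i = 0; i < 50_000_000; i++) {  // stand-in for one task's work
      sum += i;
    }
    long cpuNs = bean.getCurrentThreadCpuTime() - startNs;
    System.out.println("task cpu ms=" + cpuNs / 1_000_000 + " (sum=" + sum + ")");
  }
}
```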
The thing that caused me to dig into this particular issue was watching the java process running `SearchPerfTest` with six worker threads spin at 100% CPU for quite a while (10 seconds maybe) before then jumping up to the (correct, saturated) 600% CPU ... when I pulled the thread stacks I could see that all threads at the start were working on the same single-threaded `Browse` facet counting tasks. So I was baffled why all threads were working on just that one task category for all the initial tasks ...
Hmm OK I tried reverting that change, so we no longer group by task category:
```diff
diff --git a/src/main/perf/SearchPerfTest.java b/src/main/perf/SearchPerfTest.java
index e77ceec..9983092 100755
--- a/src/main/perf/SearchPerfTest.java
+++ b/src/main/perf/SearchPerfTest.java
@@ -560,7 +560,7 @@ public class SearchPerfTest {
       final int taskRepeatCount = args.getInt("-taskRepeatCount");
       final int numTaskPerCat = args.getInt("-tasksPerCat");
       tasks = new LocalTaskSource(indexState, tasksFile, taskParserFactory.getTaskParser(), staticRandom, random,
-                                  numTaskPerCat, taskRepeatCount, doPKLookup, searchConcurrency != 0);
+                                  numTaskPerCat, taskRepeatCount, doPKLookup, false);
       System.out.println("Task repeat count " + taskRepeatCount);
       System.out.println("Tasks file " + tasksFile);
       System.out.println("Num task per cat " + numTaskPerCat);
```
Then I ran this `perf.py` (6 worker threads for intra-query concurrency, only one query in flight at once, picking 5 examples of each category and repeating each 20 times, i.e. 5 distinct HighTerm queries, each of which will have very different QPS):
```python
import competition

sourceData = competition.sourceData('wikimediumall')
comp = competition.Competition(taskRepeatCount=20, taskCountPerCat=5, verifyCounts=False)
comp.addNotTaskPattern('BrowseDateSSDVFacets')
# comp.addTaskPattern('BrowseDayOfYearTaxoFacets')
# comp.addTaskPattern('OrHighHigh')
comp.addTaskPattern('HighTerm')
comp.addTaskPattern('HighPhrase')
checkout = 'trunk'
index = comp.newIndex(checkout, sourceData, numThreads=12, addDVFields=True,
                      grouping=False, useCMS=True,
                      analyzer='StandardAnalyzerNoStopWords',
                      facets=(('taxonomy:Date', 'Date'),
                              ('taxonomy:Month', 'Month'),
                              ('taxonomy:DayOfYear', 'DayOfYear'),
                              ('taxonomy:RandomLabel.taxonomy', 'RandomLabel'),
                              ('sortedset:Month', 'Month'),
                              ('sortedset:DayOfYear', 'DayOfYear'),
                              ('sortedset:RandomLabel.sortedset', 'RandomLabel')))
comp.competitor('base1', checkout, index=index, pk=False, searchConcurrency=6, numThreads=1)
comp.competitor('base2', checkout, index=index, searchConcurrency=6, pk=False, numThreads=1)
comp.benchmark('atoa')
```
And got these results after 10 iterations:
```
                 Task    QPS base1      StdDev    QPS base2      StdDev              Pct diff    p-value
           HighPhrase       131.04      (4.1%)       130.13      (3.9%)   -0.7% ( -8% -   7%)      0.697
 HighTermTitleBDVSort         6.57      (5.2%)         6.60      (3.7%)    0.5% ( -7% -   9%)      0.817
             HighTerm       188.17      (6.2%)       190.00      (6.6%)    1.0% (-11% -  14%)      0.735
    HighTermMonthSort       898.48      (5.4%)       908.32      (2.4%)    1.1% ( -6% -   9%)      0.556
HighTermDayOfYearSort       242.70      (4.6%)       246.32      (5.2%)    1.5% ( -7% -  11%)      0.495
    HighTermTitleSort        43.92      (6.5%)        44.73      (9.4%)    1.8% (-13% -  18%)      0.612
```
I ran it again and got different QPS for each category because `luceneutil` by default will pick a different subset of tasks for each category, unless you pass a fixed `staticSeed` (see the sketch after this table):
```
                 Task    QPS base1      StdDev    QPS base2      StdDev              Pct diff    p-value
 HighTermTitleBDVSort         5.31      (9.4%)         5.10      (6.7%)   -4.0% (-18% -  13%)      0.280
             HighTerm       202.40      (4.5%)       194.93      (5.7%)   -3.7% (-13% -   6%)      0.110
    HighTermTitleSort        35.86      (6.2%)        35.76      (6.1%)   -0.3% (-11% -  12%)      0.917
HighTermDayOfYearSort       251.88      (3.0%)       251.36      (3.6%)   -0.2% ( -6% -   6%)      0.889
    HighTermMonthSort       920.78      (2.4%)       919.35      (1.3%)   -0.2% ( -3% -   3%)      0.857
           HighPhrase       166.13      (4.5%)       168.74      (2.9%)    1.6% ( -5% -   9%)      0.352
```
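For illustration, here's roughly how a fixed seed pins the per-category subset (a standalone sketch, not luceneutil's actual selection code; the task strings are hypothetical):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Sketch: with a fixed seed, the shuffled order (and thus the selected
// subset of tasks per category) is identical on every run; with a random
// seed, each run benchmarks a different subset, so per-category QPS is
// not comparable across runs.
public class SeededSubset {
  public static void main(String[] args) {
    List<String> highTermTasks =
        new ArrayList<>(List.of("q1", "q2", "q3", "q4", "q5", "q6"));
    long staticSeed = 42;  // fixed seed => same subset every run
    int tasksPerCat = 3;
    Collections.shuffle(highTermTasks, new Random(staticSeed));
    System.out.println(highTermTasks.subList(0, tasksPerCat));
  }
}
```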
> So this grouping change ensures that we run only tasks of the same type at the same time, enabling us to attribute the wall clock times in the logs to that task unambiguously.
Maybe this is also fixed by forcing `numThreads=1` (number of in-flight queries at once) whenever `searchConcurrency > 0`? I.e. the issue was that the N worker threads were executing work units from different task categories?
OK, I had wondered why the nightly benchmarks didn't show the (trappy) "effective QPS" slowdown when we enabled `searchConcurrency=1`, and it turns out ... it did! (It just had not successfully run since we enabled single-threaded intra-query concurrent search): https://home.apache.org/~mikemccand/lucenebench/Term.html
I'll change the nightly bench to use higher intra-query concurrency, with only one query in flight at once, so it doesn't take forever to run. But the reported effective QPS will be wonky (too high).
I think we really need to switch to true per-thread CPU time aggregation from the JMX bean. I'll open a spinoff for that ...
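Something like this (a sketch only, assuming we can collect the worker thread ids; not an actual luceneutil patch):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Sketch: sum CPU time across worker threads via the JMX ThreadMXBean,
// sampled e.g. before and after a task category runs; the delta is the
// CPU time actually spent on that category, regardless of concurrency.
public class PoolCpuTime {
  static long totalCpuNs(ThreadMXBean bean, long[] workerThreadIds) {
    long total = 0;
    for (long id : workerThreadIds) {
      long ns = bean.getThreadCpuTime(id);  // -1 if unsupported or thread gone
      if (ns > 0) {
        total += ns;
      }
    }
    return total;
  }

  public static void main(String[] args) {
    ThreadMXBean bean = ManagementFactory.getThreadMXBean();
    if (bean.isThreadCpuTimeSupported()) {
      bean.setThreadCpuTimeEnabled(true);
    }
    long[] ids = {Thread.currentThread().getId()};  // stand-in for worker ids
    System.out.println("total cpu ns=" + totalCpuNs(bean, ids));
  }
}
```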
Spinoff from https://github.com/apache/lucene/pull/13472 where we thought we saw massive gains from an improvement to how Lucene implements intra-query concurrency, but at least some of the gains were due to `luceneutil` trappiness. Some ideas:

- Rename `Competitor`'s `numThreads` to `numConcurrentQueries` (or maybe `numQueriesInFlight`?) to make it clear this controls how many concurrent queries are sent to `IndexSearcher`.
- If `searchConcurrency > 1`, maybe we should not allow `numThreads > 1`? This is so trappy because the reported QPS gain from running concurrently is trappily divided by `numThreads` (see details from this comment).
- Maybe measure a true red-line QPS in `luceneutil`? This would drive enough threads (concurrent queries or worker thread pool) to saturate CPU and/or IO. This is a tricky change tho ... we should open a separate spinoff. E.g. the precise set of tasks matters, their mix and order matters, reproducible arrival (Poisson-like) times matter, etc.
- Or the opposite: make sure `luceneutil` runs enough concurrent things (queries or work units) such that the test runs as efficiently as possible, but does not bump up against false contention on the box (i.e. does NOT saturate CPU or IO) that would mess up the results.
- Maybe `luceneutil` should not even report QPS but rather latency? It's confusing because it is an "effective QPS" (1/avg-of-median-query-wall-clock-time-after-discarding-warmup-and-outliers), not an actual "red-line capacity" (see the sketch below).

Also, strangely, if `searchConcurrency > 1`, we oddly group the tasks so that all tasks within a single cat are done at the same time (across all threads). This is weird, e.g. the `Browse*` tasks are single-threaded even if you ask for concurrency, and so `top` will show only 100% CPU used if you are running with six threads ...
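To make the "effective QPS" trap concrete, a tiny worked sketch with made-up median latencies:

```java
import java.util.Arrays;

// Sketch of the reported "effective QPS": roughly the inverse of the average
// per-task median wall clock latency (after discarding warmup/outliers).
// With intra-query concurrency the wall clock latency shrinks, so this
// number inflates even though the box does the same total work; it is
// NOT a red-line capacity measurement. The latencies below are made up.
public class EffectiveQps {
  public static void main(String[] args) {
    double[] medianLatencySec = {0.004, 0.005, 0.006};  // hypothetical medians
    double avg = Arrays.stream(medianLatencySec).average().orElseThrow();
    System.out.printf("effective QPS = %.1f%n", 1.0 / avg);  // ~200 QPS
  }
}
```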