mikemccand opened 3 months ago
It's the last boolean argument to the `LocalTaskSource` ctor, which groups all tasks by category together (running them all sequentially within each thread) when concurrency is enabled: https://github.com/mikemccand/luceneutil/commit/87a806341b008e959376ab0f1c8cfc0997a07d7a

Grouping was added here: https://github.com/mikemccand/luceneutil/pull/56 because without it the QPS measurement per task was meaningless
> Grouping was added here: #56 because without it the QPS measurement per task was meaningless
OH! Thanks for digging @msokolov ... hmm I think this is why we have the `staticSeed`: it would force the same subset of tasks to be selected for each run (if I'm understanding the original issue correctly).

QPS is such a tricky metric ...
> OH! Thanks for digging @msokolov ... hmm I think this is why we have the staticSeed: it would force same subset of tasks to be selected for each run (if I'm understanding the original issue correctly).
It's been a while, but I think the issue was that we want to measure wall clock time per task. This works OK when we run single-threaded, since the times in the log represent time spent working on that single task. However, when we run multiple tasks concurrently, the wall clock times represent work done on all the tasks we are running, which can be of mixed types. So this grouping change ensures that we run only tasks of the same type at the same time, enabling us to attribute the wall clock times in the logs to that task unambiguously.
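To make the grouping idea concrete, here's a minimal standalone sketch (not luceneutil's actual code; `Task` and `category` are hypothetical stand-ins) of reordering a task list so each category runs in isolation:

```java
import java.util.*;

// Hypothetical sketch: reorder tasks so all tasks of one category run
// back-to-back. When worker threads consume this list in order, at any
// moment they are (mostly) all working on the same category, so wall clock
// times in the logs can be attributed to that category unambiguously.
public class GroupTasks {
  record Task(String category, String query) {}

  static List<Task> groupByCategory(List<Task> tasks) {
    Map<String, List<Task>> byCat = new LinkedHashMap<>();
    for (Task t : tasks) {
      byCat.computeIfAbsent(t.category(), k -> new ArrayList<>()).add(t);
    }
    List<Task> grouped = new ArrayList<>();
    byCat.values().forEach(grouped::addAll);
    return grouped;
  }

  public static void main(String[] args) {
    List<Task> mixed = List.of(
        new Task("HighTerm", "q1"), new Task("HighPhrase", "q2"),
        new Task("HighTerm", "q3"), new Task("HighPhrase", "q4"));
    System.out.println(groupByCategory(mixed));  // HighTerm x2, then HighPhrase x2
  }
}
```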
I guess we could switch from measuring wall clock time to measuring CPU time using Java's JMX API. If we did that we wouldn't need to group tasks this way, and would possibly get a more realistic task mix in order to simulate cross-task effects in a mixed query load.
> I guess we could switch from measuring wall clock time to measuring CPU time using Java's JMX API. If we did that we wouldn't need to group tasks this way, and would possibly get a more realistic task mix in order to simulate cross-task effects in a mixed query load.
+1 to explore this! This would be better than elapsed wall clock time ... e.g. it would've prevented the whole confusion around lower reported QPS when multiple queries run concurrently (intra- and inter-query concurrency) ...
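For reference, a minimal self-contained sketch of measuring per-task CPU time on the current thread via the JMX `ThreadMXBean` (the counting loop is just a stand-in for one task's work):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Minimal sketch: time a task by CPU time rather than wall clock time, so
// concurrent work on other tasks in the same process doesn't pollute the
// measurement.
public class CpuTimeDemo {
  public static void main(String[] args) {
    ThreadMXBean bean = ManagementFactory.getThreadMXBean();
    if (bean.isThreadCpuTimeSupported()) {
      bean.setThreadCpuTimeEnabled(true);
    }
    long startNs = bean.getCurrentThreadCpuTime();
    long sum = 0;
    for (int i = 0; i < 50_000_000; i++) {  // stand-in for one task's work
      sum += i;
    }
    long cpuNs = bean.getCurrentThreadCpuTime() - startNs;
    System.out.println("task cpu ms=" + cpuNs / 1_000_000 + " (sum=" + sum + ")");
  }
}
```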
The thing that caused me to dig into this particular issue was watching the java process running `SearchPerfTest` with six worker threads spin at 100% CPU for quite a while (10 seconds maybe) before then jumping up to the (correct, saturated) 600% CPU ... when I pulled the thread stacks I could see that all threads at the start were working on the same single-threaded `Browse` facet counting tasks. So I was baffled why all threads were working on just that one task category for all the initial tasks ...
Hmm OK I tried reverting that change, so we no longer group by task category:
```diff
diff --git a/src/main/perf/SearchPerfTest.java b/src/main/perf/SearchPerfTest.java
index e77ceec..9983092 100755
--- a/src/main/perf/SearchPerfTest.java
+++ b/src/main/perf/SearchPerfTest.java
@@ -560,7 +560,7 @@ public class SearchPerfTest {
       final int taskRepeatCount = args.getInt("-taskRepeatCount");
       final int numTaskPerCat = args.getInt("-tasksPerCat");
       tasks = new LocalTaskSource(indexState, tasksFile, taskParserFactory.getTaskParser(), staticRandom, random,
-                                  numTaskPerCat, taskRepeatCount, doPKLookup, searchConcurrency != 0);
+                                  numTaskPerCat, taskRepeatCount, doPKLookup, false);
       System.out.println("Task repeat count " + taskRepeatCount);
       System.out.println("Tasks file " + tasksFile);
       System.out.println("Num task per cat " + numTaskPerCat);
```
Then I ran this `perf.py` (6 worker threads for intra-query concurrency, only one query in flight at once, picking 5 examples of each category and repeating each 20 times, i.e. 5 distinct HighTerm queries, each of which will have very different QPS):
```python
import competition

sourceData = competition.sourceData('wikimediumall')
comp = competition.Competition(taskRepeatCount=20, taskCountPerCat=5, verifyCounts=False)
comp.addNotTaskPattern('BrowseDateSSDVFacets')
# comp.addTaskPattern('BrowseDayOfYearTaxoFacets')
# comp.addTaskPattern('OrHighHigh')
comp.addTaskPattern('HighTerm')
comp.addTaskPattern('HighPhrase')
checkout = 'trunk'
index = comp.newIndex(checkout, sourceData, numThreads=12, addDVFields=True,
                      grouping=False, useCMS=True,
                      analyzer='StandardAnalyzerNoStopWords',
                      facets=(('taxonomy:Date', 'Date'),
                              ('taxonomy:Month', 'Month'),
                              ('taxonomy:DayOfYear', 'DayOfYear'),
                              ('taxonomy:RandomLabel.taxonomy', 'RandomLabel'),
                              ('sortedset:Month', 'Month'),
                              ('sortedset:DayOfYear', 'DayOfYear'),
                              ('sortedset:RandomLabel.sortedset', 'RandomLabel')))
comp.competitor('base1', checkout, index=index, pk=False, searchConcurrency=6, numThreads=1)
comp.competitor('base2', checkout, index=index, searchConcurrency=6, pk=False, numThreads=1)
comp.benchmark('atoa')
```
And got these results after 10 iterations:
```
                 Task    QPS base1      StdDev    QPS base2      StdDev              Pct diff    p-value
           HighPhrase       131.04      (4.1%)       130.13      (3.9%)   -0.7% ( -8% -   7%)      0.697
 HighTermTitleBDVSort         6.57      (5.2%)         6.60      (3.7%)    0.5% ( -7% -   9%)      0.817
             HighTerm       188.17      (6.2%)       190.00      (6.6%)    1.0% (-11% -  14%)      0.735
    HighTermMonthSort       898.48      (5.4%)       908.32      (2.4%)    1.1% ( -6% -   9%)      0.556
HighTermDayOfYearSort       242.70      (4.6%)       246.32      (5.2%)    1.5% ( -7% -  11%)      0.495
    HighTermTitleSort        43.92      (6.5%)        44.73      (9.4%)    1.8% (-13% -  18%)      0.612
```
I ran it again and got different QPS for each category because `luceneutil` by default will pick a different subset of tasks for each category, unless you pass a fixed `staticSeed` (see the sketch after this table):
```
                 Task    QPS base1      StdDev    QPS base2      StdDev              Pct diff    p-value
 HighTermTitleBDVSort         5.31      (9.4%)         5.10      (6.7%)   -4.0% (-18% -  13%)      0.280
             HighTerm       202.40      (4.5%)       194.93      (5.7%)   -3.7% (-13% -   6%)      0.110
    HighTermTitleSort        35.86      (6.2%)        35.76      (6.1%)   -0.3% (-11% -  12%)      0.917
HighTermDayOfYearSort       251.88      (3.0%)       251.36      (3.6%)   -0.2% ( -6% -   6%)      0.889
    HighTermMonthSort       920.78      (2.4%)       919.35      (1.3%)   -0.2% ( -3% -   3%)      0.857
           HighPhrase       166.13      (4.5%)       168.74      (2.9%)    1.6% ( -5% -   9%)      0.352
```
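For illustration, here's roughly how a fixed seed pins the per-category subset (a standalone sketch, not luceneutil's actual selection code; the task strings are hypothetical):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Sketch: with a fixed seed, the shuffled order (and thus the selected
// subset of tasks per category) is identical on every run; with a random
// seed, each run benchmarks a different subset, so per-category QPS is
// not comparable across runs.
public class SeededSubset {
  public static void main(String[] args) {
    List<String> highTermTasks =
        new ArrayList<>(List.of("q1", "q2", "q3", "q4", "q5", "q6"));
    long staticSeed = 42;  // fixed seed => same subset every run
    int tasksPerCat = 3;
    Collections.shuffle(highTermTasks, new Random(staticSeed));
    System.out.println(highTermTasks.subList(0, tasksPerCat));
  }
}
```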
> So this grouping change ensures that we run only tasks of the same type at the same time, enabling us to attribute the wall clock times in the logs to that task unambiguously.
Maybe this is also fixed by forcing `numThreads=1` (number of in-flight queries at once) whenever `searchConcurrency > 0`? I.e. the issue was that the N worker threads were executing work units from different task categories?
OK, I had wondered why the nightly benchmarks didn't show the (trappy) "effective QPS" slowdown when we enabled `searchConcurrency=1`, and it turns out ... it did! (It just had not successfully run since we enabled single-threaded intra-query concurrent search): https://home.apache.org/~mikemccand/lucenebench/Term.html
I'll change the nightly bench to use higher intra-query concurrency, with only one query in flight at once, so it doesn't take forever to run. But the reported effective QPS will be wonky (too high).
I think we really need to switch to true per-thread CPU time aggregation from the JMX bean. I'll open a spinoff for that ...
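Something like this (a sketch only, assuming we can collect the worker thread ids; not an actual luceneutil patch):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Sketch: sum CPU time across worker threads via the JMX ThreadMXBean,
// sampled e.g. before and after a task category runs; the delta is the
// CPU time actually spent on that category, regardless of concurrency.
public class PoolCpuTime {
  static long totalCpuNs(ThreadMXBean bean, long[] workerThreadIds) {
    long total = 0;
    for (long id : workerThreadIds) {
      long ns = bean.getThreadCpuTime(id);  // -1 if unsupported or thread gone
      if (ns > 0) {
        total += ns;
      }
    }
    return total;
  }

  public static void main(String[] args) {
    ThreadMXBean bean = ManagementFactory.getThreadMXBean();
    if (bean.isThreadCpuTimeSupported()) {
      bean.setThreadCpuTimeEnabled(true);
    }
    long[] ids = {Thread.currentThread().getId()};  // stand-in for worker ids
    System.out.println("total cpu ns=" + totalCpuNs(bean, ids));
  }
}
```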
Spinoff from https://github.com/apache/lucene/pull/13472 where we thought we saw massive gains from an improvement to how Lucene implements intra-query concurrency, but at least some of the gains were due to `luceneutil` trappiness. Some ideas:

- Rename `Competitor`'s `numThreads` to `numConcurrentQueries` (or maybe `numQueriesInFlight`?) to make it clear this controls how many concurrent queries are sent to `IndexSearcher`.
- If `searchConcurrency > 1`, maybe we should not allow `numThreads > 1`? This is so trappy because the reported QPS gain from running concurrently is trappily divided by `numThreads` (see details from this comment).
- Maybe measure a true red-line QPS in `luceneutil`? This would drive enough threads (concurrent queries or worker thread pool) to saturate CPU and/or IO. This is a tricky change tho ... we should open a separate spinoff. E.g. the precise set of tasks matters, their mix and order matters, reproducible arrival (Poisson-like) times matter, etc.
- Or the opposite: make sure `luceneutil` runs enough concurrent things (queries or work units) such that the test runs as efficiently as possible, but does not bump up against false contention on the box (i.e. does NOT saturate CPU or IO) that would mess up the results.
- Maybe `luceneutil` should not even report QPS but rather latency? It's confusing because it is an "effective QPS" (1/avg-of-median-query-wall-clock-time-after-discarding-warmup-and-outliers), not an actual "red-line capacity" (see the sketch below).

Also, strangely, if `searchConcurrency > 1`, we oddly group the tasks so that all tasks within a single cat are done at the same time (across all threads). This is weird, e.g. the `Browse*` tasks are single-threaded even if you ask for concurrency, and so `top` will show only 100% CPU used if you are running with six threads ...
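To make the "effective QPS" trap concrete, a tiny worked sketch with made-up median latencies:

```java
import java.util.Arrays;

// Sketch of the reported "effective QPS": roughly the inverse of the average
// per-task median wall clock latency (after discarding warmup/outliers).
// With intra-query concurrency the wall clock latency shrinks, so this
// number inflates even though the box does the same total work; it is
// NOT a red-line capacity measurement. The latencies below are made up.
public class EffectiveQps {
  public static void main(String[] args) {
    double[] medianLatencySec = {0.004, 0.005, 0.006};  // hypothetical medians
    double avg = Arrays.stream(medianLatencySec).average().orElseThrow();
    System.out.printf("effective QPS = %.1f%n", 1.0 / avg);  // ~200 QPS
  }
}
```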