Cache costly subqueries asynchronously [LUCENE-8213]

mikemccand commented 6 years ago

IndexOrDocValuesQuery allows to combine costly range queries with a selective lead iterator in an optimized way. However, the range query at some point gets cached by a querying thread in LRUQueryCache, which negates the optimization of IndexOrDocValuesQuery for that specific query.

It would be nice to see an asynchronous caching implementation in such cases, so that queries involving IndexOrDocValuesQuery would have consistent performance characteristics.

Legacy Jira details

LUCENE-8213 by Amir Hadadi on Mar 17 2018, updated Nov 28 2019 Attachments: 0001-Reproduce-across-segment-caching-of-same-query.patch, thetaphi_Lucene-Solr-master-Linux_24839.log.txt Linked issues:

LUCENE-9002

mikemccand commented 6 years ago

This problem applies to most queries actually. Creating cache entries asynchronously to prevent query caching from hurting latency is an interesting idea.

[Legacy Jira: Adrien Grand (@jpountz) on Mar 17 2018]

mikemccand commented 6 years ago

@jpountz, maxCostFactor can be used to decide whether to cache synchronously or asynchronously. Its default value can then be lowered from 100 to something close to 1.

WDYT?

[Legacy Jira: Amir Hadadi on Mar 18 2018]

mikemccand commented 6 years ago

It causes a latency hit regardless: using a threadpool doesnt matter. I don't think we should introduce additional threadpools because servers using lucene already use FAR too many threads, sorry.

If the user passes an executor to IndexSearcher, they've already said they don't have much throughput and they'd rather trade it for latency. So in that case i think its fine to use that threadpool for this.

[Legacy Jira: Robert Muir (@rmuir) on Mar 18 2018]

mikemccand commented 6 years ago

"It causes a latency hit regardless"

@rmuir I will clarify the issue. Assume you have these queries:

q1: term query that matches 1 doc

q2: range query that matches 10M docs

And you query for "q1 AND q2"

q2 would be an IndexOrDocValuesQuery and would use doc values to do a range check on the single doc that matches q1, so the BKD tree won't be scanned for 10M documents.

However, when q2 gets cached, the bit set for the entire 10M docs is cached, which causes "q1 AND q2" to suddenly spike in latency.

actually the recent introduction of maxCostFactor will prevent caching 10M docs since q1 matches only one term, but as maxCostFactor is 100 by default, a latency hit can occur in similar cases.

[Legacy Jira: Amir Hadadi on Mar 18 2018]

mikemccand commented 6 years ago

Your description sounds like maybe a different problem, where IndexOrDocValuesQuery chooses the wrong execution path to fill the "filter" bitset? If it uses docvalues instead of a BKD tree, because it must cache a large range that definitely seems like its bad.

[Legacy Jira: Robert Muir (@rmuir) on Mar 19 2018]

mikemccand commented 6 years ago

@rcmuir the issue is that the execution path for (q1 AND q2) depends on whether q2 gets cached or not. When q2 does not get cached, doc values are used to execute q2 and only the single document matching q1 is evaluated against the range. When q2 gets cached, it gets cached as a query that stands by itself, i.e. not in the context of (q1 AND q2). So the entire 10M documents that q2 matches are scanned in the BKD tree and get cached to a bit set. To protect against the caching of q2 causing the latency of (q1 AND q2) to be too high, @jpountz added maxCostFactor. This factor checks whether the cost of caching q2 is higher by more than maxCostFactor than the cost of evaluating (q1 AND q2). This is the relevant code from LRUQueryCache:

double costFactor = (double) inSupplier.cost() / leadCost;
if (costFactor >= maxCostFactor) {
  // too costly, caching might make the query much slower
  return inSupplier.get(leadCost);
}

My suggestion is to always evaluate (q1 AND q2) using the optimal execution path, and cache q2 asynchrounously. A refinement is to cache q2 synchronously if the cost of caching it is not too high.

[Legacy Jira: Amir Hadadi on Mar 19 2018]

mikemccand commented 6 years ago

I didn't remember pushing this maxCostFactor prototype I had been working on and indeed I had pushed it inadvertently with changes on LUCENE-8008. It has one issue that I didn't like which is that you can't use the bulk scorer for caching anymore, which matters for disjunctions.

Regarding this issue, maybe one way to make it use the executor from IndexSearcher would be to add an executor parameter to QueryCache.doCache so that IndexSearcher can pass its executor to the cache.

[Legacy Jira: Adrien Grand (@jpountz) on Mar 20 2018]

mikemccand commented 6 years ago

@jpountz Currently if an executor is used in IndexSearcher, it means that segments should be queried in parallel.

I guess you wouldn't want to conflate parallel segments querying and async caching.

[Legacy Jira: Amir Hadadi on Mar 20 2018]

mikemccand commented 6 years ago

Actually I think we want this: both parallel segments querying and async caching are about trading throughput for latency?

[Legacy Jira: Adrien Grand (@jpountz) on Mar 21 2018]

mikemccand commented 6 years ago

Indeed both are about trading throughput for latency.

However, there is a quantitative difference:

in parallel segments querying you would slice your index to e.g. 5 slices on each and every query.

async caching would happen only when caching is needed, and even then, only when the ratio between the caching cost and the leading query cost is big enough to justify async execution.

I would expect the additional async tasks triggered by async caching to be 100x less than parallel segments querying tasks.

Coupling these features together would mean that if someone is not willing to pay the overhead of parallel segments querying, he will not be able to use async caching.

[Legacy Jira: Amir Hadadi on Mar 21 2018]

mikemccand commented 6 years ago

You can still create an IndexSearcher with a single slice and/or wrap the cache to use a custom threadpool instead of the one from IndexSearcher, this is fine. I'd just like to make common use-cases, such as trading throughput for latency, simple.

[Legacy Jira: Adrien Grand (@jpountz) on Mar 21 2018]

mikemccand commented 5 years ago

I raised a PR for the same. The performance number from newly enhanced luceneutil for wikimedium10m are:

Latencies:

Task ('Wildcard', None)	P50 Base 2.045201	P50 Cmp 2.089539	Pct Diff 2.16790427933	P90 Base 18.845334	P90 Cmp 35.346911	Pct Diff 87.5631973411	P99 Base 83.02344	P99 Cmp 48.300884	Pct Diff -41.8225937157	P999 Base 83.02344	P999 Cmp 48.300884	Pct Diff -41.8225937157	P100 Base 249.902876	P100 Cmp 87.512667	Pct Diff -64.9813285862
Task ('HighTermDayOfYearSort', 'DayOfYear')	P50 Base 4.295828	P50 Cmp 4.727759	Pct Diff 10.0546623375	P90 Base 9.037488	P90 Cmp 55.639159	Pct Diff 515.648496573	P99 Base 82.149576	P99 Cmp 81.261365	Pct Diff -1.08121191033	P999 Base 82.149576	P999 Cmp 81.261365	Pct Diff -1.08121191033	P100 Base 86.642014	P100 Cmp 168.84768	Pct Diff 94.8796804285
Task ('MedSloppyPhrase', None)	P50 Base 9.18549	P50 Cmp 8.683321	Pct Diff -5.46698107559	P90 Base 29.233836	P90 Cmp 30.984274	Pct Diff 5.98771232075	P99 Base 34.303039	P99 Cmp 35.978633	Pct Diff 4.88468091705	P999 Base 34.303039	P999 Cmp 35.978633	Pct Diff 4.88468091705	P100 Base 181.426025	P100 Cmp 261.742214	Pct Diff 44.2693869306
Task ('OrHighHigh', None)	P50 Base 20.997779	P50 Cmp 16.938239	Pct Diff -19.3331875719	P90 Base 26.989668	P90 Cmp 29.711731	Pct Diff 10.0855742279	P99 Base 71.1345	P99 Cmp 72.914457	Pct Diff 2.50224152837	P999 Base 71.1345	P999 Cmp 72.914457	Pct Diff 2.50224152837	P100 Base 288.85441	P100 Cmp 203.02949	Pct Diff -29.7121723016
Task ('MedPhrase', None)	P50 Base 6.935508	P50 Cmp 6.676061	Pct Diff -3.74085070625	P90 Base 8.834132	P90 Cmp 7.366097	Pct Diff -16.6177616545	P99 Base 61.645788	P99 Cmp 59.423887	Pct Diff -3.60430302229	P999 Base 61.645788	P999 Cmp 59.423887	Pct Diff -3.60430302229	P100 Base 65.592528	P100 Cmp 63.493249	Pct Diff -3.20048496987
Task ('LowSpanNear', None)	P50 Base 23.256239	P50 Cmp 23.17936	Pct Diff -0.330573658105	P90 Base 33.890598	P90 Cmp 34.205568	Pct Diff 0.929372801271	P99 Base 34.958863	P99 Cmp 34.857876	Pct Diff -0.288873811485	P999 Base 34.958863	P999 Cmp 34.857876	Pct Diff -0.288873811485	P100 Base 96.937787	P100 Cmp 121.889403	Pct Diff 25.7398242442
Task ('Fuzzy2', None)	P50 Base 25.45292	P50 Cmp 25.25128	Pct Diff -0.792207730979	P90 Base 79.376572	P90 Cmp 106.649481	Pct Diff 34.3588899254	P99 Base 108.933154	P99 Cmp 122.051216	Pct Diff 12.0423044026	P999 Base 108.933154	P999 Cmp 122.051216	Pct Diff 12.0423044026	P100 Base 212.373308	P100 Cmp 209.138442	Pct Diff -1.52319800942
Task ('OrNotHighHigh', None)	P50 Base 1.903331	P50 Cmp 2.16024	Pct Diff 13.4978624317	P90 Base 4.890325	P90 Cmp 4.723459	Pct Diff -3.4121658581	P99 Base 102.556452	P99 Cmp 102.641448	Pct Diff 0.0828772820651	P999 Base 102.556452	P999 Cmp 102.641448	Pct Diff 0.0828772820651	P100 Base 226.783706	P100 Cmp 308.709148	Pct Diff 36.1249242483
Task ('OrHighNotLow', None)	P50 Base 1.434646	P50 Cmp 1.52378	Pct Diff 6.21296124619	P90 Base 3.905319	P90 Cmp 4.569729	Pct Diff 17.0129507986	P99 Base 6.321682	P99 Cmp 7.281513	Pct Diff 15.1831585328	P999 Base 6.321682	P999 Cmp 7.281513	Pct Diff 15.1831585328	P100 Base 7.720665	P100 Cmp 15.035781	Pct Diff 94.7472270847
Task ('BrowseMonthSSDVFacets', None)	P50 Base 93.940495	P50 Cmp 93.939183	Pct Diff -0.00139662879145	P90 Base 102.50354	P90 Cmp 98.604983	Pct Diff -3.80333888956	P99 Base 103.572854	P99 Cmp 106.785928	Pct Diff 3.10223564951	P999 Base 103.572854	P999 Cmp 106.785928	Pct Diff 3.10223564951	P100 Base 283.457123	P100 Cmp 244.054099	Pct Diff -13.9008762888
Task ('Fuzzy1', None)	P50 Base 26.559456	P50 Cmp 29.050383	Pct Diff 9.37868230434	P90 Base 159.424881	P90 Cmp 171.063113	Pct Diff 7.30013529068	P99 Base 339.7673	P99 Cmp 179.733118	Pct Diff -47.1011136151	P999 Base 339.7673	P999 Cmp 179.733118	Pct Diff -47.1011136151	P100 Base 417.349072	P100 Cmp 395.168736	Pct Diff -5.31457657105
Task ('HighSloppyPhrase', None)	P50 Base 9.489382	P50 Cmp 9.980939	Pct Diff 5.18007389733	P90 Base 14.424659	P90 Cmp 15.315198	Pct Diff 6.17372653315	P99 Base 37.046395	P99 Cmp 31.348423	Pct Diff -15.380638251	P999 Base 37.046395	P999 Cmp 31.348423	Pct Diff -15.380638251	P100 Base 51.797966	P100 Cmp 33.660774	Pct Diff -35.0152590934
Task ('OrNotHighMed', None)	P50 Base 1.605631	P50 Cmp 1.549948	Pct Diff -3.46798236955	P90 Base 16.030506	P90 Cmp 11.175798	Pct Diff -30.2841844169	P99 Base 63.933462	P99 Cmp 63.33348	Pct Diff -0.938447537848	P999 Base 63.933462	P999 Cmp 63.33348	Pct Diff -0.938447537848	P100 Base 176.946354	P100 Cmp 141.426012	Pct Diff -20.0740739761
Task ('LowPhrase', None)	P50 Base 5.294719	P50 Cmp 5.398776	Pct Diff 1.96529787511	P90 Base 14.581045	P90 Cmp 14.201875	Pct Diff -2.6004309019	P99 Base 28.226324	P99 Cmp 17.160324	Pct Diff -39.2045382884	P999 Base 28.226324	P999 Cmp 17.160324	Pct Diff -39.2045382884	P100 Base 170.879755	P100 Cmp 141.494236	Pct Diff -17.1966064675
Task ('AndHighLow', None)	P50 Base 1.407978	P50 Cmp 1.356986	Pct Diff -3.62164749733	P90 Base 4.452746	P90 Cmp 6.676942	Pct Diff 49.9511088214	P99 Base 69.005964	P99 Cmp 69.618355	Pct Diff 0.887446482162	P999 Base 69.005964	P999 Cmp 69.618355	Pct Diff 0.887446482162	P100 Base 302.690572	P100 Cmp 288.210249	Pct Diff -4.7838698458
Task ('LowTerm', None)	P50 Base 0.94373	P50 Cmp 0.984236	Pct Diff 4.29211744885	P90 Base 1.776916	P90 Cmp 1.95572	Pct Diff 10.0626028467	P99 Base 2.130259	P99 Cmp 2.967897	Pct Diff 39.3209464201	P999 Base 2.130259	P999 Cmp 2.967897	Pct Diff 39.3209464201	P100 Base 3.471183	P100 Cmp 11.055948	Pct Diff 218.50663016
Task ('BrowseDayOfYearSSDVFacets', None)	P50 Base 126.336025	P50 Cmp 125.101777	Pct Diff -0.976956493605	P90 Base 135.381181	P90 Cmp 137.867944	Pct Diff 1.8368601763	P99 Base 144.330923	P99 Cmp 144.977097	Pct Diff 0.447703088547	P999 Base 144.330923	P999 Cmp 144.977097	Pct Diff 0.447703088547	P100 Base 295.783063	P100 Cmp 302.394898	Pct Diff 2.23536632995
Task ('AndHighHigh', None)	P50 Base 5.089657	P50 Cmp 5.20707	Pct Diff 2.30689415809	P90 Base 6.822905	P90 Cmp 8.306661	Pct Diff 21.7466900096	P99 Base 7.038109	P99 Cmp 9.130519	Pct Diff 29.7297185934	P999 Base 7.038109	P999 Cmp 9.130519	Pct Diff 29.7297185934	P100 Base 111.532214	P100 Cmp 36.808683	Pct Diff -66.9972632302
Task ('HighTermMonthSort', 'Month')	P50 Base 2.42148	P50 Cmp 2.464948	Pct Diff 1.79510051704	P90 Base 4.411412	P90 Cmp 4.012971	Pct Diff -9.03205141574	P99 Base 4.993632	P99 Cmp 4.667476	Pct Diff -6.53143844	P999 Base 4.993632	P999 Cmp 4.667476	Pct Diff -6.53143844	P100 Base 98.675367	P100 Cmp 94.641052	Pct Diff -4.08847225265
Task Respell	P50 Base 17.339473	P50 Cmp 16.495659	Pct Diff -4.86643394525	P90 Base 202.898846	P90 Cmp 190.452875	Pct Diff -6.13407678031	P99 Base 232.699615	P99 Cmp 206.63496	Pct Diff -11.2009875908	P999 Base 232.699615	P999 Cmp 206.63496	Pct Diff -11.2009875908	P100 Base 259.083639	P100 Cmp 261.411887	Pct Diff 0.89864725113
Task ('MedTerm', None)	P50 Base 1.044861	P50 Cmp 1.046421	Pct Diff 0.149302155981	P90 Base 10.095498	P90 Cmp 2.086543	Pct Diff -79.3319457842	P99 Base 31.02937	P99 Cmp 4.311224	Pct Diff -86.1059892611	P999 Base 31.02937	P999 Cmp 4.311224	Pct Diff -86.1059892611	P100 Base 37.048313	P100 Cmp 30.992555	Pct Diff -16.3455701748
Task ('HighIntervalsOrdered', None)	P50 Base 18.204528	P50 Cmp 18.50288	Pct Diff 1.63888896213	P90 Base 28.38481	P90 Cmp 28.186101	Pct Diff -0.70005400776	P99 Base 28.553312	P99 Cmp 30.147999	Pct Diff 5.58494580243	P999 Base 28.553312	P999 Cmp 30.147999	Pct Diff 5.58494580243	P100 Base 85.526627	P100 Cmp 248.898567	Pct Diff 191.018804004
Task ('HighSpanNear', None)	P50 Base 19.317909	P50 Cmp 18.448229	Pct Diff -4.50193651911	P90 Base 26.886654	P90 Cmp 27.011485	Pct Diff 0.464286110127	P99 Base 27.066426	P99 Cmp 28.519568	Pct Diff 5.36879896888	P999 Base 27.066426	P999 Cmp 28.519568	Pct Diff 5.36879896888	P100 Base 27.949757	P100 Cmp 28.532982	Pct Diff 2.08669077159
Task ('LowSloppyPhrase', None)	P50 Base 17.995202	P50 Cmp 17.321666	Pct Diff -3.74286434795	P90 Base 27.136218	P90 Cmp 23.990793	Pct Diff -11.5912431128	P99 Base 27.177319	P99 Cmp 25.467183	Pct Diff -6.29251178161	P999 Base 27.177319	P999 Cmp 25.467183	Pct Diff -6.29251178161	P100 Base 37.041781	P100 Cmp 26.387058	Pct Diff -28.7640677968
Task ('OrHighNotHigh', None)	P50 Base 1.721541	P50 Cmp 1.683098	Pct Diff -2.23305747583	P90 Base 2.334157	P90 Cmp 2.147467	Pct Diff -7.99817664364	P99 Base 3.127609	P99 Cmp 3.205909	Pct Diff 2.50350986968	P999 Base 3.127609	P999 Cmp 3.205909	Pct Diff 2.50350986968	P100 Base 10.177133	P100 Cmp 5.218394	Pct Diff -48.7243214764
Task ('OrNotHighLow', None)	P50 Base 1.467553	P50 Cmp 1.4717	Pct Diff 0.282579232232	P90 Base 11.130311	P90 Cmp 9.27055	Pct Diff -16.7089760565	P99 Base 12.92663	P99 Cmp 12.305485	Pct Diff -4.80515803423	P999 Base 12.92663	P999 Cmp 12.305485	Pct Diff -4.80515803423	P100 Base 41.801941	P100 Cmp 42.736573	Pct Diff 2.23585789952
Task ('MedSpanNear', None)	P50 Base 8.050703	P50 Cmp 8.0656	Pct Diff 0.18503974125	P90 Base 12.37128	P90 Cmp 11.972162	Pct Diff -3.22616576458	P99 Base 13.276762	P99 Cmp 13.410559	Pct Diff 1.00775324586	P999 Base 13.276762	P999 Cmp 13.410559	Pct Diff 1.00775324586	P100 Base 389.340697	P100 Cmp 524.654134	Pct Diff 34.7545062827
Task ('BrowseDateTaxoFacets', None)	P50 Base 429.062505	P50 Cmp 428.859358	Pct Diff -0.0473467146704	P90 Base 436.499827	P90 Cmp 441.680746	Pct Diff 1.18692349447	P99 Base 458.955056	P99 Cmp 443.442175	Pct Diff -3.38004360061	P999 Base 458.955056	P999 Cmp 443.442175	Pct Diff -3.38004360061	P100 Base 516.629721	P100 Cmp 518.247598	Pct Diff 0.313159877227
Task ('HighPhrase', None)	P50 Base 4.346807	P50 Cmp 4.173837	Pct Diff -3.9792426947	P90 Base 5.762266	P90 Cmp 6.035056	Pct Diff 4.7340751017	P99 Base 7.09179	P99 Cmp 8.179865	Pct Diff 15.3427413953	P999 Base 7.09179	P999 Cmp 8.179865	Pct Diff 15.3427413953	P100 Base 25.682734	P100 Cmp 26.872466	Pct Diff 4.63241958586
Task ('OrHighLow', None)	P50 Base 2.360377	P50 Cmp 2.373989	Pct Diff 0.576687537626	P90 Base 2.920517	P90 Cmp 4.037781	Pct Diff 38.2556923997	P99 Base 4.226431	P99 Cmp 4.511819	Pct Diff 6.75245851642	P999 Base 4.226431	P999 Cmp 4.511819	Pct Diff 6.75245851642	P100 Base 48.928035	P100 Cmp 13.15068	Pct Diff -73.1224031376
Task ('BrowseDayOfYearTaxoFacets', None)	P50 Base 0.168513	P50 Cmp 0.182987	Pct Diff 8.58924830725	P90 Base 0.204691	P90 Cmp 0.247465	Pct Diff 20.8968640536	P99 Base 15.455504	P99 Cmp 15.315246	Pct Diff -0.907495478633	P999 Base 15.455504	P999 Cmp 15.315246	Pct Diff -0.907495478633	P100 Base 16.866213	P100 Cmp 19.937509	Pct Diff 18.2097546142
Task ('OrHighMed', None)	P50 Base 3.574777	P50 Cmp 3.231059	Pct Diff -9.6150892769	P90 Base 24.188598	P90 Cmp 4.638494	Pct Diff -80.8236343421	P99 Base 46.693976	P99 Cmp 8.113618	Pct Diff -82.6238442406	P999 Base 46.693976	P999 Cmp 8.113618	Pct Diff -82.6238442406	P100 Base 269.409786	P100 Cmp 23.431229	Pct Diff -91.3027550529
Task ('IntNRQ', None)	P50 Base 4.011057	P50 Cmp 4.923845	Pct Diff 22.7567945307	P90 Base 13.346714	P90 Cmp 12.29044	Pct Diff -7.91411279211	P99 Base 85.073004	P99 Cmp 67.905295	Pct Diff -20.1799727208	P999 Base 85.073004	P999 Cmp 67.905295	Pct Diff -20.1799727208	P100 Base 191.744926	P100 Cmp 194.853114	Pct Diff 1.62100143396
Task ('BrowseMonthTaxoFacets', None)	P50 Base 0.170601	P50 Cmp 0.172844	Pct Diff 1.31476368837	P90 Base 0.212197	P90 Cmp 0.260394	Pct Diff 22.7133277096	P99 Base 0.227793	P99 Cmp 4.692225	Pct Diff 1959.86356034	P999 Base 0.227793	P999 Cmp 4.692225	Pct Diff 1959.86356034	P100 Base 10.00639	P100 Cmp 9.966375	Pct Diff -0.399894467435
Task ('AndHighMed', None)	P50 Base 3.367039	P50 Cmp 3.095126	Pct Diff -8.07573063454	P90 Base 6.445771	P90 Cmp 4.890087	Pct Diff -24.1349560821	P99 Base 6.646047	P99 Cmp 6.78425	Pct Diff 2.07947671751	P999 Base 6.646047	P999 Cmp 6.78425	Pct Diff 2.07947671751	P100 Base 42.892862	P100 Cmp 96.998233	Pct Diff 126.140734092
Task ('OrHighNotMed', None)	P50 Base 1.75091	P50 Cmp 1.851148	Pct Diff 5.72490876173	P90 Base 3.059898	P90 Cmp 3.291761	Pct Diff 7.57747480472	P99 Base 5.304103	P99 Cmp 4.755972	Pct Diff -10.3340941909	P999 Base 5.304103	P999 Cmp 4.755972	Pct Diff -10.3340941909	P100 Base 21.934264	P100 Cmp 16.361938	Pct Diff -25.4046636805
Task ('Prefix3', None)	P50 Base 15.061956	P50 Cmp 14.031055	Pct Diff -6.84440321031	P90 Base 28.636916	P90 Cmp 25.283538	Pct Diff -11.7099830163	P99 Base 425.912794	P99 Cmp 443.836014	Pct Diff 4.20819009255	P999 Base 425.912794	P999 Cmp 443.836014	Pct Diff 4.20819009255	P100 Base 497.136127	P100 Cmp 551.753833	Pct Diff 10.9864689033
Task PKLookup	P50 Base 11.334723	P50 Cmp 10.857995	Pct Diff -4.20590781089	P90 Base 89.284901	P90 Cmp 32.062856	Pct Diff -64.0892741764	P99 Base 334.811782	P99 Cmp 73.551516	Pct Diff -78.0319809654	P999 Base 334.811782	P999 Cmp 73.551516	Pct Diff -78.0319809654	P100 Base 363.062965	P100 Cmp 418.731587	Pct Diff 15.3330489107
Task ('HighTerm', None)	P50 Base 1.148459	P50 Cmp 1.212344	Pct Diff 5.56267137094	P90 Base 1.54	P90 Cmp 1.915795	Pct Diff 24.4022727273	P99 Base 2.138094	P99 Cmp 3.134936	Pct Diff 46.622926775	P999 Base 2.138094	P999 Cmp 3.134936	Pct Diff 46.622926775	P100 Base 2.722231	P100 Cmp 153.54749	Pct Diff 5540.50185308

[Legacy Jira: Atri Sharma (@atris) on Aug 05 2019]

mikemccand commented 5 years ago

Commit 0dfbf557bae67acaf9c920cb51a04b7c7341d621 in lucene-solr's branch refs/heads/master from Atri Sharma https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=0dfbf55

LUCENE-8213: Introduce Asynchronous Caching in LRUQueryCache (#815)

Introduce asynchronous caching for heavy queries using the passed in Executor to IndexSearcher.

[Legacy Jira: ASF subversion and git services on Sep 28 2019]

mikemccand commented 5 years ago

Hi @atris,

This commit creates some CI errors:

ant test  -Dtestcase=TestLRUQueryCache -Dtests.method=testLRUEviction -Dtests.seed=DC3A333680E821EB -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=en-US -Dtests.timezone=Africa/Luanda -Dtests.asserts=true -Dtests.file.encoding=US-ASCII

[Legacy Jira: Ignacio Vera (@iverase) on Sep 30 2019]

mikemccand commented 5 years ago

Taking a look – although I am unable to reproduce with a simple ant test.

Can you point me to the CI link so that I can dive deeper into the error output?

[Legacy Jira: Atri Sharma (@atris) on Sep 30 2019]

mikemccand commented 5 years ago

Here is the failure:

https://elasticsearch-ci.elastic.co/job/apache+lucene-solr+master/5526/console

I had a look and it is a race condition. We are checking if the query has been cached just after executing a query but if doing it async, it might happen that the query is still not there.

[Legacy Jira: Ignacio Vera (@iverase) on Sep 30 2019]

mikemccand commented 5 years ago

Interesting – I did not realise that testLRUEviction could also cause LRUQueryCache to cache asynchronously, hence did not update it to handle the same (in the manner testLRUConcurrentLoadAndEviction does).

I have pushed a test fix now – beasted the test 50 times with the seed you provided, and also beasted the entire TestLRUQueryCache suite 20 times with the seed. Ran the entire Lucene test suite – came in clean.

It is curious to note that I could not reproduce the test failure without the seed even after running multiple times – kudos to the CI!

[Legacy Jira: Atri Sharma (@atris) on Sep 30 2019]

mikemccand commented 5 years ago

Kudos to randomized testing philosophy!

[Legacy Jira: David Smiley (@dsmiley) on Oct 01 2019]

mikemccand commented 5 years ago

Hi @atris,

We have another error in CI that I think it is related to this issue. You can reproduce it by running the following command:

ant beast -Dbeast.iters=10 -Dtests.dups=6  -Dtestcase=TestLatLonShape -Dtests.method=testLUCENE8736* -Dtests.seed=D784C9A46B62FB84 -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=brx -Dtests.timezone=America/Guatemala -Dtests.asserts=true -Dtests.file.encoding=US-ASCII

[Legacy Jira: Ignacio Vera (@iverase) on Oct 01 2019]

mikemccand commented 5 years ago

Opened https://github.com/apache/lucene-solr/pull/912

Curious – looks like the query being cached is insanely heavy – it takes more than 200ms to cache!

[Legacy Jira: Atri Sharma (@atris) on Oct 01 2019]

mikemccand commented 5 years ago

@ivera , Regarding the other failure in TestLRUQueryCache, I investigated and that is a definite case of async caching in action – the cache's cache count will not be updated until the async loading completes, before which the test cannot check the value in a deterministic manner. The solution is to retry in case the cache count does not match the target. Making that change lead to a clean beast of TestLRUQueryCache 50 times. I will push the change shortly.

[Legacy Jira: Atri Sharma (@atris) on Oct 01 2019]

mikemccand commented 5 years ago

Wait @atris

I don't think this is a proper solution as it means that depending when you query you will get a different result. I think is wrong as queries should always return the same hits regardless of the cache.

[Legacy Jira: Ignacio Vera (@iverase) on Oct 01 2019]

mikemccand commented 5 years ago

@ivera, I am not referring to the LatLonShape failure – I agree with you that the test should be fine even with async caching since it directly checks the result count.

However, for testMinSegmentSizePredicate, the failing assertion is on the cache's cache count – which is directly dependent on the race between the call and the completion of the async load (as the count is incremented only when the async load completes).

Hence, for testMinSegmentSizePredicate, the correct fix would be to retry the check a few times before marking a test failure. WDYT?

[Legacy Jira: Atri Sharma (@atris) on Oct 01 2019]

mikemccand commented 5 years ago

The inFlightAsyncLoadQueries logic doesn't look right. It seems like it would prevent caching on one segment if another segment is already caching the same query?

Also we shouldn't rely on Thread#sleep in tests, can we get rid of them?

[Legacy Jira: Adrien Grand (@jpountz) on Oct 01 2019]

mikemccand commented 5 years ago

@jpountz Good catch, My understanding was that the cache was always per segment, so just using the query as a cache key should be fine. Thanks for highlighting this issue.

Should I revert this commit?

Would using the incoming LeafReaderContext's top level ord + the query's key as the key in inFlightAsyncLoadQueries suffice to ensure that segments get their independent space in the cache?

RE: Thread#sleep, I agree that it is not a pretty pattern. Maybe replacing a CountDownLatch will work?

[Legacy Jira: Atri Sharma (@atris) on Oct 01 2019]

mikemccand commented 5 years ago

Attached is a patch which reproduces the issue @jpountz mentioned in a unit test. Using a Pair of Query + Segment's ID object as the Set (inFlightAsyncLoadQueries)'s key makes the test pass.

0001-Reproduce-across-segment-caching-of-same-query.patch

[Legacy Jira: Atri Sharma (@atris) on Oct 01 2019]

mikemccand commented 5 years ago

Should I revert this commit?

It would probably make things easier indeed, this way we can take the time to think about how to address the various issues.

Would using the incoming LeafReaderContext's top level ord + the query's key as the key in inFlightAsyncLoadQueries suffice to ensure that segments get their independent space in the cache?

I don't think the ord is enough as the query cache can be shared across multiple readers. This is what Elasticsearch does for instance, the cache is shared across all shards that are allocated on the same node. This issue is not specific to asynchronous caching, it could already happen before that the same query gets cached twice concurrently, which is wasteful. Maybe we should ignore this issue for now and work on addressing it as a follow-up.

RE: Thread#sleep, I agree that it is not a pretty pattern. Maybe replacing a CountDownLatch will work?

Anything that gives a signal when the caching is done instead of busy waiting would be better. One easy way to do this might be to shutdown the pool and call awaitTermination?

[Legacy Jira: Adrien Grand (@jpountz) on Oct 02 2019]

mikemccand commented 5 years ago

Hi @jpountz

It would probably make things easier indeed, this way we can take the time to think about how to address the various issues.

Sure, happy to revert. I managed to create a standalone unit test which reproduces this case, though, and posted a patch last night. I am in process of raising a PR with the fix and the test – would you want to take a look at that, or should I revert right away?

I don't think the ord is enough as the query cache can be shared across multiple readers

One way I hacked around this was to use the LeafReaderContext's ID field + the query as the unique identifier for the cache. This was inspired by the way CompositeReaderContext uses its ID field. Does this sounds a viable approach?

[Legacy Jira: Atri Sharma (@atris) on Oct 02 2019]

mikemccand commented 5 years ago

I'd rather revert to stop test failures as soon as possible. I'll take time to look at the PR later today.

Ideally, we'd use the core cache key of the reader? See LeafReader#getCoreCacheHelper.

[Legacy Jira: Adrien Grand (@jpountz) on Oct 02 2019]

mikemccand commented 5 years ago

Commit cd09808f6b54b33e350a0739dccb602cb270abad in lucene-solr's branch refs/heads/revert-815-LUCENE-8213 from Atri Sharma https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=cd09808

Revert "LUCENE-8213: Introduce Asynchronous Caching in LRUQueryCache (#815)"

This reverts commit 0dfbf557bae67acaf9c920cb51a04b7c7341d621.

[Legacy Jira: ASF subversion and git services on Oct 02 2019]

mikemccand commented 5 years ago

Commit 302cd09b4ce7bd3049f8480287b3dd03bb838b0d in lucene-solr's branch refs/heads/master from Atri Sharma https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=302cd09

Revert "LUCENE-8213: Introduce Asynchronous Caching in LRUQueryCache (#815)" (#914)

This reverts commit 0dfbf557bae67acaf9c920cb51a04b7c7341d621.

[Legacy Jira: ASF subversion and git services on Oct 02 2019]

mikemccand commented 5 years ago

@jpountz Reverted.

I will use CacheHelper's core cache key + the query cache key as the inFlightAsyncMap's key

[Legacy Jira: Atri Sharma (@atris) on Oct 02 2019]

mikemccand commented 5 years ago

Can we do this as a separate change that applies to both the concurrent and the serial case?

[Legacy Jira: Adrien Grand (@jpountz) on Oct 02 2019]

mikemccand commented 5 years ago

@jpountz I am not sure I follow – inFlightAsyncQueries map will only be involved in the concurrent load case? I am not sure I get your thoughts behind fixing the serial case.

[Legacy Jira: Atri Sharma (@atris) on Oct 02 2019]

mikemccand commented 5 years ago

My thinking is that this is an attempt to avoid being wasteful by caching twice the same query concurrently. However this may not only occur when IndexSearcher is configured with an executor. This also happens when IndexSearcher is configured without an executor and two concurrent requests try to cache the same query. The cache simply ignores this issue today and we would generate a cache entry twice. So to make things easier to review I was suggesting ignoring this problem as well for the case that IndexSearcher is configured with an executor, and coming back later to this problem with a fix that works for all cases.

[Legacy Jira: Adrien Grand (@jpountz) on Oct 02 2019]

mikemccand commented 5 years ago

@jpountz Just noticed your latest comment – I raised a PR (https://github.com/apache/lucene-solr/pull/916) for this change – it includes the logic to use CacheKey + Query as the map's key as well (sorry about that).

Please let me know if this logic looks fine – I will raise another PR to use that combination as the key for the serial case.

[Legacy Jira: Atri Sharma (@atris) on Oct 02 2019]

mikemccand commented 5 years ago

FWIW: I just attached thetaphi_Lucene-Solr-master-Linux_24839.log.txt, a jenkins log that shows two other failures that seem to be related to this issue...

Checking out Revision 3c399bb696073bdb30f278309410a50effabd0e7 (refs/remotes/origin/master)
...
   [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestLatLonDocValuesQueries -Dtests.method=testAllLatEqual -Dtests.seed=8D56B48917EDB35F -Dtests.multiplier=3 -Dtests.slow=true -Dtests.locale=zh-Hans -Dtests.timezone=America/Bahia -Dtests.asserts=true -Dtests.file.encoding=US-ASCII
   [junit4] FAILURE 2.67s J0 | TestLatLonDocValuesQueries.testAllLatEqual <<<
   [junit4]    > Throwable #1: java.lang.AssertionError: wrong hit (first of possibly more):
   [junit4]    > FAIL: id=235 should match but did not
   [junit4]    >   query=point:polygons([[0.0, 1.401298464324817E-45] [35.0, 1.401298464324817E-45] [35.0, 180.0] [0.0, 180.0] [0.0, 1.401298464324817E-45] ]) docID=227
   [junit4]    >   lat=0.0 lon=41.82071185670793
   [junit4]    >   deleted?=false  polygon=[0.0, 1.401298464324817E-45] [35.0, 1.401298464324817E-45] [35.0, 180.0] [0.0, 180.0] [0.0, 1.401298464324817E-45] 
...
   [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestFieldCacheSortRandom -Dtests.method=testRandomStringSort -Dtests.seed=F7E42E6905E37945 -Dtests.multiplier=3 -Dtests.slow=true -Dtests.locale=khq-ML -Dtests.timezone=America/Chihuahua -Dtests.asserts=true -Dtests.file.encoding=UTF-8
   [junit4] FAILURE 0.15s J1 | TestFieldCacheSortRandom.testRandomStringSort <<<
   [junit4]    > Throwable #1: java.lang.AssertionError: expected:<[65 76 65 64 6d 68 76 68 75 77 61 71 64 63 65 63 7a 79 77 77 63 69 71 62 76 70 7a 62 6d 66 75 67 61]> but was:<[66 6c 75 6e 78 63 67 66 63 77 7a 6b 69 6d 7a 77 62 6c 71 61 79 61 67 71 6f 69 66 71 64 6a 66]>

both of those seeds seem to fail reliably against GIT:3c399bb6960 (the version checked out by the jenkins run) but both seem to start passing reliably as of GIT:302cd09b4ce (when this issue was reverted)

I should point out however in case it's helpful: when the failures reproduce, the specifics of the failures are always different – suggesting that parallel thread execution is affecting the results, since the randomization of the index data should be deterministic based on the seed.

[Legacy Jira: Chris M. Hostetter on Oct 02 2019]

mikemccand commented 5 years ago

Thanks @hossman

Interestingly, all but one test failures are coming from LatLon queries – is there anything special about them?

[Legacy Jira: Atri Sharma (@atris) on Oct 03 2019]

mikemccand commented 5 years ago

I don't think these queries have anything special besides being a bit slow.

[Legacy Jira: Adrien Grand (@jpountz) on Oct 04 2019]

mikemccand commented 5 years ago

Sorry to distract. I am trying to understand LRUQueryCache as it has a lot of excellent engineering invested into it. For asynchronous caching without duplicate loads, wouldn't a CompletableFuture as the cached value be suitable? The cache doesn't handle the cache stampedes, so the same entry may be computed twice with the first to insert as the winner, and the other threads dropping their work. Ideally by using memoization the other threads would defer to the first and avoid the costly operation. Similarly the cache lookup is skipped if the global lock is held and the value computed, but not stored, which means a busy lock reduces performance across all usages.

Would there be a benefit/liability in using a dedicated caching library like Caffeine? That does handle concurrency in a smoother fashion, which could be beneficial here. I am also curious if the hit ratios of LRU could be improved upon as search problems tend to have a strong frequency bias.

[Legacy Jira: Ben Manes on Nov 03 2019]

mikemccand commented 5 years ago

~ben.manes I believe this would work. Regarding Caffeine, I commented on the other issue.

[Legacy Jira: Adrien Grand (@jpountz) on Nov 07 2019]

mikemccand commented 5 years ago

If you want to give it a try, I'd be happy to review it.

[Legacy Jira: Adrien Grand (@jpountz) on Nov 07 2019]

mikemccand commented 4 years ago

Commit 2d98a619dd96abfa716098bbc5cac2bc3a1b7d49 in lucene-solr's branch refs/heads/master from Atri Sharma https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=2d98a61

LUCENE-8213: Asynchronous Caching in LRUQueryCache (#916)

LUCENE-8213: Introduce Asynchronous Caching in LRUQueryCache

[Legacy Jira: ASF subversion and git services on Nov 26 2019]

mikemccand commented 4 years ago

Commit 2d98a619dd96abfa716098bbc5cac2bc3a1b7d49 in lucene-solr's branch refs/heads/master from Atri Sharma https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=2d98a61

LUCENE-8213: Asynchronous Caching in LRUQueryCache (#916)

LUCENE-8213: Introduce Asynchronous Caching in LRUQueryCache

[Legacy Jira: ASF subversion and git services on Nov 26 2019]

mikemccand commented 4 years ago

Commit 09cd82a6fc51b90275ed326812e1d947c863952e in lucene-solr's branch refs/heads/revert-916-revised_LUCENE-8213 from Atri Sharma https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=09cd82a

Revert "LUCENE-8213: Asynchronous Caching in LRUQueryCache (#916)"

This reverts commit 2d98a619dd96abfa716098bbc5cac2bc3a1b7d49.

[Legacy Jira: ASF subversion and git services on Nov 28 2019]

mikemccand commented 4 years ago

Commit 165f6b276c021b3328e3a37dc125f666ece458d4 in lucene-solr's branch refs/heads/master from Atri Sharma https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=165f6b2

Revert "LUCENE-8213: Asynchronous Caching in LRUQueryCache (#916)" (#1044)

This reverts commit 2d98a619dd96abfa716098bbc5cac2bc3a1b7d49.

[Legacy Jira: ASF subversion and git services on Nov 28 2019]

mikemccand / stargazers-migration-test

Cache costly subqueries asynchronously [LUCENE-8213] #213

Legacy Jira details