mikemccand / stargazers-migration-test

Testing Lucene's Jira -> GitHub issues migration
0 stars 0 forks source link

Cache costly subqueries asynchronously [LUCENE-8213] #213

Open mikemccand opened 6 years ago

mikemccand commented 6 years ago

IndexOrDocValuesQuery allows to combine costly range queries with a selective lead iterator in an optimized way. However, the range query at some point gets cached by a querying thread in LRUQueryCache, which negates the optimization of IndexOrDocValuesQuery for that specific query.

It would be nice to see an asynchronous caching implementation in such cases, so that queries involving IndexOrDocValuesQuery would have consistent performance characteristics.


Legacy Jira details

LUCENE-8213 by Amir Hadadi on Mar 17 2018, updated Nov 28 2019 Attachments: 0001-Reproduce-across-segment-caching-of-same-query.patch, thetaphi_Lucene-Solr-master-Linux_24839.log.txt Linked issues:

mikemccand commented 6 years ago

This problem applies to most queries actually. Creating cache entries asynchronously to prevent query caching from hurting latency is an interesting idea.

[Legacy Jira: Adrien Grand (@jpountz) on Mar 17 2018]

mikemccand commented 6 years ago

@jpountz, maxCostFactor can be used to decide whether to cache synchronously or asynchronously. Its default value can then be lowered from 100 to something close to 1.

WDYT?

[Legacy Jira: Amir Hadadi on Mar 18 2018]

mikemccand commented 6 years ago

It causes a latency hit regardless: using a threadpool doesnt matter. I don't think we should introduce additional threadpools because servers using lucene already use FAR too many threads, sorry.

If the user passes an executor to IndexSearcher, they've already said they don't have much throughput and they'd rather trade it for latency. So in that case i think its fine to use that threadpool for this.

[Legacy Jira: Robert Muir (@rmuir) on Mar 18 2018]

mikemccand commented 6 years ago

"It causes a latency hit regardless"

@rmuir  I will clarify the issue. Assume you have these queries:

q1: term query that matches 1 doc

q2: range query that matches 10M docs

And you query for "q1 AND q2"

q2 would be an IndexOrDocValuesQuery and would use doc values to do a range check on the single doc that matches q1, so the BKD tree won't be scanned for 10M documents.

However, when q2 gets cached, the bit set for the entire 10M docs is cached, which causes "q1 AND q2" to suddenly spike in latency.

 

[Legacy Jira: Amir Hadadi on Mar 18 2018]

mikemccand commented 6 years ago

Your description sounds like maybe a different problem, where IndexOrDocValuesQuery chooses the wrong execution path to fill the "filter" bitset? If it uses docvalues instead of a BKD tree, because it must cache a large range that definitely seems like its bad.

[Legacy Jira: Robert Muir (@rmuir) on Mar 19 2018]

mikemccand commented 6 years ago

@rcmuir the issue is that the execution path for (q1 AND q2) depends on whether q2 gets cached or not. When q2 does not get cached, doc values are used to execute q2 and only the single document matching q1 is evaluated against the range. When q2 gets cached, it gets cached as a query that stands by itself, i.e. not in the context of (q1 AND q2). So the entire 10M documents that q2 matches are scanned in the BKD tree and get cached to a bit set. To protect against the caching of q2 causing the latency of (q1 AND q2) to be too high, @jpountz  added maxCostFactor. This factor checks whether the cost of caching q2 is higher by more than maxCostFactor than the cost of evaluating (q1 AND q2). This is the relevant code from LRUQueryCache:

 

double costFactor = (double) inSupplier.cost() / leadCost;
if (costFactor >= maxCostFactor) {
  // too costly, caching might make the query much slower
  return inSupplier.get(leadCost);
}

 

My suggestion is to always evaluate (q1 AND q2) using the optimal execution path, and cache q2 asynchrounously. A refinement is to cache q2 synchronously if the cost of caching it is not too high.

[Legacy Jira: Amir Hadadi on Mar 19 2018]

mikemccand commented 6 years ago

I didn't remember pushing this maxCostFactor prototype I had been working on and indeed I had pushed it inadvertently with changes on LUCENE-8008. It has one issue that I didn't like which is that you can't use the bulk scorer for caching anymore, which matters for disjunctions.

Regarding this issue, maybe one way to make it use the executor from IndexSearcher would be to add an executor parameter to QueryCache.doCache so that IndexSearcher can pass its executor to the cache.

[Legacy Jira: Adrien Grand (@jpountz) on Mar 20 2018]

mikemccand commented 6 years ago

@jpountz  Currently if an executor is used in IndexSearcher, it means that segments should be queried in parallel.

I guess you wouldn't want to conflate parallel segments querying and async caching.

[Legacy Jira: Amir Hadadi on Mar 20 2018]

mikemccand commented 6 years ago

Actually I think we want this: both parallel segments querying and async caching are about trading throughput for latency?

[Legacy Jira: Adrien Grand (@jpountz) on Mar 21 2018]

mikemccand commented 6 years ago

Indeed both are about trading throughput for latency.

However, there is a quantitative difference:

in parallel segments querying you would slice your index to e.g. 5 slices on each and every query.

async caching would happen only when caching is needed, and even then, only when the ratio between the caching cost and the leading query cost is big enough to justify async execution.

I would expect the additional async tasks triggered by async caching to be 100x less than parallel segments querying tasks.

Coupling these features together would mean that if someone is not willing to pay the overhead of parallel segments querying, he will not be able to use async caching.

[Legacy Jira: Amir Hadadi on Mar 21 2018]

mikemccand commented 6 years ago

You can still create an IndexSearcher with a single slice and/or wrap the cache to use a custom threadpool instead of the one from IndexSearcher, this is fine. I'd just like to make common use-cases, such as trading throughput for latency, simple.

[Legacy Jira: Adrien Grand (@jpountz) on Mar 21 2018]

mikemccand commented 5 years ago

I raised a PR for the same. The performance number from newly enhanced luceneutil for wikimedium10m are:

 

Latencies:

Task ('Wildcard', None) P50 Base 2.045201 P50 Cmp 2.089539 Pct Diff 2.16790427933 P90 Base 18.845334 P90 Cmp 35.346911 Pct Diff 87.5631973411 P99 Base 83.02344 P99 Cmp 48.300884 Pct Diff -41.8225937157 P999 Base 83.02344 P999 Cmp 48.300884 Pct Diff -41.8225937157 P100 Base 249.902876 P100 Cmp 87.512667 Pct Diff -64.9813285862
Task ('HighTermDayOfYearSort', 'DayOfYear') P50 Base 4.295828 P50 Cmp 4.727759 Pct Diff 10.0546623375 P90 Base 9.037488 P90 Cmp 55.639159 Pct Diff 515.648496573 P99 Base 82.149576 P99 Cmp 81.261365 Pct Diff -1.08121191033 P999 Base 82.149576 P999 Cmp 81.261365 Pct Diff -1.08121191033 P100 Base 86.642014 P100 Cmp 168.84768 Pct Diff 94.8796804285
Task ('MedSloppyPhrase', None) P50 Base 9.18549 P50 Cmp 8.683321 Pct Diff -5.46698107559 P90 Base 29.233836 P90 Cmp 30.984274 Pct Diff 5.98771232075 P99 Base 34.303039 P99 Cmp 35.978633 Pct Diff 4.88468091705 P999 Base 34.303039 P999 Cmp 35.978633 Pct Diff 4.88468091705 P100 Base 181.426025 P100 Cmp 261.742214 Pct Diff 44.2693869306
Task ('OrHighHigh', None) P50 Base 20.997779 P50 Cmp 16.938239 Pct Diff -19.3331875719 P90 Base 26.989668 P90 Cmp 29.711731 Pct Diff 10.0855742279 P99 Base 71.1345 P99 Cmp 72.914457 Pct Diff 2.50224152837 P999 Base 71.1345 P999 Cmp 72.914457 Pct Diff 2.50224152837 P100 Base 288.85441 P100 Cmp 203.02949 Pct Diff -29.7121723016
Task ('MedPhrase', None) P50 Base 6.935508 P50 Cmp 6.676061 Pct Diff -3.74085070625 P90 Base 8.834132 P90 Cmp 7.366097 Pct Diff -16.6177616545 P99 Base 61.645788 P99 Cmp 59.423887 Pct Diff -3.60430302229 P999 Base 61.645788 P999 Cmp 59.423887 Pct Diff -3.60430302229 P100 Base 65.592528 P100 Cmp 63.493249 Pct Diff -3.20048496987
Task ('LowSpanNear', None) P50 Base 23.256239 P50 Cmp 23.17936 Pct Diff -0.330573658105 P90 Base 33.890598 P90 Cmp 34.205568 Pct Diff 0.929372801271 P99 Base 34.958863 P99 Cmp 34.857876 Pct Diff -0.288873811485 P999 Base 34.958863 P999 Cmp 34.857876 Pct Diff -0.288873811485 P100 Base 96.937787 P100 Cmp 121.889403 Pct Diff 25.7398242442
Task ('Fuzzy2', None) P50 Base 25.45292 P50 Cmp 25.25128 Pct Diff -0.792207730979 P90 Base 79.376572 P90 Cmp 106.649481 Pct Diff 34.3588899254 P99 Base 108.933154 P99 Cmp 122.051216 Pct Diff 12.0423044026 P999 Base 108.933154 P999 Cmp 122.051216 Pct Diff 12.0423044026 P100 Base 212.373308 P100 Cmp 209.138442 Pct Diff -1.52319800942
Task ('OrNotHighHigh', None) P50 Base 1.903331 P50 Cmp 2.16024 Pct Diff 13.4978624317 P90 Base 4.890325 P90 Cmp 4.723459 Pct Diff -3.4121658581 P99 Base 102.556452 P99 Cmp 102.641448 Pct Diff 0.0828772820651 P999 Base 102.556452 P999 Cmp 102.641448 Pct Diff 0.0828772820651 P100 Base 226.783706 P100 Cmp 308.709148 Pct Diff 36.1249242483
Task ('OrHighNotLow', None) P50 Base 1.434646 P50 Cmp 1.52378 Pct Diff 6.21296124619 P90 Base 3.905319 P90 Cmp 4.569729 Pct Diff 17.0129507986 P99 Base 6.321682 P99 Cmp 7.281513 Pct Diff 15.1831585328 P999 Base 6.321682 P999 Cmp 7.281513 Pct Diff 15.1831585328 P100 Base 7.720665 P100 Cmp 15.035781 Pct Diff 94.7472270847
Task ('BrowseMonthSSDVFacets', None) P50 Base 93.940495 P50 Cmp 93.939183 Pct Diff -0.00139662879145 P90 Base 102.50354 P90 Cmp 98.604983 Pct Diff -3.80333888956 P99 Base 103.572854 P99 Cmp 106.785928 Pct Diff 3.10223564951 P999 Base 103.572854 P999 Cmp 106.785928 Pct Diff 3.10223564951 P100 Base 283.457123 P100 Cmp 244.054099 Pct Diff -13.9008762888
Task ('Fuzzy1', None) P50 Base 26.559456 P50 Cmp 29.050383 Pct Diff 9.37868230434 P90 Base 159.424881 P90 Cmp 171.063113 Pct Diff 7.30013529068 P99 Base 339.7673 P99 Cmp 179.733118 Pct Diff -47.1011136151 P999 Base 339.7673 P999 Cmp 179.733118 Pct Diff -47.1011136151 P100 Base 417.349072 P100 Cmp 395.168736 Pct Diff -5.31457657105
Task ('HighSloppyPhrase', None) P50 Base 9.489382 P50 Cmp 9.980939 Pct Diff 5.18007389733 P90 Base 14.424659 P90 Cmp 15.315198 Pct Diff 6.17372653315 P99 Base 37.046395 P99 Cmp 31.348423 Pct Diff -15.380638251 P999 Base 37.046395 P999 Cmp 31.348423 Pct Diff -15.380638251 P100 Base 51.797966 P100 Cmp 33.660774 Pct Diff -35.0152590934
Task ('OrNotHighMed', None) P50 Base 1.605631 P50 Cmp 1.549948 Pct Diff -3.46798236955 P90 Base 16.030506 P90 Cmp 11.175798 Pct Diff -30.2841844169 P99 Base 63.933462 P99 Cmp 63.33348 Pct Diff -0.938447537848 P999 Base 63.933462 P999 Cmp 63.33348 Pct Diff -0.938447537848 P100 Base 176.946354 P100 Cmp 141.426012 Pct Diff -20.0740739761
Task ('LowPhrase', None) P50 Base 5.294719 P50 Cmp 5.398776 Pct Diff 1.96529787511 P90 Base 14.581045 P90 Cmp 14.201875 Pct Diff -2.6004309019 P99 Base 28.226324 P99 Cmp 17.160324 Pct Diff -39.2045382884 P999 Base 28.226324 P999 Cmp 17.160324 Pct Diff -39.2045382884 P100 Base 170.879755 P100 Cmp 141.494236 Pct Diff -17.1966064675
Task ('AndHighLow', None) P50 Base 1.407978 P50 Cmp 1.356986 Pct Diff -3.62164749733 P90 Base 4.452746 P90 Cmp 6.676942 Pct Diff 49.9511088214 P99 Base 69.005964 P99 Cmp 69.618355 Pct Diff 0.887446482162 P999 Base 69.005964 P999 Cmp 69.618355 Pct Diff 0.887446482162 P100 Base 302.690572 P100 Cmp 288.210249 Pct Diff -4.7838698458
Task ('LowTerm', None) P50 Base 0.94373 P50 Cmp 0.984236 Pct Diff 4.29211744885 P90 Base 1.776916 P90 Cmp 1.95572 Pct Diff 10.0626028467 P99 Base 2.130259 P99 Cmp 2.967897 Pct Diff 39.3209464201 P999 Base 2.130259 P999 Cmp 2.967897 Pct Diff 39.3209464201 P100 Base 3.471183 P100 Cmp 11.055948 Pct Diff 218.50663016
Task ('BrowseDayOfYearSSDVFacets', None) P50 Base 126.336025 P50 Cmp 125.101777 Pct Diff -0.976956493605 P90 Base 135.381181 P90 Cmp 137.867944 Pct Diff 1.8368601763 P99 Base 144.330923 P99 Cmp 144.977097 Pct Diff 0.447703088547 P999 Base 144.330923 P999 Cmp 144.977097 Pct Diff 0.447703088547 P100 Base 295.783063 P100 Cmp 302.394898 Pct Diff 2.23536632995
Task ('AndHighHigh', None) P50 Base 5.089657 P50 Cmp 5.20707 Pct Diff 2.30689415809 P90 Base 6.822905 P90 Cmp 8.306661 Pct Diff 21.7466900096 P99 Base 7.038109 P99 Cmp 9.130519 Pct Diff 29.7297185934 P999 Base 7.038109 P999 Cmp 9.130519 Pct Diff 29.7297185934 P100 Base 111.532214 P100 Cmp 36.808683 Pct Diff -66.9972632302
Task ('HighTermMonthSort', 'Month') P50 Base 2.42148 P50 Cmp 2.464948 Pct Diff 1.79510051704 P90 Base 4.411412 P90 Cmp 4.012971 Pct Diff -9.03205141574 P99 Base 4.993632 P99 Cmp 4.667476 Pct Diff -6.53143844 P999 Base 4.993632 P999 Cmp 4.667476 Pct Diff -6.53143844 P100 Base 98.675367 P100 Cmp 94.641052 Pct Diff -4.08847225265
Task Respell P50 Base 17.339473 P50 Cmp 16.495659 Pct Diff -4.86643394525 P90 Base 202.898846 P90 Cmp 190.452875 Pct Diff -6.13407678031 P99 Base 232.699615 P99 Cmp 206.63496 Pct Diff -11.2009875908 P999 Base 232.699615 P999 Cmp 206.63496 Pct Diff -11.2009875908 P100 Base 259.083639 P100 Cmp 261.411887 Pct Diff 0.89864725113
Task ('MedTerm', None) P50 Base 1.044861 P50 Cmp 1.046421 Pct Diff 0.149302155981 P90 Base 10.095498 P90 Cmp 2.086543 Pct Diff -79.3319457842 P99 Base 31.02937 P99 Cmp 4.311224 Pct Diff -86.1059892611 P999 Base 31.02937 P999 Cmp 4.311224 Pct Diff -86.1059892611 P100 Base 37.048313 P100 Cmp 30.992555 Pct Diff -16.3455701748
Task ('HighIntervalsOrdered', None) P50 Base 18.204528 P50 Cmp 18.50288 Pct Diff 1.63888896213 P90 Base 28.38481 P90 Cmp 28.186101 Pct Diff -0.70005400776 P99 Base 28.553312 P99 Cmp 30.147999 Pct Diff 5.58494580243 P999 Base 28.553312 P999 Cmp 30.147999 Pct Diff 5.58494580243 P100 Base 85.526627 P100 Cmp 248.898567 Pct Diff 191.018804004
Task ('HighSpanNear', None) P50 Base 19.317909 P50 Cmp 18.448229 Pct Diff -4.50193651911 P90 Base 26.886654 P90 Cmp 27.011485 Pct Diff 0.464286110127 P99 Base 27.066426 P99 Cmp 28.519568 Pct Diff 5.36879896888 P999 Base 27.066426 P999 Cmp 28.519568 Pct Diff 5.36879896888 P100 Base 27.949757 P100 Cmp 28.532982 Pct Diff 2.08669077159
Task ('LowSloppyPhrase', None) P50 Base 17.995202 P50 Cmp 17.321666 Pct Diff -3.74286434795 P90 Base 27.136218 P90 Cmp 23.990793 Pct Diff -11.5912431128 P99 Base 27.177319 P99 Cmp 25.467183 Pct Diff -6.29251178161 P999 Base 27.177319 P999 Cmp 25.467183 Pct Diff -6.29251178161 P100 Base 37.041781 P100 Cmp 26.387058 Pct Diff -28.7640677968
Task ('OrHighNotHigh', None) P50 Base 1.721541 P50 Cmp 1.683098 Pct Diff -2.23305747583 P90 Base 2.334157 P90 Cmp 2.147467 Pct Diff -7.99817664364 P99 Base 3.127609 P99 Cmp 3.205909 Pct Diff 2.50350986968 P999 Base 3.127609 P999 Cmp 3.205909 Pct Diff 2.50350986968 P100 Base 10.177133 P100 Cmp 5.218394 Pct Diff -48.7243214764
Task ('OrNotHighLow', None) P50 Base 1.467553 P50 Cmp 1.4717 Pct Diff 0.282579232232 P90 Base 11.130311 P90 Cmp 9.27055 Pct Diff -16.7089760565 P99 Base 12.92663 P99 Cmp 12.305485 Pct Diff -4.80515803423 P999 Base 12.92663 P999 Cmp 12.305485 Pct Diff -4.80515803423 P100 Base 41.801941 P100 Cmp 42.736573 Pct Diff 2.23585789952
Task ('MedSpanNear', None) P50 Base 8.050703 P50 Cmp 8.0656 Pct Diff 0.18503974125 P90 Base 12.37128 P90 Cmp 11.972162 Pct Diff -3.22616576458 P99 Base 13.276762 P99 Cmp 13.410559 Pct Diff 1.00775324586 P999 Base 13.276762 P999 Cmp 13.410559 Pct Diff 1.00775324586 P100 Base 389.340697 P100 Cmp 524.654134 Pct Diff 34.7545062827
Task ('BrowseDateTaxoFacets', None) P50 Base 429.062505 P50 Cmp 428.859358 Pct Diff -0.0473467146704 P90 Base 436.499827 P90 Cmp 441.680746 Pct Diff 1.18692349447 P99 Base 458.955056 P99 Cmp 443.442175 Pct Diff -3.38004360061 P999 Base 458.955056 P999 Cmp 443.442175 Pct Diff -3.38004360061 P100 Base 516.629721 P100 Cmp 518.247598 Pct Diff 0.313159877227
Task ('HighPhrase', None) P50 Base 4.346807 P50 Cmp 4.173837 Pct Diff -3.9792426947 P90 Base 5.762266 P90 Cmp 6.035056 Pct Diff 4.7340751017 P99 Base 7.09179 P99 Cmp 8.179865 Pct Diff 15.3427413953 P999 Base 7.09179 P999 Cmp 8.179865 Pct Diff 15.3427413953 P100 Base 25.682734 P100 Cmp 26.872466 Pct Diff 4.63241958586
Task ('OrHighLow', None) P50 Base 2.360377 P50 Cmp 2.373989 Pct Diff 0.576687537626 P90 Base 2.920517 P90 Cmp 4.037781 Pct Diff 38.2556923997 P99 Base 4.226431 P99 Cmp 4.511819 Pct Diff 6.75245851642 P999 Base 4.226431 P999 Cmp 4.511819 Pct Diff 6.75245851642 P100 Base 48.928035 P100 Cmp 13.15068 Pct Diff -73.1224031376
Task ('BrowseDayOfYearTaxoFacets', None) P50 Base 0.168513 P50 Cmp 0.182987 Pct Diff 8.58924830725 P90 Base 0.204691 P90 Cmp 0.247465 Pct Diff 20.8968640536 P99 Base 15.455504 P99 Cmp 15.315246 Pct Diff -0.907495478633 P999 Base 15.455504 P999 Cmp 15.315246 Pct Diff -0.907495478633 P100 Base 16.866213 P100 Cmp 19.937509 Pct Diff 18.2097546142
Task ('OrHighMed', None) P50 Base 3.574777 P50 Cmp 3.231059 Pct Diff -9.6150892769 P90 Base 24.188598 P90 Cmp 4.638494 Pct Diff -80.8236343421 P99 Base 46.693976 P99 Cmp 8.113618 Pct Diff -82.6238442406 P999 Base 46.693976 P999 Cmp 8.113618 Pct Diff -82.6238442406 P100 Base 269.409786 P100 Cmp 23.431229 Pct Diff -91.3027550529
Task ('IntNRQ', None) P50 Base 4.011057 P50 Cmp 4.923845 Pct Diff 22.7567945307 P90 Base 13.346714 P90 Cmp 12.29044 Pct Diff -7.91411279211 P99 Base 85.073004 P99 Cmp 67.905295 Pct Diff -20.1799727208 P999 Base 85.073004 P999 Cmp 67.905295 Pct Diff -20.1799727208 P100 Base 191.744926 P100 Cmp 194.853114 Pct Diff 1.62100143396
Task ('BrowseMonthTaxoFacets', None) P50 Base 0.170601 P50 Cmp 0.172844 Pct Diff 1.31476368837 P90 Base 0.212197 P90 Cmp 0.260394 Pct Diff 22.7133277096 P99 Base 0.227793 P99 Cmp 4.692225 Pct Diff 1959.86356034 P999 Base 0.227793 P999 Cmp 4.692225 Pct Diff 1959.86356034 P100 Base 10.00639 P100 Cmp 9.966375 Pct Diff -0.399894467435
Task ('AndHighMed', None) P50 Base 3.367039 P50 Cmp 3.095126 Pct Diff -8.07573063454 P90 Base 6.445771 P90 Cmp 4.890087 Pct Diff -24.1349560821 P99 Base 6.646047 P99 Cmp 6.78425 Pct Diff 2.07947671751 P999 Base 6.646047 P999 Cmp 6.78425 Pct Diff 2.07947671751 P100 Base 42.892862 P100 Cmp 96.998233 Pct Diff 126.140734092
Task ('OrHighNotMed', None) P50 Base 1.75091 P50 Cmp 1.851148 Pct Diff 5.72490876173 P90 Base 3.059898 P90 Cmp 3.291761 Pct Diff 7.57747480472 P99 Base 5.304103 P99 Cmp 4.755972 Pct Diff -10.3340941909 P999 Base 5.304103 P999 Cmp 4.755972 Pct Diff -10.3340941909 P100 Base 21.934264 P100 Cmp 16.361938 Pct Diff -25.4046636805
Task ('Prefix3', None) P50 Base 15.061956 P50 Cmp 14.031055 Pct Diff -6.84440321031 P90 Base 28.636916 P90 Cmp 25.283538 Pct Diff -11.7099830163 P99 Base 425.912794 P99 Cmp 443.836014 Pct Diff 4.20819009255 P999 Base 425.912794 P999 Cmp 443.836014 Pct Diff 4.20819009255 P100 Base 497.136127 P100 Cmp 551.753833 Pct Diff 10.9864689033
Task PKLookup P50 Base 11.334723 P50 Cmp 10.857995 Pct Diff -4.20590781089 P90 Base 89.284901 P90 Cmp 32.062856 Pct Diff -64.0892741764 P99 Base 334.811782 P99 Cmp 73.551516 Pct Diff -78.0319809654 P999 Base 334.811782 P999 Cmp 73.551516 Pct Diff -78.0319809654 P100 Base 363.062965 P100 Cmp 418.731587 Pct Diff 15.3330489107
Task ('HighTerm', None) P50 Base 1.148459 P50 Cmp 1.212344 Pct Diff 5.56267137094 P90 Base 1.54 P90 Cmp 1.915795 Pct Diff 24.4022727273 P99 Base 2.138094 P99 Cmp 3.134936 Pct Diff 46.622926775 P999 Base 2.138094 P999 Cmp 3.134936 Pct Diff 46.622926775 P100 Base 2.722231 P100 Cmp 153.54749 Pct Diff 5540.50185308

 

[Legacy Jira: Atri Sharma (@atris) on Aug 05 2019]

mikemccand commented 5 years ago

Commit 0dfbf557bae67acaf9c920cb51a04b7c7341d621 in lucene-solr's branch refs/heads/master from Atri Sharma https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=0dfbf55

LUCENE-8213: Introduce Asynchronous Caching in LRUQueryCache (#815)

Introduce asynchronous caching for heavy queries using the passed in Executor to IndexSearcher.

[Legacy Jira: ASF subversion and git services on Sep 28 2019]

mikemccand commented 5 years ago

Hi @atris,

This commit creates some CI errors:

ant test  -Dtestcase=TestLRUQueryCache -Dtests.method=testLRUEviction -Dtests.seed=DC3A333680E821EB -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=en-US -Dtests.timezone=Africa/Luanda -Dtests.asserts=true -Dtests.file.encoding=US-ASCII

[Legacy Jira: Ignacio Vera (@iverase) on Sep 30 2019]

mikemccand commented 5 years ago

Taking a look – although I am unable to reproduce with a simple ant test.

Can you point me to the CI link so that I can dive deeper into the error output?

[Legacy Jira: Atri Sharma (@atris) on Sep 30 2019]

mikemccand commented 5 years ago

Here is the failure:

https://elasticsearch-ci.elastic.co/job/apache+lucene-solr+master/5526/console

I had a look and it is a race condition. We are checking if the query has been cached just after executing a query but if doing it async, it might happen that the query is still not there.

[Legacy Jira: Ignacio Vera (@iverase) on Sep 30 2019]

mikemccand commented 5 years ago

Interesting – I did not realise that testLRUEviction could also cause LRUQueryCache to cache asynchronously, hence did not update it to handle the same (in the manner testLRUConcurrentLoadAndEviction does).

 

I have pushed a test fix now – beasted the test 50 times with the seed you provided, and also beasted the entire TestLRUQueryCache suite 20 times with the seed. Ran the entire Lucene test suite – came in clean.

 

It is curious to note that I could not reproduce the test failure without the seed even after running multiple times – kudos to the CI!

[Legacy Jira: Atri Sharma (@atris) on Sep 30 2019]

mikemccand commented 5 years ago

Kudos to randomized testing philosophy!

[Legacy Jira: David Smiley (@dsmiley) on Oct 01 2019]

mikemccand commented 5 years ago

Hi @atris,

We have another error in CI that I think it is related to this issue. You can reproduce it by running the following command:

ant beast -Dbeast.iters=10 -Dtests.dups=6  -Dtestcase=TestLatLonShape -Dtests.method=testLUCENE8736* -Dtests.seed=D784C9A46B62FB84 -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=brx -Dtests.timezone=America/Guatemala -Dtests.asserts=true -Dtests.file.encoding=US-ASCII

[Legacy Jira: Ignacio Vera (@iverase) on Oct 01 2019]

mikemccand commented 5 years ago

Opened https://github.com/apache/lucene-solr/pull/912

 

Curious –  looks like the query being cached is insanely heavy – it takes more than 200ms to cache!

[Legacy Jira: Atri Sharma (@atris) on Oct 01 2019]

mikemccand commented 5 years ago

@ivera , Regarding the other failure in TestLRUQueryCache, I investigated and that is a definite case of async caching in action – the cache's cache count will not be updated until the async loading completes, before which the test cannot check the value in a deterministic manner. The solution is to retry in case the cache count does not match the target. Making that change lead to a clean beast of TestLRUQueryCache 50 times. I will push the change shortly.

[Legacy Jira: Atri Sharma (@atris) on Oct 01 2019]

mikemccand commented 5 years ago

Wait @atris  

 

I don't think this is a proper solution as it means that depending when you query you will get a different result. I think is wrong as queries should always return the same hits regardless of the cache.

[Legacy Jira: Ignacio Vera (@iverase) on Oct 01 2019]

mikemccand commented 5 years ago

@ivera, I am not referring to the LatLonShape failure –  I agree with you that the test should be fine even with async caching since it directly checks the result count.

 

However, for testMinSegmentSizePredicate, the failing assertion is on the cache's cache count – which is directly dependent on the race between the call and the completion of the async load (as the count is incremented only when the async load completes).

 

Hence, for testMinSegmentSizePredicate, the correct fix would be to retry the check a few times before marking a test failure. WDYT?

[Legacy Jira: Atri Sharma (@atris) on Oct 01 2019]

mikemccand commented 5 years ago

The inFlightAsyncLoadQueries logic doesn't look right. It seems like it would prevent caching on one segment if another segment is already caching the same query?

Also we shouldn't rely on Thread#sleep in tests, can we get rid of them?

[Legacy Jira: Adrien Grand (@jpountz) on Oct 01 2019]

mikemccand commented 5 years ago

@jpountz Good catch, My understanding was that the cache was always per segment, so just using the query as a cache key should be fine. Thanks for highlighting this issue.

Should I revert this commit?

 

Would using the incoming LeafReaderContext's top level ord + the query's key as the key in inFlightAsyncLoadQueries suffice to ensure that segments get their independent space in the cache?

 

RE: Thread#sleep, I agree that it is not a pretty pattern. Maybe replacing a CountDownLatch will work?

[Legacy Jira: Atri Sharma (@atris) on Oct 01 2019]

mikemccand commented 5 years ago

Attached is a patch which reproduces the issue @jpountz   mentioned in a unit test. Using a Pair of Query + Segment's ID object as the Set (inFlightAsyncLoadQueries)'s key makes the test pass.

 

0001-Reproduce-across-segment-caching-of-same-query.patch

[Legacy Jira: Atri Sharma (@atris) on Oct 01 2019]

mikemccand commented 5 years ago

Should I revert this commit?

It would probably make things easier indeed, this way we can take the time to think about how to address the various issues.

Would using the incoming LeafReaderContext's top level ord + the query's key as the key in inFlightAsyncLoadQueries suffice to ensure that segments get their independent space in the cache?

I don't think the ord is enough as the query cache can be shared across multiple readers. This is what Elasticsearch does for instance, the cache is shared across all shards that are allocated on the same node. This issue is not specific to asynchronous caching, it could already happen before that the same query gets cached twice concurrently, which is wasteful. Maybe we should ignore this issue for now and work on addressing it as a follow-up.

RE: Thread#sleep, I agree that it is not a pretty pattern. Maybe replacing a CountDownLatch will work?

Anything that gives a signal when the caching is done instead of busy waiting would be better. One easy way to do this might be to shutdown the pool and call awaitTermination?

[Legacy Jira: Adrien Grand (@jpountz) on Oct 02 2019]

mikemccand commented 5 years ago

Hi @jpountz

It would probably make things easier indeed, this way we can take the time to think about how to address the various issues.

Sure, happy to revert. I managed to create a standalone unit test which reproduces this case, though, and posted a patch last night. I am in process of raising a PR with the fix and the test – would you want to take a look at that, or should I revert right away?

I don't think the ord is enough as the query cache can be shared across multiple readers

One way I hacked around this was to use the LeafReaderContext's ID field + the query as the unique identifier for the cache. This was inspired by the way CompositeReaderContext uses its ID field. Does this sounds a viable approach?

 

[Legacy Jira: Atri Sharma (@atris) on Oct 02 2019]

mikemccand commented 5 years ago

I'd rather revert to stop test failures as soon as possible. I'll take time to look at the PR later today.

Ideally, we'd use the core cache key of the reader? See LeafReader#getCoreCacheHelper.

[Legacy Jira: Adrien Grand (@jpountz) on Oct 02 2019]

mikemccand commented 5 years ago

Commit cd09808f6b54b33e350a0739dccb602cb270abad in lucene-solr's branch refs/heads/revert-815-LUCENE-8213 from Atri Sharma https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=cd09808

Revert "LUCENE-8213: Introduce Asynchronous Caching in LRUQueryCache (#815)"

This reverts commit 0dfbf557bae67acaf9c920cb51a04b7c7341d621.

[Legacy Jira: ASF subversion and git services on Oct 02 2019]

mikemccand commented 5 years ago

Commit 302cd09b4ce7bd3049f8480287b3dd03bb838b0d in lucene-solr's branch refs/heads/master from Atri Sharma https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=302cd09

Revert "LUCENE-8213: Introduce Asynchronous Caching in LRUQueryCache (#815)" (#914)

This reverts commit 0dfbf557bae67acaf9c920cb51a04b7c7341d621.

[Legacy Jira: ASF subversion and git services on Oct 02 2019]

mikemccand commented 5 years ago

@jpountz Reverted.

 

I will use CacheHelper's core cache key + the query cache key as the inFlightAsyncMap's key

[Legacy Jira: Atri Sharma (@atris) on Oct 02 2019]

mikemccand commented 5 years ago

Can we do this as a separate change that applies to both the concurrent and the serial case?

[Legacy Jira: Adrien Grand (@jpountz) on Oct 02 2019]

mikemccand commented 5 years ago

@jpountz I am not sure I follow – inFlightAsyncQueries map will only be involved in the concurrent load case? I am not sure I get your thoughts behind fixing the serial case.

[Legacy Jira: Atri Sharma (@atris) on Oct 02 2019]

mikemccand commented 5 years ago

My thinking is that this is an attempt to avoid being wasteful by caching twice the same query concurrently. However this may not only occur when IndexSearcher is configured with an executor. This also happens when IndexSearcher is configured without an executor and two concurrent requests try to cache the same query. The cache simply ignores this issue today and we would generate a cache entry twice. So to make things easier to review I was suggesting ignoring this problem as well for the case that IndexSearcher is configured with an executor, and coming back later to this problem with a fix that works for all cases.

[Legacy Jira: Adrien Grand (@jpountz) on Oct 02 2019]

mikemccand commented 5 years ago

@jpountz Just noticed your latest comment – I raised a PR (https://github.com/apache/lucene-solr/pull/916) for this change – it includes the logic to use CacheKey + Query as the map's key as well (sorry about that).

Please let me know if this logic looks fine – I will raise another PR to use that combination as the key for the serial case.

[Legacy Jira: Atri Sharma (@atris) on Oct 02 2019]

mikemccand commented 5 years ago

FWIW: I just attached thetaphi_Lucene-Solr-master-Linux_24839.log.txt, a jenkins log that shows two other failures that seem to be related to this issue...

Checking out Revision 3c399bb696073bdb30f278309410a50effabd0e7 (refs/remotes/origin/master)
...
   [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestLatLonDocValuesQueries -Dtests.method=testAllLatEqual -Dtests.seed=8D56B48917EDB35F -Dtests.multiplier=3 -Dtests.slow=true -Dtests.locale=zh-Hans -Dtests.timezone=America/Bahia -Dtests.asserts=true -Dtests.file.encoding=US-ASCII
   [junit4] FAILURE 2.67s J0 | TestLatLonDocValuesQueries.testAllLatEqual <<<
   [junit4]    > Throwable #1: java.lang.AssertionError: wrong hit (first of possibly more):
   [junit4]    > FAIL: id=235 should match but did not
   [junit4]    >   query=point:polygons([[0.0, 1.401298464324817E-45] [35.0, 1.401298464324817E-45] [35.0, 180.0] [0.0, 180.0] [0.0, 1.401298464324817E-45] ]) docID=227
   [junit4]    >   lat=0.0 lon=41.82071185670793
   [junit4]    >   deleted?=false  polygon=[0.0, 1.401298464324817E-45] [35.0, 1.401298464324817E-45] [35.0, 180.0] [0.0, 180.0] [0.0, 1.401298464324817E-45] 
...
   [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestFieldCacheSortRandom -Dtests.method=testRandomStringSort -Dtests.seed=F7E42E6905E37945 -Dtests.multiplier=3 -Dtests.slow=true -Dtests.locale=khq-ML -Dtests.timezone=America/Chihuahua -Dtests.asserts=true -Dtests.file.encoding=UTF-8
   [junit4] FAILURE 0.15s J1 | TestFieldCacheSortRandom.testRandomStringSort <<<
   [junit4]    > Throwable #1: java.lang.AssertionError: expected:<[65 76 65 64 6d 68 76 68 75 77 61 71 64 63 65 63 7a 79 77 77 63 69 71 62 76 70 7a 62 6d 66 75 67 61]> but was:<[66 6c 75 6e 78 63 67 66 63 77 7a 6b 69 6d 7a 77 62 6c 71 61 79 61 67 71 6f 69 66 71 64 6a 66]>

both of those seeds seem to fail reliably against GIT:3c399bb6960 (the version checked out by the jenkins run) but both seem to start passing reliably as of GIT:302cd09b4ce (when this issue was reverted)

I should point out however in case it's helpful: when the failures reproduce, the specifics of the failures are always different – suggesting that parallel thread execution is affecting the results, since the randomization of the index data should be deterministic based on the seed.

[Legacy Jira: Chris M. Hostetter on Oct 02 2019]

mikemccand commented 5 years ago

Thanks @hossman

Interestingly, all but one test failures are coming from LatLon queries – is there anything special about them?

[Legacy Jira: Atri Sharma (@atris) on Oct 03 2019]

mikemccand commented 5 years ago

I don't think these queries have anything special besides being a bit slow.

[Legacy Jira: Adrien Grand (@jpountz) on Oct 04 2019]

mikemccand commented 5 years ago

Sorry to distract. I am trying to understand LRUQueryCache as it has a lot of excellent engineering invested into it. For asynchronous caching without duplicate loads, wouldn't a CompletableFuture as the cached value be suitable? The cache doesn't handle the cache stampedes, so the same entry may be computed twice with the first to insert as the winner, and the other threads dropping their work. Ideally by using memoization the other threads would defer to the first and avoid the costly operation. Similarly the cache lookup is skipped if the global lock is held and the value computed, but not stored, which means a busy lock reduces performance across all usages.

Would there be a benefit/liability in using a dedicated caching library like Caffeine? That does handle concurrency in a smoother fashion, which could be beneficial here. I am also curious if the hit ratios of LRU could be improved upon as search problems tend to have a strong frequency bias.

[Legacy Jira: Ben Manes on Nov 03 2019]

mikemccand commented 5 years ago

~ben.manes I believe this would work. Regarding Caffeine, I commented on the other issue.

[Legacy Jira: Adrien Grand (@jpountz) on Nov 07 2019]

mikemccand commented 5 years ago

If you want to give it a try, I'd be happy to review it.

[Legacy Jira: Adrien Grand (@jpountz) on Nov 07 2019]

mikemccand commented 4 years ago

Commit 2d98a619dd96abfa716098bbc5cac2bc3a1b7d49 in lucene-solr's branch refs/heads/master from Atri Sharma https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=2d98a61

LUCENE-8213: Asynchronous Caching in LRUQueryCache (#916)

[Legacy Jira: ASF subversion and git services on Nov 26 2019]

mikemccand commented 4 years ago

Commit 2d98a619dd96abfa716098bbc5cac2bc3a1b7d49 in lucene-solr's branch refs/heads/master from Atri Sharma https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=2d98a61

LUCENE-8213: Asynchronous Caching in LRUQueryCache (#916)

[Legacy Jira: ASF subversion and git services on Nov 26 2019]

mikemccand commented 4 years ago

Commit 09cd82a6fc51b90275ed326812e1d947c863952e in lucene-solr's branch refs/heads/revert-916-revised_LUCENE-8213 from Atri Sharma https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=09cd82a

Revert "LUCENE-8213: Asynchronous Caching in LRUQueryCache (#916)"

This reverts commit 2d98a619dd96abfa716098bbc5cac2bc3a1b7d49.

[Legacy Jira: ASF subversion and git services on Nov 28 2019]

mikemccand commented 4 years ago

Commit 165f6b276c021b3328e3a37dc125f666ece458d4 in lucene-solr's branch refs/heads/master from Atri Sharma https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=165f6b2

Revert "LUCENE-8213: Asynchronous Caching in LRUQueryCache (#916)" (#1044)

This reverts commit 2d98a619dd96abfa716098bbc5cac2bc3a1b7d49.

[Legacy Jira: ASF subversion and git services on Nov 28 2019]