mikemccand opened this issue 6 years ago
This problem applies to most queries actually. Creating cache entries asynchronously to prevent query caching from hurting latency is an interesting idea.
[Legacy Jira: Adrien Grand (@jpountz) on Mar 17 2018]
@jpountz, maxCostFactor can be used to decide whether to cache synchronously or asynchronously. Its default value can then be lowered from 100 to something close to 1.
WDYT?
[Legacy Jira: Amir Hadadi on Mar 18 2018]
It causes a latency hit regardless: using a threadpool doesn't matter. I don't think we should introduce additional threadpools because servers using lucene already use FAR too many threads, sorry.
If the user passes an executor to IndexSearcher, they've already said they don't have much throughput and they'd rather trade it for latency. So in that case I think it's fine to use that threadpool for this.
[Legacy Jira: Robert Muir (@rmuir) on Mar 18 2018]
"It causes a latency hit regardless"
@rmuir I will clarify the issue. Assume you have these queries:
q1: term query that matches 1 doc
q2: range query that matches 10M docs
And you query for "q1 AND q2"
q2 would be an IndexOrDocValuesQuery and would use doc values to do a range check on the single doc that matches q1, so the BKD tree won't be scanned for 10M documents.
However, when q2 gets cached, the bit set for the entire 10M docs is cached, which causes "q1 AND q2" to suddenly spike in latency.
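To make the cost asymmetry concrete, here is a minimal, Lucene-free model of the two execution paths (the class and method names are illustrative, not Lucene's API):

```java
// Illustrative sketch: the doc-values path only inspects docs produced by the
// lead iterator, while the index path scans every doc to build a bit set.
import java.util.List;

public class LeadCostSketch {
    // "Doc values" path: verify the range only for docs matching the lead query.
    static int verifyByDocValues(List<Integer> leadDocs, int[] values, int lo, int hi) {
        int matches = 0;
        for (int doc : leadDocs) {
            if (values[doc] >= lo && values[doc] <= hi) matches++;
        }
        return matches; // cost ~ |leadDocs|, i.e. 1 in the q1 AND q2 example
    }

    // "Index" path: scan everything up front, as happens when q2 gets cached.
    static int scanWholeRange(int[] values, int lo, int hi) {
        int matches = 0;
        for (int value : values) {
            if (value >= lo && value <= hi) matches++;
        }
        return matches; // cost ~ all docs matching the range (10M in the example)
    }
}
```

Caching q2 forces the second path regardless of the lead cost, which is exactly the latency spike described above.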
[Legacy Jira: Amir Hadadi on Mar 18 2018]
Your description sounds like maybe a different problem, where IndexOrDocValuesQuery chooses the wrong execution path to fill the "filter" bitset? If it uses doc values instead of the BKD tree because it must cache a large range, that definitely seems bad.
[Legacy Jira: Robert Muir (@rmuir) on Mar 19 2018]
@rmuir the issue is that the execution path for (q1 AND q2) depends on whether q2 gets cached or not. When q2 is not cached, doc values are used to execute q2 and only the single document matching q1 is evaluated against the range. When q2 gets cached, it is cached as a query that stands by itself, i.e. not in the context of (q1 AND q2), so the entire 10M documents that q2 matches are scanned in the BKD tree and cached to a bit set. To protect against the caching of q2 making the latency of (q1 AND q2) too high, @jpountz added maxCostFactor. This factor checks whether the cost of caching q2 exceeds the cost of evaluating (q1 AND q2) by more than maxCostFactor. This is the relevant code from LRUQueryCache:
```java
double costFactor = (double) inSupplier.cost() / leadCost;
if (costFactor >= maxCostFactor) {
  // too costly, caching might make the query much slower
  return inSupplier.get(leadCost);
}
```
My suggestion is to always evaluate (q1 AND q2) using the optimal execution path and cache q2 asynchronously. A refinement is to cache q2 synchronously if the cost of caching it is not too high.
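The refinement above can be sketched as a small gate that reuses the maxCostFactor ratio to pick between synchronous and deferred caching (the class and method names here are made up for illustration, not LRUQueryCache's API):

```java
// Hedged sketch of the sync-vs-async caching decision: cache inline only when
// building the cache entry is cheap relative to the lead iterator's cost.
public class CacheGate {
    static boolean shouldCacheSync(long cachingCost, long leadCost, double maxCostFactor) {
        double costFactor = (double) cachingCost / leadCost;
        // Below the threshold the caller pays the caching cost inline;
        // above it, caching would be handed off to a background task.
        return costFactor < maxCostFactor;
    }
}
```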
[Legacy Jira: Amir Hadadi on Mar 19 2018]
I didn't remember pushing this maxCostFactor prototype I had been working on; indeed, I had pushed it inadvertently with the changes on LUCENE-8008. It has one issue that I didn't like: you can't use the bulk scorer for caching anymore, which matters for disjunctions.
Regarding this issue, maybe one way to make it use the executor from IndexSearcher would be to add an executor parameter to QueryCache.doCache so that IndexSearcher can pass its executor to the cache.
[Legacy Jira: Adrien Grand (@jpountz) on Mar 20 2018]
@jpountz Currently if an executor is used in IndexSearcher, it means that segments should be queried in parallel.
I guess you wouldn't want to conflate parallel segments querying and async caching.
[Legacy Jira: Amir Hadadi on Mar 20 2018]
Actually I think we want this: aren't both parallel segment querying and async caching about trading throughput for latency?
[Legacy Jira: Adrien Grand (@jpountz) on Mar 21 2018]
Indeed both are about trading throughput for latency.
However, there is a quantitative difference:
- In parallel segment querying, you would slice your index into e.g. 5 slices on each and every query.
- Async caching would happen only when caching is needed, and even then only when the ratio between the caching cost and the leading query cost is big enough to justify async execution.

I would expect the additional async tasks triggered by async caching to be 100x fewer than parallel segment querying tasks.
Coupling these features together would mean that if someone is not willing to pay the overhead of parallel segment querying, they will not be able to use async caching.
[Legacy Jira: Amir Hadadi on Mar 21 2018]
You can still create an IndexSearcher with a single slice and/or wrap the cache to use a custom threadpool instead of the one from IndexSearcher, this is fine. I'd just like to make common use-cases, such as trading throughput for latency, simple.
[Legacy Jira: Adrien Grand (@jpountz) on Mar 21 2018]
I raised a PR for this. The performance numbers from the newly enhanced luceneutil for wikimedium10m are:
Latencies:
| Task | P50 Base | P50 Cmp | Pct Diff | P90 Base | P90 Cmp | Pct Diff | P99 Base | P99 Cmp | Pct Diff | P999 Base | P999 Cmp | Pct Diff | P100 Base | P100 Cmp | Pct Diff |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ('Wildcard', None) | 2.045201 | 2.089539 | 2.16790427933 | 18.845334 | 35.346911 | 87.5631973411 | 83.02344 | 48.300884 | -41.8225937157 | 83.02344 | 48.300884 | -41.8225937157 | 249.902876 | 87.512667 | -64.9813285862 |
| ('HighTermDayOfYearSort', 'DayOfYear') | 4.295828 | 4.727759 | 10.0546623375 | 9.037488 | 55.639159 | 515.648496573 | 82.149576 | 81.261365 | -1.08121191033 | 82.149576 | 81.261365 | -1.08121191033 | 86.642014 | 168.84768 | 94.8796804285 |
| ('MedSloppyPhrase', None) | 9.18549 | 8.683321 | -5.46698107559 | 29.233836 | 30.984274 | 5.98771232075 | 34.303039 | 35.978633 | 4.88468091705 | 34.303039 | 35.978633 | 4.88468091705 | 181.426025 | 261.742214 | 44.2693869306 |
| ('OrHighHigh', None) | 20.997779 | 16.938239 | -19.3331875719 | 26.989668 | 29.711731 | 10.0855742279 | 71.1345 | 72.914457 | 2.50224152837 | 71.1345 | 72.914457 | 2.50224152837 | 288.85441 | 203.02949 | -29.7121723016 |
| ('MedPhrase', None) | 6.935508 | 6.676061 | -3.74085070625 | 8.834132 | 7.366097 | -16.6177616545 | 61.645788 | 59.423887 | -3.60430302229 | 61.645788 | 59.423887 | -3.60430302229 | 65.592528 | 63.493249 | -3.20048496987 |
| ('LowSpanNear', None) | 23.256239 | 23.17936 | -0.330573658105 | 33.890598 | 34.205568 | 0.929372801271 | 34.958863 | 34.857876 | -0.288873811485 | 34.958863 | 34.857876 | -0.288873811485 | 96.937787 | 121.889403 | 25.7398242442 |
| ('Fuzzy2', None) | 25.45292 | 25.25128 | -0.792207730979 | 79.376572 | 106.649481 | 34.3588899254 | 108.933154 | 122.051216 | 12.0423044026 | 108.933154 | 122.051216 | 12.0423044026 | 212.373308 | 209.138442 | -1.52319800942 |
| ('OrNotHighHigh', None) | 1.903331 | 2.16024 | 13.4978624317 | 4.890325 | 4.723459 | -3.4121658581 | 102.556452 | 102.641448 | 0.0828772820651 | 102.556452 | 102.641448 | 0.0828772820651 | 226.783706 | 308.709148 | 36.1249242483 |
| ('OrHighNotLow', None) | 1.434646 | 1.52378 | 6.21296124619 | 3.905319 | 4.569729 | 17.0129507986 | 6.321682 | 7.281513 | 15.1831585328 | 6.321682 | 7.281513 | 15.1831585328 | 7.720665 | 15.035781 | 94.7472270847 |
| ('BrowseMonthSSDVFacets', None) | 93.940495 | 93.939183 | -0.00139662879145 | 102.50354 | 98.604983 | -3.80333888956 | 103.572854 | 106.785928 | 3.10223564951 | 103.572854 | 106.785928 | 3.10223564951 | 283.457123 | 244.054099 | -13.9008762888 |
| ('Fuzzy1', None) | 26.559456 | 29.050383 | 9.37868230434 | 159.424881 | 171.063113 | 7.30013529068 | 339.7673 | 179.733118 | -47.1011136151 | 339.7673 | 179.733118 | -47.1011136151 | 417.349072 | 395.168736 | -5.31457657105 |
| ('HighSloppyPhrase', None) | 9.489382 | 9.980939 | 5.18007389733 | 14.424659 | 15.315198 | 6.17372653315 | 37.046395 | 31.348423 | -15.380638251 | 37.046395 | 31.348423 | -15.380638251 | 51.797966 | 33.660774 | -35.0152590934 |
| ('OrNotHighMed', None) | 1.605631 | 1.549948 | -3.46798236955 | 16.030506 | 11.175798 | -30.2841844169 | 63.933462 | 63.33348 | -0.938447537848 | 63.933462 | 63.33348 | -0.938447537848 | 176.946354 | 141.426012 | -20.0740739761 |
| ('LowPhrase', None) | 5.294719 | 5.398776 | 1.96529787511 | 14.581045 | 14.201875 | -2.6004309019 | 28.226324 | 17.160324 | -39.2045382884 | 28.226324 | 17.160324 | -39.2045382884 | 170.879755 | 141.494236 | -17.1966064675 |
| ('AndHighLow', None) | 1.407978 | 1.356986 | -3.62164749733 | 4.452746 | 6.676942 | 49.9511088214 | 69.005964 | 69.618355 | 0.887446482162 | 69.005964 | 69.618355 | 0.887446482162 | 302.690572 | 288.210249 | -4.7838698458 |
| ('LowTerm', None) | 0.94373 | 0.984236 | 4.29211744885 | 1.776916 | 1.95572 | 10.0626028467 | 2.130259 | 2.967897 | 39.3209464201 | 2.130259 | 2.967897 | 39.3209464201 | 3.471183 | 11.055948 | 218.50663016 |
| ('BrowseDayOfYearSSDVFacets', None) | 126.336025 | 125.101777 | -0.976956493605 | 135.381181 | 137.867944 | 1.8368601763 | 144.330923 | 144.977097 | 0.447703088547 | 144.330923 | 144.977097 | 0.447703088547 | 295.783063 | 302.394898 | 2.23536632995 |
| ('AndHighHigh', None) | 5.089657 | 5.20707 | 2.30689415809 | 6.822905 | 8.306661 | 21.7466900096 | 7.038109 | 9.130519 | 29.7297185934 | 7.038109 | 9.130519 | 29.7297185934 | 111.532214 | 36.808683 | -66.9972632302 |
| ('HighTermMonthSort', 'Month') | 2.42148 | 2.464948 | 1.79510051704 | 4.411412 | 4.012971 | -9.03205141574 | 4.993632 | 4.667476 | -6.53143844 | 4.993632 | 4.667476 | -6.53143844 | 98.675367 | 94.641052 | -4.08847225265 |
| Respell | 17.339473 | 16.495659 | -4.86643394525 | 202.898846 | 190.452875 | -6.13407678031 | 232.699615 | 206.63496 | -11.2009875908 | 232.699615 | 206.63496 | -11.2009875908 | 259.083639 | 261.411887 | 0.89864725113 |
| ('MedTerm', None) | 1.044861 | 1.046421 | 0.149302155981 | 10.095498 | 2.086543 | -79.3319457842 | 31.02937 | 4.311224 | -86.1059892611 | 31.02937 | 4.311224 | -86.1059892611 | 37.048313 | 30.992555 | -16.3455701748 |
| ('HighIntervalsOrdered', None) | 18.204528 | 18.50288 | 1.63888896213 | 28.38481 | 28.186101 | -0.70005400776 | 28.553312 | 30.147999 | 5.58494580243 | 28.553312 | 30.147999 | 5.58494580243 | 85.526627 | 248.898567 | 191.018804004 |
| ('HighSpanNear', None) | 19.317909 | 18.448229 | -4.50193651911 | 26.886654 | 27.011485 | 0.464286110127 | 27.066426 | 28.519568 | 5.36879896888 | 27.066426 | 28.519568 | 5.36879896888 | 27.949757 | 28.532982 | 2.08669077159 |
| ('LowSloppyPhrase', None) | 17.995202 | 17.321666 | -3.74286434795 | 27.136218 | 23.990793 | -11.5912431128 | 27.177319 | 25.467183 | -6.29251178161 | 27.177319 | 25.467183 | -6.29251178161 | 37.041781 | 26.387058 | -28.7640677968 |
| ('OrHighNotHigh', None) | 1.721541 | 1.683098 | -2.23305747583 | 2.334157 | 2.147467 | -7.99817664364 | 3.127609 | 3.205909 | 2.50350986968 | 3.127609 | 3.205909 | 2.50350986968 | 10.177133 | 5.218394 | -48.7243214764 |
| ('OrNotHighLow', None) | 1.467553 | 1.4717 | 0.282579232232 | 11.130311 | 9.27055 | -16.7089760565 | 12.92663 | 12.305485 | -4.80515803423 | 12.92663 | 12.305485 | -4.80515803423 | 41.801941 | 42.736573 | 2.23585789952 |
| ('MedSpanNear', None) | 8.050703 | 8.0656 | 0.18503974125 | 12.37128 | 11.972162 | -3.22616576458 | 13.276762 | 13.410559 | 1.00775324586 | 13.276762 | 13.410559 | 1.00775324586 | 389.340697 | 524.654134 | 34.7545062827 |
| ('BrowseDateTaxoFacets', None) | 429.062505 | 428.859358 | -0.0473467146704 | 436.499827 | 441.680746 | 1.18692349447 | 458.955056 | 443.442175 | -3.38004360061 | 458.955056 | 443.442175 | -3.38004360061 | 516.629721 | 518.247598 | 0.313159877227 |
| ('HighPhrase', None) | 4.346807 | 4.173837 | -3.9792426947 | 5.762266 | 6.035056 | 4.7340751017 | 7.09179 | 8.179865 | 15.3427413953 | 7.09179 | 8.179865 | 15.3427413953 | 25.682734 | 26.872466 | 4.63241958586 |
| ('OrHighLow', None) | 2.360377 | 2.373989 | 0.576687537626 | 2.920517 | 4.037781 | 38.2556923997 | 4.226431 | 4.511819 | 6.75245851642 | 4.226431 | 4.511819 | 6.75245851642 | 48.928035 | 13.15068 | -73.1224031376 |
| ('BrowseDayOfYearTaxoFacets', None) | 0.168513 | 0.182987 | 8.58924830725 | 0.204691 | 0.247465 | 20.8968640536 | 15.455504 | 15.315246 | -0.907495478633 | 15.455504 | 15.315246 | -0.907495478633 | 16.866213 | 19.937509 | 18.2097546142 |
| ('OrHighMed', None) | 3.574777 | 3.231059 | -9.6150892769 | 24.188598 | 4.638494 | -80.8236343421 | 46.693976 | 8.113618 | -82.6238442406 | 46.693976 | 8.113618 | -82.6238442406 | 269.409786 | 23.431229 | -91.3027550529 |
| ('IntNRQ', None) | 4.011057 | 4.923845 | 22.7567945307 | 13.346714 | 12.29044 | -7.91411279211 | 85.073004 | 67.905295 | -20.1799727208 | 85.073004 | 67.905295 | -20.1799727208 | 191.744926 | 194.853114 | 1.62100143396 |
| ('BrowseMonthTaxoFacets', None) | 0.170601 | 0.172844 | 1.31476368837 | 0.212197 | 0.260394 | 22.7133277096 | 0.227793 | 4.692225 | 1959.86356034 | 0.227793 | 4.692225 | 1959.86356034 | 10.00639 | 9.966375 | -0.399894467435 |
| ('AndHighMed', None) | 3.367039 | 3.095126 | -8.07573063454 | 6.445771 | 4.890087 | -24.1349560821 | 6.646047 | 6.78425 | 2.07947671751 | 6.646047 | 6.78425 | 2.07947671751 | 42.892862 | 96.998233 | 126.140734092 |
| ('OrHighNotMed', None) | 1.75091 | 1.851148 | 5.72490876173 | 3.059898 | 3.291761 | 7.57747480472 | 5.304103 | 4.755972 | -10.3340941909 | 5.304103 | 4.755972 | -10.3340941909 | 21.934264 | 16.361938 | -25.4046636805 |
| ('Prefix3', None) | 15.061956 | 14.031055 | -6.84440321031 | 28.636916 | 25.283538 | -11.7099830163 | 425.912794 | 443.836014 | 4.20819009255 | 425.912794 | 443.836014 | 4.20819009255 | 497.136127 | 551.753833 | 10.9864689033 |
| PKLookup | 11.334723 | 10.857995 | -4.20590781089 | 89.284901 | 32.062856 | -64.0892741764 | 334.811782 | 73.551516 | -78.0319809654 | 334.811782 | 73.551516 | -78.0319809654 | 363.062965 | 418.731587 | 15.3330489107 |
| ('HighTerm', None) | 1.148459 | 1.212344 | 5.56267137094 | 1.54 | 1.915795 | 24.4022727273 | 2.138094 | 3.134936 | 46.622926775 | 2.138094 | 3.134936 | 46.622926775 | 2.722231 | 153.54749 | 5540.50185308 |
[Legacy Jira: Atri Sharma (@atris) on Aug 05 2019]
Commit 0dfbf557bae67acaf9c920cb51a04b7c7341d621 in lucene-solr's branch refs/heads/master from Atri Sharma https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=0dfbf55
LUCENE-8213: Introduce Asynchronous Caching in LRUQueryCache (#815)
Introduce asynchronous caching for heavy queries using the passed in Executor to IndexSearcher.
[Legacy Jira: ASF subversion and git services on Sep 28 2019]
Hi @atris,
This commit creates some CI errors:
ant test -Dtestcase=TestLRUQueryCache -Dtests.method=testLRUEviction -Dtests.seed=DC3A333680E821EB -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=en-US -Dtests.timezone=Africa/Luanda -Dtests.asserts=true -Dtests.file.encoding=US-ASCII
[Legacy Jira: Ignacio Vera (@iverase) on Sep 30 2019]
Taking a look – although I am unable to reproduce with a simple ant test.
Can you point me to the CI link so that I can dive deeper into the error output?
[Legacy Jira: Atri Sharma (@atris) on Sep 30 2019]
Here is the failure:
https://elasticsearch-ci.elastic.co/job/apache+lucene-solr+master/5526/console
I had a look and it is a race condition. We check whether the query has been cached just after executing it, but when caching is done asynchronously the entry may not be there yet.
[Legacy Jira: Ignacio Vera (@iverase) on Sep 30 2019]
Interesting – I did not realise that testLRUEviction could also cause LRUQueryCache to cache asynchronously, hence did not update it to handle that (in the manner testLRUConcurrentLoadAndEviction does).
I have pushed a test fix now – beasted the test 50 times with the seed you provided, and also beasted the entire TestLRUQueryCache suite 20 times with the seed. Ran the entire Lucene test suite – came in clean.
It is curious to note that I could not reproduce the test failure without the seed even after running multiple times – kudos to the CI!
[Legacy Jira: Atri Sharma (@atris) on Sep 30 2019]
Kudos to randomized testing philosophy!
[Legacy Jira: David Smiley (@dsmiley) on Oct 01 2019]
Hi @atris,
We have another error in CI that I think it is related to this issue. You can reproduce it by running the following command:
ant beast -Dbeast.iters=10 -Dtests.dups=6 -Dtestcase=TestLatLonShape -Dtests.method=testLUCENE8736* -Dtests.seed=D784C9A46B62FB84 -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=brx -Dtests.timezone=America/Guatemala -Dtests.asserts=true -Dtests.file.encoding=US-ASCII
[Legacy Jira: Ignacio Vera (@iverase) on Oct 01 2019]
Opened https://github.com/apache/lucene-solr/pull/912
Curious – looks like the query being cached is insanely heavy – it takes more than 200ms to cache!
[Legacy Jira: Atri Sharma (@atris) on Oct 01 2019]
@ivera, Regarding the other failure in TestLRUQueryCache, I investigated and it is a definite case of async caching in action – the cache's cache count will not be updated until the async loading completes, before which the test cannot check the value in a deterministic manner. The solution is to retry in case the cache count does not match the target. Making that change led to a clean beast of TestLRUQueryCache 50 times. I will push the change shortly.
[Legacy Jira: Atri Sharma (@atris) on Oct 01 2019]
Wait @atris
I don't think this is a proper solution, as it means that depending on when you query you will get a different result. I think it is wrong, as queries should always return the same hits regardless of the cache.
[Legacy Jira: Ignacio Vera (@iverase) on Oct 01 2019]
@ivera, I am not referring to the LatLonShape failure – I agree with you that the test should be fine even with async caching since it directly checks the result count.
However, for testMinSegmentSizePredicate, the failing assertion is on the cache's cache count – which is directly dependent on the race between the call and the completion of the async load (as the count is incremented only when the async load completes).
Hence, for testMinSegmentSizePredicate, the correct fix would be to retry the check a few times before marking a test failure. WDYT?
[Legacy Jira: Atri Sharma (@atris) on Oct 01 2019]
The inFlightAsyncLoadQueries logic doesn't look right. It seems like it would prevent caching on one segment if another segment is already caching the same query?
Also we shouldn't rely on Thread#sleep in tests, can we get rid of them?
[Legacy Jira: Adrien Grand (@jpountz) on Oct 01 2019]
@jpountz Good catch. My understanding was that the cache was always per segment, so just using the query as a cache key should be fine. Thanks for highlighting this issue.
Should I revert this commit?
Would using the incoming LeafReaderContext's top level ord + the query's key as the key in inFlightAsyncLoadQueries suffice to ensure that segments get their independent space in the cache?
RE: Thread#sleep, I agree that it is not a pretty pattern. Maybe replacing it with a CountDownLatch would work?
[Legacy Jira: Atri Sharma (@atris) on Oct 01 2019]
Attached is a patch which reproduces the issue @jpountz mentioned in a unit test. Using a pair of the query and the segment's ID object as the key of the inFlightAsyncLoadQueries set makes the test pass.
0001-Reproduce-across-segment-caching-of-same-query.patch
[Legacy Jira: Atri Sharma (@atris) on Oct 01 2019]
Should I revert this commit?
It would probably make things easier indeed, this way we can take the time to think about how to address the various issues.
Would using the incoming LeafReaderContext's top level ord + the query's key as the key in inFlightAsyncLoadQueries suffice to ensure that segments get their independent space in the cache?
I don't think the ord is enough, as the query cache can be shared across multiple readers. This is what Elasticsearch does, for instance: the cache is shared across all shards that are allocated on the same node. This issue is not specific to asynchronous caching; it could already happen before that the same query gets cached twice concurrently, which is wasteful. Maybe we should ignore this issue for now and work on addressing it as a follow-up.
RE: Thread#sleep, I agree that it is not a pretty pattern. Maybe replacing a CountDownLatch will work?
Anything that gives a signal when the caching is done instead of busy waiting would be better. One easy way to do this might be to shutdown the pool and call awaitTermination?
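The shutdown-and-await suggestion can be sketched with plain java.util.concurrent (the class and counter below are illustrative stand-ins for the test's cache load, not LRUQueryCache code):

```java
// Sketch: instead of Thread#sleep, shut the pool down and wait for the
// in-flight async caching task to finish before asserting on cache state.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class AwaitCaching {
    static int runAndAwait() throws InterruptedException {
        AtomicInteger cacheCount = new AtomicInteger();
        ExecutorService pool = Executors.newSingleThreadExecutor();
        pool.execute(cacheCount::incrementAndGet); // stands in for the async cache load
        pool.shutdown();                           // no new tasks; queued tasks still run
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return cacheCount.get();                   // deterministic: the load has completed
    }
}
```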
[Legacy Jira: Adrien Grand (@jpountz) on Oct 02 2019]
Hi @jpountz
It would probably make things easier indeed, this way we can take the time to think about how to address the various issues.
Sure, happy to revert. I managed to create a standalone unit test which reproduces this case, though, and posted a patch last night. I am in the process of raising a PR with the fix and the test – would you want to take a look at that, or should I revert right away?
I don't think the ord is enough as the query cache can be shared across multiple readers
One way I hacked around this was to use the LeafReaderContext's ID field plus the query as the unique identifier for the cache. This was inspired by the way CompositeReaderContext uses its ID field. Does this sound like a viable approach?
[Legacy Jira: Atri Sharma (@atris) on Oct 02 2019]
I'd rather revert to stop test failures as soon as possible. I'll take time to look at the PR later today.
Ideally, we'd use the core cache key of the reader? See LeafReader#getCoreCacheHelper.
[Legacy Jira: Adrien Grand (@jpountz) on Oct 02 2019]
Commit cd09808f6b54b33e350a0739dccb602cb270abad in lucene-solr's branch refs/heads/revert-815-LUCENE-8213 from Atri Sharma https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=cd09808
Revert "LUCENE-8213: Introduce Asynchronous Caching in LRUQueryCache (#815)"
This reverts commit 0dfbf557bae67acaf9c920cb51a04b7c7341d621.
[Legacy Jira: ASF subversion and git services on Oct 02 2019]
Commit 302cd09b4ce7bd3049f8480287b3dd03bb838b0d in lucene-solr's branch refs/heads/master from Atri Sharma https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=302cd09
Revert "LUCENE-8213: Introduce Asynchronous Caching in LRUQueryCache (#815)" (#914)
This reverts commit 0dfbf557bae67acaf9c920cb51a04b7c7341d621.
[Legacy Jira: ASF subversion and git services on Oct 02 2019]
@jpountz Reverted.
I will use CacheHelper's core cache key + the query cache key as the inFlightAsyncMap's key.
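A composite key along those lines could look like the following sketch (the class name and field types are illustrative; in Lucene the core key would come from LeafReader#getCoreCacheHelper().getKey()):

```java
// Sketch of an in-flight key combining the reader's core cache key with the
// query, so the same query caching on different segments does not collide.
import java.util.Objects;

public class InFlightKey {
    private final Object coreCacheKey; // per-segment identity
    private final Object query;        // the query being cached

    InFlightKey(Object coreCacheKey, Object query) {
        this.coreCacheKey = coreCacheKey;
        this.query = query;
    }

    @Override public boolean equals(Object o) {
        if (!(o instanceof InFlightKey)) return false;
        InFlightKey other = (InFlightKey) o;
        return coreCacheKey.equals(other.coreCacheKey) && query.equals(other.query);
    }

    @Override public int hashCode() {
        return Objects.hash(coreCacheKey, query);
    }
}
```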
[Legacy Jira: Atri Sharma (@atris) on Oct 02 2019]
Can we do this as a separate change that applies to both the concurrent and the serial case?
[Legacy Jira: Adrien Grand (@jpountz) on Oct 02 2019]
@jpountz I am not sure I follow – the inFlightAsyncQueries map will only be involved in the concurrent load case? I am not sure I understand your thoughts behind fixing the serial case.
[Legacy Jira: Atri Sharma (@atris) on Oct 02 2019]
My thinking is that this is an attempt to avoid being wasteful by caching the same query twice concurrently. However, this may occur not only when IndexSearcher is configured with an executor: it also happens when IndexSearcher is configured without an executor and two concurrent requests try to cache the same query. The cache simply ignores this issue today and we would generate a cache entry twice. So, to make things easier to review, I was suggesting ignoring this problem for the executor case as well, and coming back to it later with a fix that works for all cases.
[Legacy Jira: Adrien Grand (@jpountz) on Oct 02 2019]
@jpountz Just noticed your latest comment – I raised a PR (https://github.com/apache/lucene-solr/pull/916) for this change – it includes the logic to use CacheKey + Query as the map's key as well (sorry about that).
Please let me know if this logic looks fine – I will raise another PR to use that combination as the key for the serial case.
[Legacy Jira: Atri Sharma (@atris) on Oct 02 2019]
FWIW: I just attached thetaphi_Lucene-Solr-master-Linux_24839.log.txt, a jenkins log that shows two other failures that seem to be related to this issue...
Checking out Revision 3c399bb696073bdb30f278309410a50effabd0e7 (refs/remotes/origin/master)
...
[junit4] 2> NOTE: reproduce with: ant test -Dtestcase=TestLatLonDocValuesQueries -Dtests.method=testAllLatEqual -Dtests.seed=8D56B48917EDB35F -Dtests.multiplier=3 -Dtests.slow=true -Dtests.locale=zh-Hans -Dtests.timezone=America/Bahia -Dtests.asserts=true -Dtests.file.encoding=US-ASCII
[junit4] FAILURE 2.67s J0 | TestLatLonDocValuesQueries.testAllLatEqual <<<
[junit4] > Throwable #1: java.lang.AssertionError: wrong hit (first of possibly more):
[junit4] > FAIL: id=235 should match but did not
[junit4] > query=point:polygons([[0.0, 1.401298464324817E-45] [35.0, 1.401298464324817E-45] [35.0, 180.0] [0.0, 180.0] [0.0, 1.401298464324817E-45] ]) docID=227
[junit4] > lat=0.0 lon=41.82071185670793
[junit4] > deleted?=false polygon=[0.0, 1.401298464324817E-45] [35.0, 1.401298464324817E-45] [35.0, 180.0] [0.0, 180.0] [0.0, 1.401298464324817E-45]
...
[junit4] 2> NOTE: reproduce with: ant test -Dtestcase=TestFieldCacheSortRandom -Dtests.method=testRandomStringSort -Dtests.seed=F7E42E6905E37945 -Dtests.multiplier=3 -Dtests.slow=true -Dtests.locale=khq-ML -Dtests.timezone=America/Chihuahua -Dtests.asserts=true -Dtests.file.encoding=UTF-8
[junit4] FAILURE 0.15s J1 | TestFieldCacheSortRandom.testRandomStringSort <<<
[junit4] > Throwable #1: java.lang.AssertionError: expected:<[65 76 65 64 6d 68 76 68 75 77 61 71 64 63 65 63 7a 79 77 77 63 69 71 62 76 70 7a 62 6d 66 75 67 61]> but was:<[66 6c 75 6e 78 63 67 66 63 77 7a 6b 69 6d 7a 77 62 6c 71 61 79 61 67 71 6f 69 66 71 64 6a 66]>
both of those seeds seem to fail reliably against GIT:3c399bb6960 (the version checked out by the jenkins run) but both seem to start passing reliably as of GIT:302cd09b4ce (when this issue was reverted)
I should point out, however, in case it's helpful: when the failures reproduce, the specifics of the failures are always different – suggesting that parallel thread execution is affecting the results, since the randomization of the index data should be deterministic based on the seed.
[Legacy Jira: Chris M. Hostetter on Oct 02 2019]
Thanks @hossman
Interestingly, all but one of the test failures come from LatLon queries – is there anything special about them?
[Legacy Jira: Atri Sharma (@atris) on Oct 03 2019]
I don't think these queries have anything special besides being a bit slow.
[Legacy Jira: Adrien Grand (@jpountz) on Oct 04 2019]
Sorry to distract. I am trying to understand LRUQueryCache, as it has a lot of excellent engineering invested in it. For asynchronous caching without duplicate loads, wouldn't a CompletableFuture as the cached value be suitable? The cache doesn't handle cache stampedes, so the same entry may be computed twice, with the first thread to insert as the winner and the other threads dropping their work. Ideally, by using memoization, the other threads would defer to the first and avoid the costly operation. Similarly, the cache lookup is skipped if the global lock is held, and the value is computed but not stored, which means a busy lock reduces performance across all usages.
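The memoization idea can be sketched with ConcurrentHashMap and CompletableFuture (a minimal illustration of the pattern, not LRUQueryCache code; the class and method names are made up):

```java
// Sketch of future-based memoization: concurrent callers asking for the same
// key share one in-flight computation instead of stampeding the loader.
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class MemoizingCache<K, V> {
    private final ConcurrentHashMap<K, CompletableFuture<V>> map = new ConcurrentHashMap<>();

    // Only the first caller for a key starts the loader; later callers
    // receive the same future and simply wait on it.
    CompletableFuture<V> get(K key, Function<K, V> loader) {
        return map.computeIfAbsent(key, k -> CompletableFuture.supplyAsync(() -> loader.apply(k)));
    }
}
```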
Would there be a benefit/liability in using a dedicated caching library like Caffeine? That does handle concurrency in a smoother fashion, which could be beneficial here. I am also curious if the hit ratios of LRU could be improved upon as search problems tend to have a strong frequency bias.
[Legacy Jira: Ben Manes on Nov 03 2019]
@ben.manes
I believe this would work. Regarding Caffeine, I commented on the other issue.
[Legacy Jira: Adrien Grand (@jpountz) on Nov 07 2019]
If you want to give it a try, I'd be happy to review it.
[Legacy Jira: Adrien Grand (@jpountz) on Nov 07 2019]
Commit 2d98a619dd96abfa716098bbc5cac2bc3a1b7d49 in lucene-solr's branch refs/heads/master from Atri Sharma https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=2d98a61
LUCENE-8213: Asynchronous Caching in LRUQueryCache (#916)
[Legacy Jira: ASF subversion and git services on Nov 26 2019]
Commit 09cd82a6fc51b90275ed326812e1d947c863952e in lucene-solr's branch refs/heads/revert-916-revised_LUCENE-8213 from Atri Sharma https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=09cd82a
Revert "LUCENE-8213: Asynchronous Caching in LRUQueryCache (#916)"
This reverts commit 2d98a619dd96abfa716098bbc5cac2bc3a1b7d49.
[Legacy Jira: ASF subversion and git services on Nov 28 2019]
Commit 165f6b276c021b3328e3a37dc125f666ece458d4 in lucene-solr's branch refs/heads/master from Atri Sharma https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=165f6b2
Revert "LUCENE-8213: Asynchronous Caching in LRUQueryCache (#916)" (#1044)
This reverts commit 2d98a619dd96abfa716098bbc5cac2bc3a1b7d49.
[Legacy Jira: ASF subversion and git services on Nov 28 2019]
IndexOrDocValuesQuery allows combining costly range queries with a selective lead iterator in an optimized way. However, at some point the range query gets cached by a querying thread in LRUQueryCache, which negates the optimization of IndexOrDocValuesQuery for that specific query.
It would be nice to see an asynchronous caching implementation in such cases, so that queries involving IndexOrDocValuesQuery would have consistent performance characteristics.
Legacy Jira details
LUCENE-8213 by Amir Hadadi on Mar 17 2018, updated Nov 28 2019
Attachments: 0001-Reproduce-across-segment-caching-of-same-query.patch, thetaphi_Lucene-Solr-master-Linux_24839.log.txt
Linked issues: