vespa-engine / vespa

AI + Data, online. https://vespa.ai

match-phase degradation taking a long time and returning more hits than coverage #31445

Open pfrybar opened 4 weeks ago

pfrybar commented 4 weeks ago

Describe the bug

In one of our production Vespa clusters, we are using match-phase degradation to speed up queries which match a large number of documents. Under certain scenarios (which are relatively common in our application), we see match-phase being triggered, but the query takes a long time and returns more hits than expected (given a max-hits value of 250 and 12 content nodes per group), and even more than reported in the coverage.documents response field. See below for an example:

[Screenshot 2024-06-05 at 11 34 22]

To Reproduce

I'm not sure how easy this would be to reproduce. If requested, I can try to provide an example.

Expected behavior

I expect match-phase degradation to be faster and return fewer documents. It works for some queries, as shown below:

[Screenshot 2024-06-05 at 11 35 45]

Environment: Self-hosted

Vespa version 8.323.45

jobergum commented 3 weeks ago

Hey,

Could you share what type of queries these are? Range search, terms, vector search, etc.? Capped range search has better guarantees, but filters then become post-filters (see the practical search performance guide).

pfrybar commented 3 weeks ago

👋 sure!

The queries causing this issue are relatively simple attribute matching - specifically, we query for multiple terms in a fieldset of attribute fields of type array<string> with attribute: fast-search. For example:

        field name type array<string> {
            indexing: summary | attribute
            attribute: fast-search
        }
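
For context, the fieldset we query looks roughly like this (the field and fieldset names here are placeholders, not our real schema):

        fieldset attributes {
            # placeholder names for the array<string> attribute fields above
            fields: tags, categories
        }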

Our match-phase attribute is defined as:

        field score type float {
            indexing: summary | attribute
            attribute: fast-search
            rank: filter
        }

Match-phase looks like:

        match-phase {
            attribute: score
            order: descending
            max-hits: 250
        }

We use a single-phase ranking function which is decently expensive - a single content node can rank ~50k documents in 100ms (single-threaded). This is the primary reason for attempting to use match-phase degradation.

Unfortunately, the actual queries we need to run have a lot of filters, so capped range search wouldn't be feasible. We're currently experimenting with two-phase ranking as a workaround: first-phase ranking on the score attribute, then our more expensive ranking in the second phase, limited via rerank-count.
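
A rough sketch of what we're testing - the profile name, the rerank-count value, and the second-phase expression are placeholders for our real setup:

        rank-profile two_phase_workaround {
            first-phase {
                # cheap: order candidates by the same score attribute
                expression: attribute(score)
            }
            second-phase {
                # placeholder - in reality this is our expensive ranking expression
                expression: attribute(score)
                rerank-count: 1000
            }
        }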

jobergum commented 3 weeks ago

Thanks! It sounds like, if ranking is the main cost, phased ranking is one alternative with better guarantees.

> Under certain scenarios

So I'm assuming that the same type of request (same query value, same filter) consistently produces the same behaviour, with a higher hit count than you expect? Or does the same type of request produce non-deterministic behaviour?

pfrybar commented 3 weeks ago

Phased ranking might be a viable option, but we would still love to use match-phase since it should be less computationally expensive for the majority of our queries (which have a low number of hits) 🙂

When we first encountered this issue, we saw what seemed to be non-deterministic behaviour, where the exact same query would produce two wildly different results, as shown below.

Expected match-phase behaviour:

Match-phase which times out:

After doing some investigation, we could trace the queries which time out to specific content groups - whenever the query hit a specific set of groups, it would time out as in the second image.

In an effort to fix the issue, we have since upgraded the cluster (and separately restarted the content nodes), and it has switched entirely to the second behaviour - all queries take a long time and time out. By increasing the query timeout, we were able to get a full set of results (as shown in the original message). This looked strange to us, since the total number of matched documents (fields.totalCount) was higher than the number of documents covered (coverage.documents).

baldersheim commented 3 weeks ago

I will look into this and see if this is expected behavior, and what can be done to improve it.

  1. How many documents do you have per node, and in total per group?
  2. Are you using more than 1 thread per query? If so, how many?
  3. Can you run the same query with the built-in unranked rank-profile - first with &ranking=unranked, and then with &ranking=unranked&ranking.matching.termwiseLimit=0.01 (see the example below)? I need timing, totalCount and coverage information for both.
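
For example, something like this against the /search/ endpoint, with your actual YQL in place of the placeholder:

  /search/?yql=<your query>&ranking=unranked
  /search/?yql=<your query>&ranking=unranked&ranking.matching.termwiseLimit=0.01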

pfrybar commented 3 weeks ago

Thanks!

I noticed that the issue is intermittent again - some content groups are behaving as expected (fast response, fewer matched documents) while others are slow for the exact same query. I've included both results below as well.

  1. ~19 million documents per node, 230 million in total.
  2. Single thread per query.

Our rank-profile, expected:

  "timing": {
    "querytime": 0.011,
    "summaryfetchtime": 0.001,
    "searchtime": 0.014
  },
  "root": {
    "id": "toplevel",
    "relevance": 1,
    "fields": {
      "totalCount": 6914
    },
    "coverage": {
      "coverage": 0,
      "documents": 287684,
      "degraded": {
        "match-phase": true,
        "timeout": false,
        "adaptive-timeout": false,
        "non-ideal-state": false
      },
      "full": false,
      "nodes": 12,
      "results": 1,
      "resultsFull": 0
    },

Our rank-profile, slow:

  "timing": {
    "querytime": 0.9420000000000001,
    "summaryfetchtime": 0.001,
    "searchtime": 0.9440000000000001
  },
  "root": {
    "id": "toplevel",
    "relevance": 1,
    "fields": {
      "totalCount": 779933
    },
    "coverage": {
      "coverage": 0,
      "documents": 355139,
      "degraded": {
        "match-phase": true,
        "timeout": false,
        "adaptive-timeout": false,
        "non-ideal-state": false
      },
      "full": false,
      "nodes": 12,
      "results": 1,
      "resultsFull": 0
    },

ranking=unranked:

  "timing": {
    "querytime": 0.015,
    "summaryfetchtime": 0,
    "searchtime": 0.016
  },
  "root": {
    "id": "toplevel",
    "relevance": 1,
    "fields": {
      "totalCount": 3062770
    },
    "coverage": {
      "coverage": 100,
      "documents": 232863029,
      "full": true,
      "nodes": 12,
      "results": 1,
      "resultsFull": 1
    },

ranking=unranked&ranking.matching.termwiseLimit=0.01:

  "timing": {
    "querytime": 0.011,
    "summaryfetchtime": 0,
    "searchtime": 0.012
  },
  "root": {
    "id": "toplevel",
    "relevance": 1,
    "fields": {
      "totalCount": 3062770
    },
    "coverage": {
      "coverage": 100,
      "documents": 232863029,
      "full": true,
      "nodes": 12,
      "results": 1,
      "resultsFull": 1
    },

baldersheim commented 2 weeks ago

Thanks, this is what I need. The issue is that the hit ratio you have (3062770/232863029, roughly 1.3%) is outside the range the heuristics have been tested for. Based on the numbers here, the heuristic is way off, which indicates that the model it is based on misses something vital.

This has no simple solution and will require a deep dive with extensive benchmarking to get this right, so no immediate resolution expected.

You can try experimenting with match-phase max-hits; I suspect it will change the sweet spot. Try increasing it to 1000 and 10000, and also reducing it to 100, and see what happens. That will also give some indication of what is wrong with the heuristics.
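
One way to run that experiment (just a sketch - the profile names are arbitrary, and it assumes match-phase is configured inside your rank-profile) is to define variants that differ only in max-hits and select them per query with the ranking parameter:

        rank-profile degrade_100 inherits default {
            match-phase {
                attribute: score
                order: descending
                max-hits: 100
            }
        }

        rank-profile degrade_1000 inherits default {
            match-phase {
                attribute: score
                order: descending
                max-hits: 1000
            }
        }

Then query with &ranking=degrade_100, &ranking=degrade_1000, and so on.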

pfrybar commented 2 weeks ago

Thanks for looking into this.

We've tried adjusting max-hits, and reducing it further seems to help, but we also need to account for the drop in result quality, so I'm not sure that's an option.

We see the issue happening on two clusters, and the biggest commonality is a large number of documents per content node. Do you have any general guidelines regarding the maximum number of documents per node? In our most extreme cluster we have 100M docs per node (and we see the same behaviour there as well). I assume the match-phase algorithm (and associated heuristics) runs on each content node, so would increasing the number of nodes (and decreasing the docs per node) be a potential solution?

baldersheim commented 2 weeks ago

No - it might offset the issue, but it is no real solution. The inherent problem here is finding the crossover point where the two graphs meet. Simplified, it is to solve ax + b = c(1 - x) + d, with x being the hit rate. The heuristics have been tuned for the case where the query is expensive to evaluate, not so much for when ranking is expensive - that has normally been handled with a cheap first-phase and an expensive second-phase. But based on the numbers here, I see that it can work well for you in general. As I mentioned, though, the heuristics need a full overhaul; I think they are based on experiments done around 2016-17.
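
For reference, solving that simplified equation for the crossover hit rate x (where a, b, c and d are cost-model constants for the two evaluation strategies - their exact meaning in the heuristics is not spelled out here):

    ax + b = c(1 - x) + d
    x(a + c) = c + d - b
    x = (c + d - b) / (a + c)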

pfrybar commented 2 weeks ago

👍 got it, thanks for the explanation.

One interesting consideration for our use case is that we know which queries are expensive and should be limited. It would be great if there were a way to "force" match-phase to trigger - if I'm understanding correctly, that would essentially bypass the heuristics and give us more customisable (and predictable) behaviour.

We have a solution using two-phase ranking, but since these queries have a high hit count (as you've seen), even the cheap first-phase ranking ends up being expensive.