opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/

Add field for warnings (and maybe debug/info?) to SearchResponse #6794

Open msfroh opened 1 year ago

msfroh commented 1 year ago

**Is your feature request related to a problem? Please describe.**
When calling out to an external reranker, it's possible that the network call fails (because networks). For our plugin that calls out to the AWS Kendra Intelligent Ranking service, we swallow and log any exceptions and just return results in the original order. While the server-side application log captures what went wrong, the client who made the search request has no way of knowing that they're receiving degraded results.

More broadly, this will be a problem for any search pipeline processor that makes an external call and wants to "fail open", as called out in a comment by @lukas-vlcek on https://github.com/opensearch-project/search-processor/issues/80.

**Describe the solution you'd like**
The best solution I've come up with so far is for the search response to include an extra "warnings" property. This could be an array of strings (or objects?).

Ideally, any warnings could communicate what failed and why, so that a human observing these warnings might be able to take corrective action. For example, a call to an external service may have failed because credentials are no longer valid, or a call quota was exceeded.

For now, I'm leaning toward human-readable string warnings, since I haven't thought of a reason to make them machine-friendly (though that may just be a lack of imagination on my part).
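
As a rough sketch, a degraded response might then look something like this (the `warnings` field name and its placement are just illustrative):

```json
{
  "took": 37,
  "timed_out": false,
  "hits": {
    "total": { "value": 2, "relation": "eq" },
    "max_score": 1.2,
    "hits": []
  },
  "warnings": [
    "Call to external reranker failed -- returning results in the original order"
  ]
}
```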

**Describe alternatives you've considered**
The status quo, where we log the warning to the application logs, "works", but it assumes that callers (who may not have access to those logs) don't need to know whether a specific response was degraded.

Another option would be to recommend always "failing closed". That is, if a call to an external reranker fails, then just throw an exception (with explanation) instead of returning results. That's bad from an availability standpoint.

We could also optionally fail closed (i.e. via a request parameter), so production traffic fails open, but a "canary" service periodically probes the search cluster asking for an exception if the external call fails. That only works if the canary calls fail the same way as the production traffic calls.
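
For illustration, the opt-in fail-closed behavior could be a request-level flag (the `fail_open` parameter name here is made up):

```
# Production traffic: swallow reranker failures and return degraded results
# (hypothetical parameter, shown for illustration only)
GET /my-index/_search?q=foo&fail_open=true

# Canary traffic: surface reranker failures as errors
GET /my-index/_search?q=foo&fail_open=false
```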

**Additional context**
As a stretch goal, I think we could generalize this to include messages at other "log levels". In particular, it has always bothered me that OpenSearch and its predecessor don't have a debug URL parameter for search requests, the way Solr has had practically forever.

Maybe this search response property could have a form like:

"additional_info": {
  "warn": [
    {
      "source": "org.opensearch.search.foo.RankingWhatever",
      "message": "Call to https://rankingservice.example.com/ failed -- invalid credentials"
    }
  ],
  "debug": [
    {
      "source": "org.opensearch.search.query.QueryPhase",
      "message": "Query object sent to Lucene was: '+foo:bar (field1:val1 field2:val2 field3:val3)~2'"
    },
    {
      "source": "org.opensearch.search.query.QueryPhase",
      "message": "Time spent in searcher.search was 53.25ms"
    }
  ]
}
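
For comparison, Solr's debug=true returns a debug section alongside the results, roughly of this shape (abbreviated; exact fields vary by version):

```json
"debug": {
  "rawquerystring": "foo bar",
  "parsedquery": "+(text:foo text:bar)",
  "timing": {
    "time": 12.0,
    "process": {
      "time": 8.0,
      "query": { "time": 6.0 },
      "debug": { "time": 1.0 }
    }
  }
}
```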
dblock commented 1 year ago

Rather than additional info, I propose that the ?explain=true response contain all the information about which rerankers were called, which ones succeeded or failed, and how long each took. This should work for your canary idea.
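
One hypothetical shape for that report, attached to the response when ?explain=true is set (all field names here are invented):

```json
"processor_results": [
  {
    "processor": "kendra_intelligent_ranking",
    "status": "failed",
    "took_ms": 1203,
    "error": "invalid credentials"
  }
]
```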

> For our plugin that calls out to AWS Kendra Intelligent Ranking service, we swallow and log any exceptions, and just return results in the original order.
>
> Another option would be to recommend always "failing closed". That is, if a call to an external reranker fails, then just throw an exception (with explanation) instead of returning results. That's bad from an availability standpoint.

I feel pretty strongly that failing closed is the right option: as a user I don't want results that are "sometimes better ranked"; I like my systems to be predictable. But I also understand that calling out to Kendra isn't like other rerankers, because there's networking involved, and I understand a website may want some kind of fallback. So I can see how some administrators would want this behavior to be optional; there's another feature request here to support both hard failure and the current behavior.

msfroh commented 1 year ago

I don't think ?explain=true is a great approach, since the explain process is pretty expensive (it explains the scoring applied to every returned result).

Maybe we could add levels of verbosity to the explain param, so people can decide whether they want a detailed explanation for every result or just an overview of where their query spent time and what each component did to the query (which is roughly what Solr's debug parameter offers)?
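
Sketching that idea (the level names are invented; today explain only accepts a boolean):

```
# Current behavior: full per-hit score explanations
GET /my-index/_search?q=foo&explain=full

# Hypothetical lighter level: per-phase timing and rewritten queries only
GET /my-index/_search?q=foo&explain=overview
```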

dblock commented 1 year ago

@msfroh Is it a problem that explain is expensive when one wants debug info in the search response? Isn't this exactly what it was designed for?

msfroh commented 1 year ago

Yeah, I guess it's not a problem.

I'm not a fan of the wall of text from explaining scores (and I don't usually need score explanations unless I'm specifically investigating a relevance issue), but sure, we can put all the debug, info, and warning stuff in one place.

dblock commented 1 year ago

> Yeah, I guess it's not a problem.
>
> I'm not a fan of the wall of text from explaining scores (and I don't usually need score explanations unless I'm specifically investigating a relevance issue), but sure, we can put all the debug, info, and warning stuff in one place.

I do like the idea of adding more options to limit the depth/verbosity of explain, too! But that's a different feature.

msfroh commented 1 year ago

I was looking into this a bit more, and the explain option (when true) attaches an org.apache.lucene.search.Explanation object to each SearchHit, since it exists to explain why documents are scored the way they are. The Explanation class is entirely devoted to details of score calculations, which is not what this issue is about.

I still don't know where we could put additional (debug/warning) information about query processing, but explain is definitely not the right API.
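
For illustration, with explain enabled each hit gets an `_explanation` tree like the following (abbreviated), and every node in it describes a score computation:

```json
"_explanation": {
  "value": 1.2,
  "description": "weight(title:wind in 0) [PerFieldSimilarity], result of:",
  "details": [
    {
      "value": 1.2,
      "description": "score(freq=1.0), computed as boost * idf * tf from:",
      "details": []
    }
  ]
}
```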

dblock commented 1 year ago

@msfroh Just because something is implemented in a certain way doesn't mean it's not what users want an API to return. What would our users want? Then we can discuss the implementation.

msfroh commented 1 year ago

I don't believe that our users would want a scoring-focused API to start returning things unrelated to scoring.

dblock commented 1 year ago

That is indeed what explain is designed to do: _Wondering why a specific document ranks higher (or lower) for a query? You can use the explain API for an explanation of how the relevance score (_score) is calculated for every result._, so you have a point.

Another related set of concerns may be https://github.com/opensearch-project/OpenSearch/issues/1061?

macohen commented 1 year ago

#1061 is probably closer to what we're looking for here. Following that issue...

msfroh commented 1 year ago

In https://github.com/opensearch-project/ml-commons/issues/1150, @austintlee suggested the possible addition of an ext block to search responses, similar to the one in search requests, providing a flexible extension point for information returned in a search response. I think that idea is pretty intriguing and would satisfy the needs of this issue.
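
For reference, search requests already accept an `ext` section in the body; the idea would be to mirror it on the response. A sketch (the `my_plugin` section and the response-side content are hypothetical):

```
# Request: ext is an existing extension point in the search body
POST /my-index/_search
{
  "query": { "match_all": {} },
  "ext": {
    "my_plugin": { "rerank": true }
  }
}

# Response (proposed): a mirrored ext section for processor-supplied info
{
  "took": 42,
  "timed_out": false,
  "hits": { "total": { "value": 10, "relation": "eq" }, "hits": [] },
  "ext": {
    "my_plugin": {
      "warnings": ["reranker call failed; results returned in original order"]
    }
  }
}
```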

austintlee commented 1 year ago

I just created a PR and am looking to get feedback on that. Thanks.

saratvemulapalli commented 1 year ago

Reading through the context, I like the free-form ext block with an extension point. It cleanly abstracts out additional information without fiddling with SearchHits.
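
If the request-side ext machinery were reused for responses, a plugin could contribute a section like the one below. This is only a sketch against the existing SearchExtBuilder abstraction (package locations vary across OpenSearch versions, and the response-side wiring is exactly what the PR above would introduce):

```java
import java.io.IOException;
import java.util.List;
import java.util.Objects;

import org.opensearch.core.common.io.stream.StreamInput;
import org.opensearch.core.common.io.stream.StreamOutput;
import org.opensearch.core.xcontent.XContentBuilder;
import org.opensearch.search.SearchExtBuilder;

// Sketch: a plugin-defined ext section carrying human-readable warnings.
// The class and section name are hypothetical, not an existing API.
public class WarningsExtBuilder extends SearchExtBuilder {
    public static final String NAME = "warnings"; // hypothetical section name

    private final List<String> warnings;

    public WarningsExtBuilder(List<String> warnings) {
        this.warnings = warnings;
    }

    // Deserialize from the transport layer.
    public WarningsExtBuilder(StreamInput in) throws IOException {
        this.warnings = in.readStringList();
    }

    @Override
    public String getWriteableName() {
        return NAME;
    }

    @Override
    public void writeTo(StreamOutput out) throws IOException {
        out.writeStringCollection(warnings);
    }

    @Override
    public XContentBuilder toXContent(XContentBuilder builder, Params params) throws IOException {
        // Renders as: "warnings": ["..."] inside the response's ext block
        return builder.field(NAME, warnings);
    }

    @Override
    public int hashCode() {
        return Objects.hash(NAME, warnings);
    }

    @Override
    public boolean equals(Object obj) {
        return obj instanceof WarningsExtBuilder
            && Objects.equals(warnings, ((WarningsExtBuilder) obj).warnings);
    }
}
```

A search pipeline processor that fails open could then attach one of these to the response instead of only logging, and clients that don't know the section would simply ignore it.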