[BUG] reranking does not work for nested fields

asfoorial commented 4 months ago

Hi all,

How to use cross-encoder model to rerank a nested text field?

I tried the below but it seems like rerank does not have any effect!

PUT /_search/pipeline/ms-marco-MiniLM-L-12-v2_rerank { “response_processors”: [ { “rerank”: { “ml_opensearch”: { “model_id”: “zyfmI44BdjEy2-xVoJ0V” }, “context”: { “document_fields”: [ “content.text”] } } } ] }

PUT test_nested { “mappings”: { “properties”: { “content”: { “type”: “nested”, “properties”: { “text”:{ “type”:“text” }, “page”:{ “type”:“long” } } } } } }

PUT test_nested/_doc/1 { “content”:[{“text”:“good morning”,“page”: 1}, {“text”:“good evening”,“page”: 2} ] }

PUT test_nested/_doc/2 { “content”:[{“text”:“studying math in my room”,“page”: 1}, {“text”:“playing playstation with my family”,“page”: 2} ] }

GET test_nested/_search?search_pipeline=ms-marco-MiniLM-L-12-v2_rerank { “query”: { “match_all”: {} }, “ext”: { “rerank”: { “query_context”: { “query_text”: “text" } } } }

navneet1v commented 4 months ago

@HenryL27 can you look into this? I am unable to assign this issue to you.

HenryL27 commented 4 months ago

Following the same steps, I get responses:

// query_text = "text"
{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": -4.130314,
    "hits": [
      {
        "_index": "test_nested",
        "_id": "1",
        "_score": -4.130314,
        "_source": {
          "content": [
            {
              "text": "good morning",
              "page": 1
            },
            {
              "text": "good evening",
              "page": 2
            }
          ]
        }
      },
      {
        "_index": "test_nested",
        "_id": "2",
        "_score": -4.130314,
        "_source": {
          "content": [
            {
              "text": "studying math in my room",
              "page": 1
            },
            {
              "text": "playing playstation with my family",
              "page": 2
            }
          ]
        }
      }
    ]
  },
  "profile": {
    "shards": []
  }
}
// query_text = "math"
{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": -1.7839141,
    "hits": [
      {
        "_index": "test_nested",
        "_id": "1",
        "_score": -1.7839141,
        "_source": {
          "content": [
            {
              "text": "good morning",
              "page": 1
            },
            {
              "text": "good evening",
              "page": 2
            }
          ]
        }
      },
      {
        "_index": "test_nested",
        "_id": "2",
        "_score": -1.7839141,
        "_source": {
          "content": [
            {
              "text": "studying math in my room",
              "page": 1
            },
            {
              "text": "playing playstation with my family",
              "page": 2
            }
          ]
        }
      }
    ]
  },
  "profile": {
    "shards": []
  }
}

So it would appear that it is doing something, although since the scores are the same for all the docs it's probably not doing the right thing. Nested fields are weird. @asfoorial what is the expected behavior in this case?

asfoorial commented 4 months ago

Sorry, I had a typo query_text="hello". It should return the document with id=1 as the first.

query_text="math" should return document with id=2 as the first.

Remember, rerank is expected to rerank documents based on "content.text" nested field with respect to query_text. So rerank should basically take the resulting documents from kNN search and rerank their content.text values. The scores should reflected to their corresponding documents.

asfoorial commented 4 months ago

Rerank is a crucial operation that brings a drastic improvement to search accuracy. Indexing each sentence as a document can occupy more storage space (especially if there is a lot of metadata per document) and takes longer indexing time (can be more than 20%) compared to nested kNN. So, having rerank working for nested fields is expected to bring major improvement to the overall experience in OpenSearch.

HenryL27 commented 4 months ago

It appears that reranking also doesn't work on fields inside of objects, e.g. properties.title. Pretty sure this code is the culprit. Could you confirm that you have this warning message in your logs?

Could not find field content.text in document 1 for reranking! Using the empty string instead.

asfoorial commented 4 months ago

I will give it a try. If this is a bug, is there any chance it will be fixed in OS 2.13.0?

HenryL27 commented 4 months ago

Unlikely. the fix seems nontrivial and 2.13 releases tomorrow.

asfoorial commented 4 months ago

It appears that reranking also doesn't work on fields inside of objects, e.g. properties.title. Pretty sure this code is the culprit. Could you confirm that you have this warning message in your logs?
Could not find field content.text in document 1 for reranking! Using the empty string instead.

Yes I am getting the same error.

asfoorial commented 4 months ago

Looking at the code, I noticed that the below method is checking fields and source but does not check the source in inner hits which is what we need. In fact rerank should be applied to inner hits so that if we return inner_hits they should be returned in a reranked order and the rerank score should be reflected in the parent document. In a practical scenario, I would have PDF documents that I index into OpenSearch documents. The content of each document will be converted into text chunks (sentences or paragraphs) and then saved into nested kNN in the document. Now, the user would be interested to find the most relevant PDF document and also the most relevant chunk in that document to an input query. So if I search for "math" then the result should have document id=2 as the first result and in the inner hits should have the { "text": "studying math in my room", "page": 1 } as the first element in the inner hits since it is the most relevant chunk.

private String contextFromSearchHit(final SearchHit hit, final String field) { if (hit.getFields().containsKey(field)) { Object fieldValue = hit.field(field).getValue(); return String.valueOf(fieldValue); } else if (hit.hasSource() && hit.getSourceAsMap().containsKey(field)) { Object sourceValue = ObjectPath.eval(field, hit.getSourceAsMap()); return String.valueOf(sourceValue); } else { log.warn( String.format( Locale.ROOT, "Could not find field %s in document %s for reranking! Using the empty string instead.", field, hit.getId() ) ); return ""; } }

HenryL27 commented 3 months ago

Hey @asfoorial, sorry it's taken me so long to get to this. I'm starting work on implementing a nested context source fetcher and I wanted to run my design by you to make sure it satisfies your requirements.

We will implement a new context source fetcher, nested_document_field that does nested stuff:
```
{
"rerank": {
"ml_opensearch": {
  "model_id": "ael;rjthnerotphiwntr"
},
"context": {
  "nested_document_field": "content.text"
}
}
}
```
Another alternative is to allow the nested fetcher to accept a list of fields and concatenate them, as with the regular document field fetcher. This gets complicated really quickly when you add more than one nested field, (cartesian products are hard to sort), so we would limit it to one nested field in the list of fields.
At rerank time, we'll rescore every sub-document, sort within each document, and then take the max sub-score as the overall document score and re-sort the top level docs.

How does that sound?

Also do you anticipate a need to rerank based on deeper nestings than one layer or can I impose another limit there? At a certain point I worry about performance

navneet1v commented 3 months ago

@HenryL27 can we use the same document context source and if string contains . in field name then its nested field. That us how opensearch also handle nested fields.

Only limitation I know for this is if its an array field.

HenryL27 commented 3 months ago

@navneet1v certainly, fixing general xpaths was the first thing I did yesterday. I this issue is specifically about nested arrays - see the index mappings at the beginning of this thread - so the array stuff is precisely what we need to solve, right? I suppose there is a solution where the logic to deal with nested arrays is just a part of the document context fetcher. You have to impose the limitation of one nested field as in the "alternative solution" above, or just forgo inner-hit sorting altogether. And we still have the question of doubly-/triply-/etc nested documents. e.g.

{
  "_source": {
    "inner": [
      { "inner2": [
        { "text": "hello" },
        { "text": "world" }
      ]},
      { "inner2": [
        { "text": "goodbye" },
        { "text": "moon" }
      ]}
    ]
  }
}

asfoorial commented 3 months ago

@HenryL27 thank you and the team for returning back to me on this issue. Yes the first open will satisfy my requirement. I also recommend keeping it simple with one nested field and one nesting level.

In addition, it would be also good to allow returning inner hits with sub-documents sorted based on the reranking result. One question about this though, will there be multiple calls to the cross encoder model (a call per document?) I am only concerned about the performance here.

One last thing, current I use a truncate processor in the search pipeline to prevent users from abusing ml nodes with reranking over large number of hits at a time. This works well for normal fields. But how would work for nested fields? And is there a way have a similar truncate behavior done on nested fields?

Thanks

HenryL27 commented 3 months ago

I also recommend keeping it simple with one nested field and one nesting level.

Music to my ears!

We'll try to batch the inferences, although your main source of latency is gonna be based on the number of things you're reranking, not the number of inference invocations/network calls. If we want to re-score all the nests then we need to re-score all the nests.

I don't think the truncation processor works on nested fields, but I could be wrong. And as far as I'm aware there isn't one that works specifically on nests... so this will potentially make your system a little abuseable, although I'm not sure of the utility of this kind of abuse apart from a malicious attack. But maybe a nested truncation feature request is in order?

And yes, inner sorting should be easy.

vamshin commented 1 month ago

@HenryL27 are there any plans to pick this up?

opensearch-project / neural-search

[BUG] reranking does not work for nested fields #657