open-metadata / OpenMetadata

OpenMetadata is a unified metadata platform for data discovery, data observability, and data governance powered by a central metadata repository, in-depth column level lineage, and seamless team collaboration.
https://open-metadata.org
Apache License 2.0

Set max_analyzed_offset in HighlightBuilder to Fix Highlighting Errors for Large Fields #18495

Closed: sonika-shah closed this pull request 1 week ago

sonika-shah commented 1 week ago

Describe your changes:

Cherry-picked into 1.5.11.

Fixes an issue with highlighting large text fields in OpenSearch, where we were hitting a max_analyzed_offset error due to the highlight size limit.

Solution: set max_analyzed_offset directly in the HighlightBuilder at the query level.

Error Message:

The length of [description] field of doc in the index has exceeded [1000000] - maximum allowed to be analyzed for highlighting.

Both OpenSearch and Elasticsearch have slightly different ways to set this:

Elasticsearch: (screenshot)

OpenSearch: (screenshot)
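For illustration only, here is a minimal sketch of the query-level approach using the Elasticsearch high-level client's HighlightBuilder, assuming a client version that exposes maxAnalyzedOffset on the builder (the OpenSearch builder names the equivalent option slightly differently, as the screenshots show). The class name, helper method, field names, and the 999_999 value are assumptions for this sketch, not the exact code in the PR; the same cap could equally be applied to the fields built in the existing buildHighlights helper shown later in this thread.

```java
import org.elasticsearch.search.fetch.subphase.highlight.HighlightBuilder;

// Sketch: cap the analyzed text per highlighted field at the query level so that
// very large description fields no longer trip the index-level
// index.highlight.max_analyzed_offset limit (1,000,000 characters by default).
public final class HighlightOffsetExample {

  // Stay just below the default index-level limit.
  private static final int MAX_ANALYZED_OFFSET = 999_999;

  // Hypothetical helper: build a highlight field with the offset cap applied.
  static HighlightBuilder.Field limitedHighlightField(String fieldName) {
    HighlightBuilder.Field field = new HighlightBuilder.Field(fieldName);
    // Assumed setter name; OpenSearch exposes this option under a slightly different name.
    field.maxAnalyzedOffset(MAX_ANALYZED_OFFSET);
    return field;
  }

  // Usage: add the limited fields to the HighlightBuilder attached to the search request.
  static HighlightBuilder buildHighlightBuilder() {
    HighlightBuilder hb = new HighlightBuilder();
    hb.field(limitedHighlightField("description"));
    hb.field(limitedHighlightField("displayName"));
    return hb;
  }
}
```

Because the cap is set per request, the fix stays scoped to search queries: no mapping change, index setting change, or reindex is required.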

Solution from the discussion on these GitHub issues:

#

Type of change:

#

Checklist:

github-actions[bot] commented 1 week ago

The Java checkstyle failed.

Please run mvn spotless:apply in the root of your repository and commit the changes to this PR. You can also use pre-commit to automate the Java code formatting.

You can install the pre-commit hooks with make install_test precommit_install.

harshach commented 1 week ago

@sonika-shah let's try the following:

"description": {
  "type": "text",
  "analyzer": "om_analyzer",
  "term_vector": "with_positions_offsets"
}
private static HighlightBuilder buildHighlights(List<String> fields) {
  List<String> defaultFields = List.of(FIELD_DISPLAY_NAME, FIELD_DESCRIPTION, FIELD_DISPLAY_NAME_NGRAM);
  defaultFields = Stream.concat(defaultFields.stream(), fields.stream()).toList();
  HighlightBuilder hb = new HighlightBuilder();
  for (String field : defaultFields) {
    HighlightBuilder.Field highlightField = new HighlightBuilder.Field(field);
    highlightField.highlighterType("fvh"); 
    hb.field(highlightField);
  }
  hb.preTags(PRE_TAG);
  hb.postTags(POST_TAG);
  return hb;
}
sonika-shah commented 1 week ago

@harshach, we could also go with "term_vector": "with_positions_offsets", but it would increase storage, since each document would then keep term vectors alongside the actual data.

With the max_analyzed_offset query parameter, on the other hand, we can directly cap how much of the document text is analyzed for highlighting, without reindexing or raising the limit set on the index. It's a better overall solution. Discussion in this thread: https://discuss.elastic.co/t/for-large-texts-indexing-with-offsets-or-term-vectors-is-recommended/266115/2
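For contrast, here is a hypothetical sketch of the index-level alternative mentioned above, again with the Elasticsearch high-level REST client (the OpenSearch client mirrors it); the method name, index name parameter, and the 5,000,000 value are made up for illustration. Raising index.highlight.max_analyzed_offset changes the limit for every request against that index, and has to be applied to each affected index, which is why the per-query cap was preferred in this PR.

```java
import java.io.IOException;
import org.elasticsearch.action.admin.indices.settings.put.UpdateSettingsRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.settings.Settings;

// Hypothetical alternative (not what this PR does): raise the index-level highlight
// limit. This affects every highlight request against the index, not just one query.
public final class RaiseHighlightLimitExample {

  static void raiseHighlightLimit(RestHighLevelClient client, String indexName) throws IOException {
    UpdateSettingsRequest request = new UpdateSettingsRequest(indexName);
    request.settings(
        Settings.builder()
            // Bump the limit above the 1,000,000-character default (illustrative value).
            .put("index.highlight.max_analyzed_offset", 5_000_000)
            .build());
    client.indices().putSettings(request, RequestOptions.DEFAULT);
  }
}
```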

sonarcloud[bot] commented 1 week ago

Quality Gate passed for 'open-metadata-ingestion'

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarCloud