open-metadata / OpenMetadata

OpenMetadata is a unified metadata platform for data discovery, data observability, and data governance powered by a central metadata repository, in-depth column level lineage, and seamless team collaboration.
https://open-metadata.org
Apache License 2.0

Set max_analyzed_offset in HighlightBuilder to Fix Highlighting Errors for Large Fields #18495

Closed: sonika-shah closed this pull request 1 week ago

sonika-shah commented 1 week ago

Describe your changes:

Cherry-picked into 1.5.11.

Fixes an issue with highlighting large text fields in OpenSearch, where we were hitting a max_analyzed_offset error due to the highlight size limit.

Solution: set max_analyzed_offset directly in the HighlightBuilder at the query level.

Error Message:

The length of [description] field of doc in the index has exceeded [1000000] - maximum allowed to be analyzed for highlighting.

Both OpenSearch and Elasticsearch have slightly different ways to set this:

Elasticsearch: (screenshot)

OpenSearch: (screenshot)
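For illustration only, here is a minimal sketch of the query-level approach using the Elasticsearch high-level client's HighlightBuilder, assuming a client version that exposes maxAnalyzedOffset on the builder (the OpenSearch builder names the equivalent option slightly differently, as the screenshots show). The class name, helper method, field names, and the 999_999 value are assumptions for this sketch, not the exact code in the PR; the same cap could equally be applied to the fields built in the existing buildHighlights helper shown later in this thread.

```java
import org.elasticsearch.search.fetch.subphase.highlight.HighlightBuilder;

// Sketch: cap the analyzed text per highlighted field at the query level so that
// very large description fields no longer trip the index-level
// index.highlight.max_analyzed_offset limit (1,000,000 characters by default).
public final class HighlightOffsetExample {

  // Stay just below the default index-level limit.
  private static final int MAX_ANALYZED_OFFSET = 999_999;

  // Hypothetical helper: build a highlight field with the offset cap applied.
  static HighlightBuilder.Field limitedHighlightField(String fieldName) {
    HighlightBuilder.Field field = new HighlightBuilder.Field(fieldName);
    // Assumed setter name; OpenSearch exposes this option under a slightly different name.
    field.maxAnalyzedOffset(MAX_ANALYZED_OFFSET);
    return field;
  }

  // Usage: add the limited fields to the HighlightBuilder attached to the search request.
  static HighlightBuilder buildHighlightBuilder() {
    HighlightBuilder hb = new HighlightBuilder();
    hb.field(limitedHighlightField("description"));
    hb.field(limitedHighlightField("displayName"));
    return hb;
  }
}
```

Because the cap is set per request, the fix stays scoped to search queries: no mapping change, index setting change, or reindex is required.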

Solution from the discussion on these GitHub issues:

#

Type of change:

#

Checklist:

github-actions[bot] commented 1 week ago

The Java checkstyle failed.

Please run mvn spotless:apply in the root of your repository and commit the changes to this PR. You can also use pre-commit to automate the Java code formatting.

You can install the pre-commit hooks with make install_test precommit_install.

harshach commented 1 week ago

@sonika-shah let's try the following:

"description": {
  "type": "text",
  "analyzer": "om_analyzer",
  "term_vector": "with_positions_offsets"
}
private static HighlightBuilder buildHighlights(List<String> fields) {
  List<String> defaultFields = List.of(FIELD_DISPLAY_NAME, FIELD_DESCRIPTION, FIELD_DISPLAY_NAME_NGRAM);
  defaultFields = Stream.concat(defaultFields.stream(), fields.stream()).toList();
  HighlightBuilder hb = new HighlightBuilder();
  for (String field : defaultFields) {
    HighlightBuilder.Field highlightField = new HighlightBuilder.Field(field);
    highlightField.highlighterType("fvh"); 
    hb.field(highlightField);
  }
  hb.preTags(PRE_TAG);
  hb.postTags(POST_TAG);
  return hb;
}
sonika-shah commented 1 week ago

@harshach, we could also go with "term_vector": "with_positions_offsets", but it would increase storage, since each document would then keep term vectors alongside the actual data.

With the max_analyzed_offset query parameter, on the other hand, we can directly cap how much of the document text is analyzed for highlighting, without reindexing or raising the limit set on the index. It's a better overall solution. Discussion in this thread: https://discuss.elastic.co/t/for-large-texts-indexing-with-offsets-or-term-vectors-is-recommended/266115/2
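For contrast, here is a hypothetical sketch of the index-level alternative mentioned above, again with the Elasticsearch high-level REST client (the OpenSearch client mirrors it); the method name, index name parameter, and the 5,000,000 value are made up for illustration. Raising index.highlight.max_analyzed_offset changes the limit for every request against that index, and has to be applied to each affected index, which is why the per-query cap was preferred in this PR.

```java
import java.io.IOException;
import org.elasticsearch.action.admin.indices.settings.put.UpdateSettingsRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.settings.Settings;

// Hypothetical alternative (not what this PR does): raise the index-level highlight
// limit. This affects every highlight request against the index, not just one query.
public final class RaiseHighlightLimitExample {

  static void raiseHighlightLimit(RestHighLevelClient client, String indexName) throws IOException {
    UpdateSettingsRequest request = new UpdateSettingsRequest(indexName);
    request.settings(
        Settings.builder()
            // Bump the limit above the 1,000,000-character default (illustrative value).
            .put("index.highlight.max_analyzed_offset", 5_000_000)
            .build());
    client.indices().putSettings(request, RequestOptions.DEFAULT);
  }
}
```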

sonarcloud[bot] commented 1 week ago

Quality Gate passed for 'open-metadata-ingestion'

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarCloud