opensearch-project / neural-search

Plugin that adds dense neural retrieval into the OpenSearch ecosystem
Apache License 2.0

[FEATURE] Add ignore missing field to text chunking processor #906

Closed IanMenendez closed 1 month ago

IanMenendez commented 1 month ago

What solution would you like?

Currently, if a document is ingested by a text chunking processor and the input field is null, the processor outputs an empty list. There is no way to skip the text chunking processor when the field does not exist.

The proposed solution is to add an ignore_missing field to the text chunking processor.

If ignore_missing == true, fields that should be chunked but do not exist are skipped instead of producing an empty list.

Example:

Processor:

    {
      "text_chunking": {
        "ignore_missing": true,
        "field_map": {
          "body": "body_chunk"
        },
        "algorithm": {
          "fixed_token_length": {
            "token_limit": 10,
            "tokenizer": "letter"
          }
        }
      }
    }

Input:

    {
      "name": "OpenSearch"
    }

Output:

    {
      "name": "OpenSearch"
    }

If ignore_missing == false, the processor continues to work as it currently does: fields that do not exist produce an empty list as output.

Processor:

    {
      "text_chunking": {
        "ignore_missing": false,
        "field_map": {
          "body": "body_chunk"
        },
        "algorithm": {
          "fixed_token_length": {
            "token_limit": 10,
            "tokenizer": "letter"
          }
        }
      }
    }

Input:

    {
      "name": "OpenSearch"
    }

Output:

    {
      "name": "OpenSearch",
      "body_chunk": []
    }

The default value would be ignore_missing = false
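
The two behaviours above can be sketched in a few lines of Python. This is only an illustration of the proposed semantics, not the plugin's actual implementation (which is written in Java). The function names apply_text_chunking and simple_chunker are hypothetical, and the whitespace-based chunker is a stand-in for the real fixed_token_length algorithm.

```python
from typing import Any, Callable, Dict, List


def simple_chunker(text: str, token_limit: int = 10) -> List[str]:
    """Hypothetical stand-in for the real chunking algorithm:
    split on whitespace and group every `token_limit` tokens into one chunk."""
    tokens = text.split()
    return [
        " ".join(tokens[i:i + token_limit])
        for i in range(0, len(tokens), token_limit)
    ]


def apply_text_chunking(
    doc: Dict[str, Any],
    field_map: Dict[str, str],
    ignore_missing: bool = False,
    chunker: Callable[[str], List[str]] = simple_chunker,
) -> Dict[str, Any]:
    """Return a copy of `doc` with the chunked output fields added."""
    result = dict(doc)
    for input_field, output_field in field_map.items():
        value = doc.get(input_field)
        if value is None:
            if ignore_missing:
                # Proposed behaviour: skip the field, write nothing.
                continue
            # Behaviour described in this issue as current: emit an empty list.
            result[output_field] = []
            continue
        result[output_field] = chunker(value)
    return result


doc = {"name": "OpenSearch"}
field_map = {"body": "body_chunk"}

skipped = apply_text_chunking(doc, field_map, ignore_missing=True)
defaulted = apply_text_chunking(doc, field_map, ignore_missing=False)

print(skipped)    # {'name': 'OpenSearch'}
print(defaulted)  # {'name': 'OpenSearch', 'body_chunk': []}
```

With ignore_missing left at its default of false, existing pipelines keep producing the empty list, so the change stays backward compatible.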

What alternatives have you considered?

To my knowledge, there is no alternative to this.

yuye-aws commented 1 month ago

Left a few review comments in https://github.com/opensearch-project/neural-search/pull/907

martin-gaievski commented 1 month ago

Can we change the field name to "skip_if_absent" or something of this sort? The problem with "ignore" is that it is ambiguous: it does not specify what will happen when the text is empty.

vibrantvarun commented 1 month ago

> can we change the field name to "skip_if_absent" or something of this sort? Problem with "ignore" is that it has ambiguity of not specifying what will happen in case text is empty.

+1 to @martin-gaievski

IanMenendez commented 1 month ago

@martin-gaievski @vibrantvarun I do not think the field name "skip_if_absent" makes sense.

There are tons of OpenSearch ingest processors that currently have the ignore_missing field name.

Examples:

https://opensearch.org/docs/latest/ingest-pipelines/processors/split/#configuration-parameters
https://opensearch.org/docs/latest/ingest-pipelines/processors/lowercase/#configuration-parameters
https://opensearch.org/docs/latest/ingest-pipelines/processors/dissect/#configuration-parameters

I prefer ignore_missing to keep consistency with other ingest processors.

martin-gaievski commented 1 month ago

If other processors have a field with similar functionality, then I agree that this name makes sense, although semantically it's not the best. Thanks for checking the config of other processors.

yuye-aws commented 1 month ago

Closing this issue as the PR has been merged. Thanks for your contribution @IanMenendez !