opensearch-project / neural-search

Plugin that adds dense neural retrieval into the OpenSearch ecosystem
Apache License 2.0
57 stars 58 forks

[BUG] Text chunking max_chunk_limit error #716

Closed yuye-aws closed 2 months ago

yuye-aws commented 2 months ago

What is the bug?

The text chunking processor throws the following exception if the number of produced chunks exceeds max_chunk_limit. The output field is then missing from the output document. Our customers do not wish to encounter data loss.

The number of chunks produced by text_chunking processor has exceeded the allowed maximum of [100]

How can one reproduce the bug?

Use the text chunking processor and produce more chunks than max_chunk_limit.

What is the expected behavior?

There should be no data loss. Suppose the max_chunk_limit parameter takes its default value of 100. If more than 100 chunks are generated, we keep only the first 100 passages in the results. If the user has specified multiple fields for chunking, please refer to the following example.

What is your host/environment?

Linux and Mac.

yuye-aws commented 2 months ago

Raised a PR to fix the bug: https://github.com/opensearch-project/neural-search/pull/717

model-collapse commented 2 months ago

Please elaborate on the truncation logic and add an overall background description.

yuye-aws commented 2 months ago

Expected behavior when we perform chunking with the delimiter algorithm, using " " as the delimiter and ["a b c", "", "", "d e f"] as the input:

  1. When max_chunk_limit >= 6 or max_chunk_limit == -1, the output should be ["a ", "b ", "c", "d ", "e ", "f"]
  2. When max_chunk_limit <= 2 and max_chunk_limit != -1, the output should be ["a b c", "d e f"]
  3. When max_chunk_limit == 5, the output should be ["a ", "b ", "c", "d ", "e f"]
  4. When max_chunk_limit == 4, the output should be ["a ", "b ", "c", "d e f"]
  5. When max_chunk_limit == 3, the output should be ["a ", "b c", "d e f"]
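
The five cases above are consistent with one rule: empty strings produce nothing, every remaining non-empty string is guaranteed at least one chunk, and once a string has only one slot left, the rest of it is emitted as a single chunk. The following is a minimal Python sketch of that rule, not the plugin's actual code. The names `chunk_by_delimiter`, `chunk_fields`, and `NO_LIMIT` are hypothetical, and the rule itself is inferred from the listed outputs.

```python
from typing import List

NO_LIMIT = -1  # assumption: -1 disables the limit, as in the cases above


def chunk_by_delimiter(text: str, delimiter: str, budget: int) -> List[str]:
    """Split after each delimiter; when one slot is left, emit the remainder whole."""
    if not delimiter:
        raise ValueError("delimiter must be non-empty")
    chunks: List[str] = []
    start = 0
    while budget == NO_LIMIT or len(chunks) < budget - 1:
        end = text.find(delimiter, start)
        if end == -1:
            break
        end += len(delimiter)  # keep the delimiter at the end of the chunk
        chunks.append(text[start:end])
        start = end
    if start < len(text):
        chunks.append(text[start:])  # remainder becomes the final chunk
    return chunks


def chunk_fields(texts: List[str], delimiter: str, max_chunk_limit: int) -> List[str]:
    """Chunk several strings while sharing one overall chunk limit."""
    non_empty = [t for t in texts if t]  # empty strings produce no chunks
    result: List[str] = []
    for i, text in enumerate(non_empty):
        if max_chunk_limit == NO_LIMIT:
            budget = NO_LIMIT
        else:
            # Reserve one slot for each string still to come, and always
            # allow the current string at least one chunk.
            strings_left = len(non_empty) - i - 1
            budget = max(1, max_chunk_limit - len(result) - strings_left)
        result.extend(chunk_by_delimiter(text, delimiter, budget))
    return result


if __name__ == "__main__":
    sample = ["a b c", "", "", "d e f"]
    for limit in (-1, 6, 5, 4, 3, 2):
        print(limit, chunk_fields(sample, " ", limit))
```

Under this sketch no input text is dropped: text that would exceed the limit is merged into the last chunk of its string rather than discarded.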

martin-gaievski commented 2 months ago

@yuye-aws if the PR has fixed the issue, can we close it?

yuye-aws commented 2 months ago

Sure. I am closing the issue.