yuye-aws closed this issue 2 months ago.
Raised a PR to fix the bug: https://github.com/opensearch-project/neural-search/pull/717
Please elaborate on the truncation logic and the overall background.
What is the expected behavior when we perform chunking with the delimiter algorithm, using " " as the delimiter and ["a b c", "", "", "d e f"] as the input? Candidate outputs (one plausible interpretation is sketched after this list):
1. ["a ", "b ", "c", "d ", "e ", "f"]
2. ["a b c", "d e f"]
3. ["a ", "b ", "c", "d ", "e f"]
4. ["a ", "b ", "c", "d e f"]
5. ["a ", "b c", "d e f"]
@yuye-aws if the PR has fixed the issue, can we close it?
Sure. I am closing the issue.
What is the bug?
The text chunking processor throws an exception if the number of produced chunks exceeds max_chunk_limit. The output field is then missing from the output document. Our customers do not wish to encounter data loss.
How can one reproduce the bug?
Use the text chunking processor and produce more chunks than max_chunk_limit.
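A hypothetical illustration of the failure mode, assuming the processor enforces the limit with a hard check before writing the output field; the names here are illustrative, not the actual plugin code or exception message:

```java
import java.util.List;

public class MaxChunkLimitRepro {

    // Assumed failure mode: when the chunk count exceeds the limit, the
    // processor throws instead of returning the chunks it already produced,
    // so the output field is never written to the document.
    static List<String> checkLimit(List<String> chunks, int maxChunkLimit) {
        if (chunks.size() > maxChunkLimit) {
            throw new IllegalArgumentException("Number of chunks [" + chunks.size()
                + "] exceeds max_chunk_limit [" + maxChunkLimit + "]");
        }
        return chunks;
    }

    public static void main(String[] args) {
        try {
            // Five chunks against a limit of 3 triggers the exception.
            checkLimit(List.of("c1", "c2", "c3", "c4", "c5"), 3);
        } catch (IllegalArgumentException e) {
            System.out.println("Reproduced: " + e.getMessage());
        }
    }
}
```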
What is the expected behavior?
There should be no data loss. Suppose the parameter max_chunk_limit takes the default value of 100. If the generated chunks exceed 100, we only keep the first 100 passages in the results. The same should hold when the user has specified multiple fields for chunking; please refer to the following example.
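A minimal sketch of the proposed truncation behavior, assuming the processor keeps the first max_chunk_limit chunks instead of throwing; class and method names are hypothetical, and this is not necessarily how the linked PR implements the fix:

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkTruncationSketch {

    // Proposed behavior: keep the first maxChunkLimit chunks and drop the
    // rest, so the output field is always populated and nothing beyond the
    // truncated tail is lost.
    static List<String> truncate(List<String> chunks, int maxChunkLimit) {
        if (chunks.size() <= maxChunkLimit) {
            return chunks;
        }
        return new ArrayList<>(chunks.subList(0, maxChunkLimit));
    }

    public static void main(String[] args) {
        List<String> chunks = List.of("c1", "c2", "c3", "c4", "c5");
        // With max_chunk_limit = 3, only [c1, c2, c3] is kept.
        System.out.println(truncate(chunks, 3));
    }
}
```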
What is your host/environment?
Linux and Mac.