opensearch-project / neural-search

Plugin that adds dense neural retrieval into the OpenSearch ecosystem
Apache License 2.0

Fix: text chunking processor ingestion bug on multi-node cluster #713

Closed yuye-aws closed 5 months ago

yuye-aws commented 5 months ago

Description

On a multi-node cluster, the text chunking processor produces a "no such index" error if the configured number of shards is less than the number of nodes. This happens because some nodes do not hold any shard of the index and therefore have no local shard information for it. When the processor reads the max token count setting on such a node, indicesService fails to find the index information.

IndexService indexService = indicesService.indexServiceSafe(indexMetadata.getIndex());
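
To make the failure mode concrete, here is a minimal, self-contained sketch. It is a toy model, not OpenSearch code: `Node`, `fromLocalIndexService`, `fromClusterMetadata`, the index name `docs`, and the fallback value are all hypothetical names chosen for illustration. It assumes that a per-node lookup only knows indices with a local shard, while cluster-wide metadata is visible on every node. Reading the setting from the cluster-wide view is one way to avoid the error; this thread does not show whether the merged change does exactly that.

```java
import java.util.Map;
import java.util.Set;

public class MaxTokenCountLookup {
    // Illustrative fallback value (an assumption for this sketch).
    static final int DEFAULT_MAX_TOKEN_COUNT = 10000;

    /** Hypothetical stand-in for one cluster node; not an OpenSearch class. */
    static final class Node {
        // Indices that have at least one shard allocated on this node.
        private final Set<String> localIndices;
        // Index name -> max token count, assumed to be replicated to every node.
        private final Map<String, Integer> clusterMetadata;

        Node(Set<String> localIndices, Map<String, Integer> clusterMetadata) {
            this.localIndices = localIndices;
            this.clusterMetadata = clusterMetadata;
        }

        /** Models a node-local lookup: fails when the node holds no shard of the index. */
        int fromLocalIndexService(String index) {
            if (!localIndices.contains(index)) {
                throw new IllegalStateException("no such index [" + index + "]");
            }
            return clusterMetadata.getOrDefault(index, DEFAULT_MAX_TOKEN_COUNT);
        }

        /** Models a cluster-wide lookup: works on any node, falls back to the default. */
        int fromClusterMetadata(String index) {
            return clusterMetadata.getOrDefault(index, DEFAULT_MAX_TOKEN_COUNT);
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> metadata = Map.of("docs", 5000);
        Node withShard = new Node(Set.of("docs"), metadata);
        Node withoutShard = new Node(Set.of(), metadata);

        System.out.println(withShard.fromLocalIndexService("docs"));
        System.out.println(withoutShard.fromClusterMetadata("docs"));
        try {
            withoutShard.fromLocalIndexService("docs");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this model, a one-shard index on a three-node cluster leaves two nodes in the `withoutShard` position, which is why the error only shows up when the shard count is below the node count.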

Issues Resolved

Fix ingestion bug on multi-node cluster

Check List

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check here.

yuye-aws commented 5 months ago

Hi maintainers. This PR is a fix for the text chunking processor. Please attach the backport 2.x and backport 2.13 labels to this PR.

yuye-aws commented 5 months ago

This PR is still a work in progress. Before it can be merged, it must satisfy the following conditions:

yuye-aws commented 5 months ago

@model-collapse @zane-neo This PR is ready for review now. Please merge it once all CI workflows pass.

chishui commented 5 months ago

Shall we add an IT to cover this "configured shard number is less than the number of nodes" scenario? Can be done in a separate issue and PR.

zhichao-aws commented 5 months ago

Shall we add an IT to cover this "configured shard number is less than the number of nodes" scenario? Can be done in a separate issue and PR.

In the current CI, all ITs run on a single node. I think we can enhance the CI framework by adding a build with -PnumNodes=3. This would help us catch bugs in distributed scenarios at an early stage.

yuye-aws commented 5 months ago

I think we can enhance the CI framework by adding the build with -PnumNodes=3

Good point. We can follow the same process as ml-commons.

yuye-aws commented 5 months ago

@zane-neo The current Gradle checks are failing due to a model deployment issue. Could it be caused by the latest update in ml-commons, such as the async HTTP client?

vibrantvarun commented 5 months ago

We can't merge the PR until the BWC tests pass.

navneet1v commented 5 months ago

@model-collapse GH workflows are failing. Let's ensure GH Actions are successful before approving PRs.

vibrantvarun commented 5 months ago

Even the Gradle checks are failing @yuye-aws

vibrantvarun commented 5 months ago

"Model not deployed yet" error coming from ml-commons https://github.com/opensearch-project/ml-commons/issues/2382

zane-neo commented 5 months ago

"Model not deployed yet" error coming from ml-commons opensearch-project/ml-commons#2382

This is a separate issue related to the ml-commons main branch; we'll track it in a new issue. We'll merge this one for now, as it fixes a critical issue that could impact customers.

opensearch-trigger-bot[bot] commented 5 months ago

The backport to 2.13 failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-2.13 2.13
# Navigate to the new working tree
cd .worktrees/backport-2.13
# Create a new branch
git switch --create backport/backport-713-to-2.13
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 2d42408c70e01b95825744bea0182ff361090a4e
# Push it to GitHub
git push --set-upstream origin backport/backport-713-to-2.13
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-2.13

Then, create a pull request where the base branch is 2.13 and the compare/head branch is backport/backport-713-to-2.13.