opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0

[BUG] Update_by_query call updates document even if ingest pipeline processor has failed with exception #14337

Open martin-gaievski opened 3 weeks ago

martin-gaievski commented 3 weeks ago

Describe the bug

Document field values are updated after an update_by_query call even though an ingest pipeline is configured and one of the processors in that pipeline has failed.

Related component

Indexing

To Reproduce

  1. Set up a cluster with the OpenSearch 2.11 distribution and the following plugins: ml-commons, knn, neural. Create an index with settings similar to the following:
    {
    "settings": {
        "index.knn": true,
        "default_pipeline": "pipeline-test"
    },
    "mappings": {
        "_source": {
            "excludes": [
                "passage_embedding"
            ]
        },
        "properties": {
            "passage_embedding": {
                "type": "knn_vector",
                "dimension": 1536,
                "method": {
                    "name": "hnsw",
                    "space_type": "cosinesimil",
                    "engine": "nmslib",
                    "parameters": {
                        "ef_construction": 512,
                        "m": 8
                    }
                }
            },
            "name": {
                "type": "text"
            },
            "passage_text": {
                "type": "text"
            }
        }
    }
    }
  2. Set up a model using a remote connector of ml-commons (https://opensearch.org/docs/latest/ml-commons-plugin/remote-models/connectors/), and configure it so that it throttles requests. In our test we used an OpenAI model configured to accept 6 requests per minute. Note the model ID of that model.
  3. Create an ingest pipeline with at least one processor that has the "ignore_failure" flag set to "false":
    PUT /_ingest/pipeline/pipeline-test
    {
    "description": "An NLP ingest pipeline",
    "processors": [
        {
            "text_embedding": {
                "model_id": "<model_id>",
                "field_map": {
                    "name": "passage_embedding"
                },
                "ignore_failure": false
            }
        }
    ]
    }
  4. Ingest several documents:
    POST /_bulk
    { "index": { "_index": "index-test" } }
    { "name": "permission", "test": "Writing a list of random sentences is harder than I initially thought it would be.", "doc_keyword": "workable", "doc_index": 4976 }
    { "index": { "_index": "index-test" } }
    { "name": "sister", "test": "The fifty mannequin heads floating in the pool kind of freaked them out", "doc_keyword": "angry"}
    { "index": { "_index": "index-test" } }
    { "name": "hair", "test": "Too many prisons have become early coffins", "doc_keyword": "likeable", "doc_index": 2351  }
    { "index": { "_index": "index-test" } }
    { "name": "editor", "test": "Greetings from the real universe", "doc_index": 9871 }
    { "index": { "_index": "index-test" } }
    { "name": "statement", "test": "People keep telling me orange but I still prefer pink", "doc_keyword": "entire", "doc_index": 8242  } 
  5. Check that there are no documents with an empty passage_embedding value:
    GET /index-test/_search
    {
    "query": {
        "bool": {
            "must_not": [
                {
                    "exists": {
                        "field": "passage_embedding"
                    }
                }
            ]
        }
    }
    }
  6. Execute the update_by_query request multiple times until you get an error from the model:
    POST /index-test/_update_by_query
    {
    "query": {
        "range": {
            "doc_index": {
                "gte": 4000,
                "lte": 5000
            }
        }
    },
    "script": {
        "source": "ctx._source.doc_index++; ctx._source.doc_keyword=\"key1\";ctx._source.test=\"Text random 1\"",
        "lang": "painless"
    }
    }
  7. Run the check for documents with an empty passage_embedding again. If the search returns any hits (>= 1), there are documents without embeddings. This is not the correct behavior: all documents were ingested with embeddings, and the only operation that could have caused embeddings to disappear was the update:
    GET /index-test/_search
    {
    "query": {
        "bool": {
            "must_not": [
                {
                    "exists": {
                        "field": "passage_embedding"
                    }
                }
            ]
        }
    }
    }
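
The check in steps 5 and 7 can be scripted so that the before/after comparison is not done by eye. Below is a minimal Python sketch, not part of the original report: the helper names (`missing_field_query`, `count_hits`, `embeddings_lost`) are hypothetical, and it only builds the request body and reads a `_search` response that has already been parsed into a dict. Sending the request to the cluster is left out.

```python
def missing_field_query(field="passage_embedding"):
    """Build the search body from steps 5 and 7: documents where `field` does not exist."""
    return {"query": {"bool": {"must_not": [{"exists": {"field": field}}]}}}


def count_hits(search_response):
    """Read the total hit count from a parsed _search response."""
    total = search_response["hits"]["total"]
    # The total is normally an object like {"value": 3, "relation": "eq"};
    # a bare integer is also handled in case the response was requested that way.
    return total["value"] if isinstance(total, dict) else total


def embeddings_lost(before_response, after_response):
    """True if more documents lack the field after the update than before it."""
    return count_hits(after_response) > count_hits(before_response)
```

Run the query from `missing_field_query()` once after step 4 and once after step 6, then pass both parsed responses to `embeddings_lost`. A result of `True` corresponds to the behavior reported here.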

Expected behavior

Because the processor has been configured with "ignore_failure": false, we expect the update call to fail and no changes to be stored.

Additional Details

Plugins ml-commons, k-NN, neural-search

Additional context

I tried the same scenario without the following excludes setting for the "passage_embedding" field, and it works as expected:

        "_source": {
            "excludes": [
                "passage_embedding"
            ]
        },

I assume that behind the scenes the document is still updated in that case too, but because all fields are included in _source, the passage_embedding value is copied over from the original document.
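
To make this assumption concrete, here is a toy Python model of it. It is a sketch of the hypothesis only and says nothing about how OpenSearch is actually implemented: `store`, `update`, `failing_embed`, and the `write_on_failure` switch are all hypothetical names introduced for illustration. The model treats an update as "start from the stored _source, apply the script, run the embedding processor, write the result".

```python
def store(doc, excludes):
    """Return (indexed_fields, stored_source) for a document being written.

    Excluded fields are indexed but dropped from the stored _source.
    """
    indexed = dict(doc)
    source = {k: v for k, v in doc.items() if k not in excludes}
    return indexed, source


def failing_embed(text):
    """Stand-in for a throttled model call inside the text_embedding processor."""
    raise RuntimeError("model throttled")


def update(stored_source, script, embed, excludes, write_on_failure):
    """Toy update_by_query for one document.

    write_on_failure=True models the reported behavior (document is written
    even though the processor failed); False models the expected behavior.
    """
    doc = dict(stored_source)  # assumption: the update starts from _source
    script(doc)
    try:
        doc["passage_embedding"] = embed(doc["name"])
    except RuntimeError:
        if not write_on_failure:
            return None  # expected: nothing is written
    return store(doc, excludes)


def bump(doc):
    """Equivalent of the doc_index++ part of the painless script."""
    doc["doc_index"] += 1


original = {"name": "permission", "doc_index": 4976, "passage_embedding": [0.1, 0.2]}

# With the excludes setting, _source has no embedding, so a failed run loses it.
_, source = store(original, {"passage_embedding"})
with_excludes = update(source, bump, failing_embed, {"passage_embedding"}, True)

# Without excludes, the old embedding is still in _source and survives the write.
_, full_source = store(original, set())
without_excludes = update(full_source, bump, failing_embed, set(), True)
```

In this model `with_excludes[0]` has an incremented `doc_index` but no `passage_embedding`, while `without_excludes[0]` still carries the old embedding. That matches both observations in this report, although the old embedding would then be stale relative to any updated text.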

peternied commented 2 weeks ago

[Triage - attendees 1 2 3 4 5] @martin-gaievski Thanks for creating this issue, could you create a pull request to fix this issue?