opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0

[BUG] Ingest pipeline bulk update issue #16663

Open NehaV0307 opened 5 days ago

NehaV0307 commented 5 days ago

Describe the bug

The ingest pipeline works for single create, index, and update calls. It also works for bulk create and bulk index. Only bulk update does not run the pipeline.

Related component

Other

To Reproduce

  1. Create ingest pipeline

PUT _ingest/pipeline/update_timestamp
{
  "description": "Automatically updates the 'updated' field on insert or update",
  "processors": [
    {
      "set": {
        "field": "updated",
        "value": "{{_ingest.timestamp}}"
      }
    }
  ]
}

Output

{ "acknowledged": true }

  2. Create index

PUT /on_boarding_employees-1
{
  "settings": {
    "index": {
      "default_pipeline": "update_timestamp"
    }
  }
}

Output

{ "acknowledged": true, "shards_acknowledged": true, "index": "on_boarding_employees-1" }

Adding Doc:

POST /on_boarding_employees-1/_doc
{
  "type": "ONBOARDING_EMPLOYEE",
  "name": "Rahul"
}

Output

{ "_index": "on_boarding_employees-1", "_id": "9f2pM5MB70XT8uT4kP1K", "_version": 1, "result": "created", "_shards": { "total": 2, "successful": 2, "failed": 0 }, "_seq_no": 0, "_primary_term": 1 }

Match query Output:

{ "took": 620, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": { "value": 1, "relation": "eq" }, "max_score": 1, "hits": [ { "_index": "on_boarding_employees-1", "_id": "9f2pM5MB70XT8uT4kP1K", "_score": 1, "_source": { "name": "Rahul", "type": "ONBOARDING_EMPLOYEE", "updated": "2024-11-16T06:29:30.826236733Z" } } ] } }

Normal Update:

POST /on_boarding_employees-1/_update/9f2pM5MB70XT8uT4kP1K
{
  "doc": {
    "type": "ONBOARDING_EMPLOYEE_UPDATED"
  }
}

Output

{ "_index": "on_boarding_employees-1", "_id": "9f2pM5MB70XT8uT4kP1K", "_version": 2, "result": "updated", "_shards": { "total": 2, "successful": 2, "failed": 0 }, "_seq_no": 1, "_primary_term": 1 }

Match query Output:

{ "took": 268, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": { "value": 1, "relation": "eq" }, "max_score": 1, "hits": [ { "_index": "on_boarding_employees-1", "_id": "9f2pM5MB70XT8uT4kP1K", "_score": 1, "_source": { "name": "Rahul", "type": "ONBOARDING_EMPLOYEE_UPDATED", "updated": "2024-11-16T06:33:05.478645288Z" } } ] } }

Bulk Update:

POST /on_boarding_employees-1/_bulk?pipeline=update_timestamp
{"update":{"_id":"9f2pM5MB70XT8uT4kP1K"}}
{"doc":{"type":"ONBOARDING_EMPLOYEE14","name":"Aman2"}}
{"update":{"_id":"9v2xM5MB70XT8uT4uv0x"}}
{"doc":{"type":"ONBOARDING_EMPLOYEE13","name":"Neha"}}
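For anyone reproducing this from a script instead of the console, here is a minimal Python sketch that serializes the same bulk update body as newline-delimited JSON. The helper name build_bulk_update_body is hypothetical, and sending the body to the cluster is left out on purpose, since client setup varies.

```python
import json


def build_bulk_update_body(updates):
    """Serialize (doc_id, partial_doc) pairs into an NDJSON _bulk body.

    Hypothetical helper for illustration; it only builds the payload.
    """
    lines = []
    for doc_id, partial_doc in updates:
        # Action line, followed by the partial document for that action.
        lines.append(json.dumps({"update": {"_id": doc_id}}))
        lines.append(json.dumps({"doc": partial_doc}))
    # A bulk body is one JSON object per line and must end with a newline.
    return "\n".join(lines) + "\n"


body = build_bulk_update_body([
    ("9f2pM5MB70XT8uT4kP1K", {"type": "ONBOARDING_EMPLOYEE14", "name": "Aman2"}),
    ("9v2xM5MB70XT8uT4uv0x", {"type": "ONBOARDING_EMPLOYEE13", "name": "Neha"}),
])
print(body)
```

The body would be posted to /on_boarding_employees-1/_bulk?pipeline=update_timestamp, as in the request above.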

Match query Output:

{ "took": 777, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": { "value": 2, "relation": "eq" }, "max_score": 1, "hits": [ { "_index": "on_boarding_employees-1", "_id": "9v2xM5MB70XT8uT4uv0x", "_score": 1, "_source": { "name": "Neha", "type": "ONBOARDING_EMPLOYEE13", "updated": "2024-11-16T06:38:25.841280080Z" } }, { "_index": "on_boarding_employees-1", "_id": "9f2pM5MB70XT8uT4kP1K", "_score": 1, "_source": { "name": "Aman2", "type": "ONBOARDING_EMPLOYEE14", "updated": "2024-11-16T06:33:05.478645288Z" } } ] } }

Expected behavior

The expected behavior is that the bulk update refreshes the updated field, but after the bulk operation it keeps its previous value: "updated": "2024-11-16T06:33:05.478645288Z"
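A small Python sketch of how the stale value can be spotted programmatically: it compares the updated value recorded before the bulk call with the hits returned afterwards. The function name stale_ids is hypothetical, and the hits are trimmed copies of the search output above. Only the document whose earlier value is shown in this report is checked.

```python
def stale_ids(before, hits):
    """Return ids whose 'updated' value is unchanged compared to `before`.

    `before` maps document id -> 'updated' value seen before the bulk update.
    Ids that are missing from `before` are not checked.
    """
    return [
        hit["_id"]
        for hit in hits
        if hit["_id"] in before
        and hit["_source"].get("updated") == before[hit["_id"]]
    ]


# Value observed after the single (non-bulk) update.
before = {"9f2pM5MB70XT8uT4kP1K": "2024-11-16T06:33:05.478645288Z"}

# Hits from the match query that followed the bulk update.
hits = [
    {"_id": "9v2xM5MB70XT8uT4uv0x",
     "_source": {"name": "Neha", "type": "ONBOARDING_EMPLOYEE13",
                 "updated": "2024-11-16T06:38:25.841280080Z"}},
    {"_id": "9f2pM5MB70XT8uT4kP1K",
     "_source": {"name": "Aman2", "type": "ONBOARDING_EMPLOYEE14",
                 "updated": "2024-11-16T06:33:05.478645288Z"}},
]

stale = stale_ids(before, hits)
print(stale)
```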

Additional Details

No response

gaobinlong commented 2 days ago

Similar issue: https://github.com/opensearch-project/OpenSearch/issues/10864. The root cause is that the Update API converts the updateRequest to an indexRequest if the document exists, so the default ingest pipeline is executed, whereas the Bulk API keeps the updateRequest as it is.
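To make the difference concrete, here is a toy Python model of the two paths as described in this comment. It is not OpenSearch code: Request, run_default_pipeline, and handle are hypothetical names, and the "pipeline" is just a stand-in for the set processor of update_timestamp.

```python
from dataclasses import dataclass


@dataclass
class Request:
    kind: str  # "index" or "update"
    doc: dict  # full document for index, partial document for update


def run_default_pipeline(doc: dict, now: str) -> dict:
    # Stand-in for the `set` processor: overwrite the 'updated' field.
    return {**doc, "updated": now}


def handle(request: Request, existing: dict, now: str, via_bulk: bool) -> dict:
    if request.kind == "update":
        merged = {**existing, **request.doc}
        if via_bulk:
            # Bulk path as described: the request stays an update request,
            # so the pipeline is never applied.
            return merged
        # Single Update API path as described: converted to an index request.
        request = Request("index", merged)
    return run_default_pipeline(request.doc, now)


existing = {"name": "Rahul", "type": "ONBOARDING_EMPLOYEE", "updated": "t0"}
update = Request("update", {"type": "ONBOARDING_EMPLOYEE_UPDATED"})

single = handle(update, existing, "t1", via_bulk=False)
bulk = handle(update, existing, "t1", via_bulk=True)
print(single["updated"], bulk["updated"])  # prints: t1 t0
```

In this model the single update picks up the new timestamp while the bulk update keeps the old one, which matches the output in the report.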

From checking the code, I think ingest pipelines were designed only for index operations, not for update operations. The Index API supports the pipeline parameter but the Update API doesn't, so maybe we should prevent the default ingest pipeline from being executed in the Update API.

For this use case I looked for a workaround. One option is to use a Painless script to set the updated field, like this:

POST /on_boarding_employees-1/_update/1
{
  "script": {
    "source": "ctx._source.updated =ctx._now;ctx._source.type=params.type",
    "params": {
      "type": "ONBOARDING_EMPLOYEE_UPDATED"
    }
  }
}

or 

POST /on_boarding_employees-1/_bulk
{"update":{"_id":"1"}}
{"script":{"source":"ctx._source.updated =ctx._now;ctx._source.type=params.type","params":{"type":"ONBOARDING_EMPLOYEE_UPDATED"}}}
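If the scripted bulk update has to be generated for many documents, the body could be built along these lines. This is a sketch only: build_scripted_bulk_body is a hypothetical helper, and it reuses the script source from the workaround above. As far as I know ctx._now is an epoch-milliseconds value in the update script context, so the stored value may look different from the ISO-8601 string written by the pipeline; that is worth verifying against the field mapping.

```python
import json

# Script source taken from the workaround above.
SCRIPT_SOURCE = "ctx._source.updated = ctx._now; ctx._source.type = params.type"


def build_scripted_bulk_body(type_by_id):
    """Build an NDJSON _bulk body with one scripted update per document.

    `type_by_id` maps document id -> new value for the 'type' field.
    Hypothetical helper; it only builds the payload and sends nothing.
    """
    lines = []
    for doc_id, new_type in type_by_id.items():
        lines.append(json.dumps({"update": {"_id": doc_id}}))
        lines.append(json.dumps({
            "script": {"source": SCRIPT_SOURCE, "params": {"type": new_type}}
        }))
    # A bulk body must end with a newline.
    return "\n".join(lines) + "\n"


scripted_body = build_scripted_bulk_body({"1": "ONBOARDING_EMPLOYEE_UPDATED"})
print(scripted_body)
```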

@andrross @macohen @reta what do you think about this?

reta commented 2 days ago

Thanks @gaobinlong for looking into it

> From checking the code, I think ingest pipelines were designed only for index operations, not for update operations. The Index API supports the pipeline parameter but the Update API doesn't, so maybe we should prevent the default ingest pipeline from being executed in the Update API.

I found this long thread on the matter [1]. TL;DR: the Update API does not support ingest pipelines, so we should probably document that (and prevent it if possible).

[1] https://github.com/elastic/elasticsearch/issues/17895

gaobinlong commented 18 hours ago

Thanks @reta, I've created a documentation issue for this and will open a PR later.

For the code, does it make sense to return a deprecation warning in 2.x for the Update API and then remove the support in 3.0.0? It may be a breaking change for some users.

reta commented 6 hours ago

Thanks @gaobinlong

> Thanks @reta, I've created a documentation issue for this and will open a PR later.

:+1:

> For the code, does it make sense to return a deprecation warning in 2.x for the Update API and then remove the support in 3.0.0?

But this functionality does not work, does it?