opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0

[BUG] `delete_by_query` (or `update_by_query`, etc.) with slices creates too many scroll contexts #16289

Open msfroh opened 1 week ago

msfroh commented 1 week ago

Describe the bug

I was contacted by a user of Amazon OpenSearch Service who manages a multi-tenant cluster. When a tenant leaves the cluster, their data is deleted using a `delete_by_query`.

In order to complete this in a timely manner, they've tried setting a slice count greater than 1, to let multiple threads drive the deletion. Unfortunately, each slice opens and holds a scroll context on every shard. With something like 36 shards and 10 slices, this ends up being 360 open scroll contexts.

The cluster imposes a default max limit of 500 open scroll contexts, so this is consuming most of them. If two tenants leave at the same time, there aren't enough available scroll contexts to go around.
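The multiplication above can be sketched as a quick check (numbers taken from this report; 500 is the default cap set by the `search.max_open_scroll_context` setting):

```python
# Each slice of a sliced scroll holds a scroll context on every shard,
# so open contexts scale multiplicatively with shards and slices.
def open_scroll_contexts(shards: int, slices: int) -> int:
    return shards * slices

# Numbers from this report: 36 shards x 10 slices.
print(open_scroll_contexts(36, 10))  # 360

# Two tenants leaving at once would need 720 contexts,
# exceeding the default cluster-wide cap of 500.
print(open_scroll_contexts(36, 10) * 2 > 500)  # True
```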

Related component

Search:Query Capabilities

To Reproduce

  1. Set up a cluster with (say) 51 shards.
  2. Add a bunch of data.
  3. Execute a delete_by_query with 10 slices.
  4. Get an error saying `Trying to create too many scroll contexts. Must be less than or equal to: [500]`
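For reference, the request in step 3 looks roughly like this; here as a small helper that builds the path and body (index name and tenant field are illustrative, but `slices` and `wait_for_completion` are real `_delete_by_query` parameters):

```python
import json

# Illustrative builder for the reproduction request: a sliced
# _delete_by_query matching a departing tenant's documents.
def build_delete_by_query(index: str, userid: int, slices: int):
    path = f"/{index}/_delete_by_query?slices={slices}&wait_for_completion=false"
    body = json.dumps({"query": {"term": {"userid": userid}}})
    return path, body

path, body = build_delete_by_query("tenant-data", 12345, 10)
print(path)  # /tenant-data/_delete_by_query?slices=10&wait_for_completion=false
```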

Expected behavior

We should be able to delete or update a lot of data without needing to open so many scroll contexts.

One option that I've considered is allowing users to run these operations without transactional semantics. Even though a scroll processes shards one at a time, any scrolled search starts by opening and registering a scroll context on every shard, and those contexts are held until the scroll is done.

Why? Say you're querying data starting at 1pm and it takes you 5 minutes to get through the first shard. You will only see documents published before 1pm on the first shard (since it's using one IndexReader for the whole traversal). If you then open a new IndexReader (wrapped in a scroll context) on the second shard, it will have documents published up to 1:05pm. Opening all the scroll contexts at once is how we pretend to be a transactional system. (Of course, the scroll contexts aren't literally created all at once, because network calls take non-zero time, so the consistency is an illusion anyway.)

If I want to delete all documents where userid=12345, I don't necessarily care if it means "Delete all documents published prior to this instant (give or take network lag fanning out to shard) with userid=12345" or "Delete all documents where userid=12345 and if new ones get added while you're deleting, maybe delete those too".

So, maybe this is more of a feature request -- I would like delete_by_query to have an argument that disables (approximate) point-in-time consistency. Let it open a scroll on one shard, rip through it deleting matching documents, then open a scroll on the next shard, and so on.
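A minimal sketch of the proposed non-transactional traversal, with in-memory stand-ins for shards and scroll contexts (none of these names are OpenSearch APIs). The point is that at most one shard's scroll context is open at any moment, so the peak is 1 per slice instead of shards x slices:

```python
# Illustrative model: each shard is a list of docs; a "scroll" is just an
# iterator over one shard. Only one shard is traversed at a time, so a
# single scroll context suffices for the whole operation.
def delete_by_query_shard_by_shard(shards, predicate):
    deleted = 0
    for shard in shards:             # open a scroll on this shard only
        survivors = []
        for doc in shard:            # rip through the shard
            if predicate(doc):
                deleted += 1         # matching doc: delete it
            else:
                survivors.append(doc)
        shard[:] = survivors         # scroll context closed here
    return deleted

shards = [
    [{"userid": 12345}, {"userid": 7}],
    [{"userid": 12345}],
]
n = delete_by_query_shard_by_shard(shards, lambda d: d["userid"] == 12345)
print(n)       # 2
print(shards)  # [[{'userid': 7}], []]
```

Documents indexed into a later shard while an earlier shard is being processed might or might not be deleted, which is exactly the relaxed semantics described above.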

Additional Details

Plugins: N/A

Screenshots: N/A

Host/Environment: N/A

Additional context: None

sandeshkr419 commented 1 week ago

[Search Triage] - @bowenlan-amzn Do you have any thoughts on this?