opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0
9.53k stars 1.75k forks source link

Support timeout based search request cancellation #817

Open sohami opened 3 years ago

sohami commented 3 years ago

Is your feature request related to a problem? Please describe. Currently the optional "timeout" parameter in the SearchRequest applies to the individual child shard level search requests not at the parent search request. The shard request to a node is sent in multiple batches based on "maxConcurrentRequestsPerNode" parameter. So if a search request results in sending N such batches, the parent request timeout will essentially be N*batchNumber. Also the timeout is only honored in query phase and not in Fetch phase. If there is a long running search for which client doesn't want to wait for the result anymore, they have to use the task API to cancel such request. In cases, when user doesn't initiate the cancellation the previous search will still be consuming the cluster resources until it completes.

Describe the solution you'd like The proposal is to have a separate parameter in search request like "cancel_after_timeinterval" which can be set by the user both at request level and at cluster level. Based on this new parameter, after the timeout expiry the search requests will be cancelled automatically using the cancellation framework. This will help: 1) to reduce the wasted resource usage. 2) automatic cancellation mechanism for the search request, where client doesn't have to explicitly use the task API to cancel it.

Describe alternatives you've considered 1) Use the existing timeout parameter in search request instead of new parameter. This may create confusion for current users because of change in behavior and may break certain client applications. Current timeout behavior is to return partial results, however with cancellation no results will be returned since fetch phase will error out for cancelled search requests.

Additional context Add any other context or screenshots about the feature request here.

Bukhtawar commented 3 years ago

I guess we should also have a mechanism to support partial results we have gotten before the timeout unless allow_partial_search_results is set to false

sohami commented 3 years ago

@Bukhtawar: Cancellation mechanism will be helpful to terminate the workload in events when it is of no use to the client and it wants to stop the workload from consuming any more resources ASAP. If client submits a cancellation request for a task externally using task API then also the search request is failed in fetch phase irrespective of allow_partial_search_results flag. I would prefer to keep the behavior same in both the use cases.

To provide more context, the allow_partial_search_results flag is only honored in query phase after getting results (docId) from all or subset of the shards. If the search request is cancelled before fetch phase then even send of fetch request from coordinator to shards will be failed with task cancellation exception.

AmiStrn commented 3 years ago

I was looking into a similar task recently, I support this requirement 100%. What do we need to get started on this? @sohami are you in the process of writing the code for this yet?

sohami commented 3 years ago

@AmiStrn - yes. I was waiting on some initial feedback. Will raise the PR as soon as I am done

sohami commented 3 years ago

@AmiStrn - I am dividing the change into 2 separate PRs. First one introduces the request level parameter support. Second one will consume the parameter to schedule cancellation task (WIP). Would be great if you can help review this.

AmiStrn commented 3 years ago

Would be great if you can help review this.

Gladly:)

Bukhtawar commented 3 years ago

@sohami I don't think size 0 aggregations have a fetch phase. Does partial results make sense here

I guess it should work out of the box since we plan on simply cancelling

sohami commented 3 years ago

@Bukhtawar - Yes in cases when there is no fetch phase and some of the shards successfully completed the execution before cancellation, then partial results will be returned if allow_partial_search_results is set to true. But note that the cancellation on timeout is basically an intent from client that it is not waiting on any results anymore and just wants the request to terminate. In case client wants to get the partial results, then current search timeout parameter should be used. That will enforce timeout only in search phase and irrespective of query type with size >=0, it will return the partial results.

AmiStrn commented 3 years ago

@dblock how do we get assignee's from the maintainers to review and approve PR's such as https://github.com/opensearch-project/OpenSearch/pull/986 ? (I had commented on the PR, but have not approved since I'm not a maintainer)

edit: didn't notice that there is quite some lag with the PRs :) Please let us know what to expect in terms of a timeline on this.

dblock commented 3 years ago

@dblock how do we get assignee's from the maintainers to review and approve PR's such as #986 ? (I had commented on the PR, but have not approved since I'm not a maintainer) edit: didn't notice that there is quite some lag with the PRs :) Please let us know what to expect in terms of a timeline on this.

Can't promise an SLA rn, but we do also have some automation that is nagging maintainers that PRs are open for longer than we would like without action. For now, if you feel that no traction has been had on an issue, feel free to tag me and I'll go find someone to review.

dblock commented 3 years ago

edit: didn't notice that there is quite some lag with the PRs :) Please let us know what to expect in terms of a timeline on this.

While you're not a maintainer you did a solid review of that PR, so nobody (including myself) felt the need to jump in. Thank you. I will click buttons after the next iteration to get more of the tests to run, and take a closer look at the code as well if needed.

dblock commented 3 years ago

https://github.com/opensearch-project/OpenSearch/pull/986 implements much of this, leaving to @sohami to close when you feel like it does everything you want or close and open smaller issues for the max_-related business.

kkewwei commented 5 months ago

I guess we should also have a mechanism to support partial results we have gotten before the timeout unless allow_partial_search_results is set to false

@sohami @AmiStrn we also have the same need: OS should returns partial results after the timeout in coordinate node. The ideal situation is that the coordinate node returns partial results(if allow_partial_search_results is true) and send cancel request when timeout. If possible, I would like to try to implement it.

kkewwei commented 2 months ago

@sohami, @AmiStrn please have a look, whether it should be supported, I'm pleasure to implement it.

AmiStrn commented 2 months ago

i think it should be supported. sorry i was not more clear about that, i did a 👍 on your commend in agreement with it. I think @sohami is the one to say if they are not going to work on this. @sohami are you actively working on this task?