opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0
9.84k stars 1.83k forks source link

Add check and handle negative SearchRequestStats #16569

Open dzane17 opened 2 weeks ago

dzane17 commented 2 weeks ago

Description

We identified a bug where negative search stats are causing an exception when processed using VLong in the following line:

https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/index/search/stats/SearchStats.java#L89

out.writeVLong(current);

This issue occurs because VLong does not handle negative values, resulting in an exception.

Solution

We have implemented a temporary fix that adds a check for negative values before calling writeVLong. If a negative value is encountered, a warning is logged, and 0 is written instead of the negative value:

            if (current < 0) {
                logger.warn("current is negative: {}", current);
                out.writeVLong(0);
            } else {
                out.writeVLong(current);
            }

This fix addresses the immediate problem but does not solve the root cause. We will continue to investigate why search stats are negative in the first place.

Related Issues

https://github.com/opensearch-project/OpenSearch/issues/16598

Check List

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check here.

github-actions[bot] commented 2 weeks ago

:x: Gradle check result for e3db94865a88a3f1529340066d76eb83a7a6dfd9: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

jed326 commented 2 weeks ago

Thanks @dzane17 , this looks pretty interesting. Are you tracking the investigation of this in another GitHub issue? I'm curious which search phase specifically you're seeing this one since having negative time values anywhere is quite concerning.

jed326 commented 2 weeks ago

In particular this block of code comes to mind, where for scroll requests fetch phase can be performed before query phase is completed https://github.com/opensearch-project/OpenSearch/blob/456ca97f077929ba51d53dc8578b9b770e3a2417/server/src/main/java/org/opensearch/search/SearchService.java#L836-L872

and since the current is decremented on phase end there might be an issue where we are decrementing before incrementing.

github-actions[bot] commented 2 weeks ago

:white_check_mark: Gradle check result for b72b74d4fd044894ecd67be2692e76512c014160: SUCCESS

codecov[bot] commented 2 weeks ago

Codecov Report

Attention: Patch coverage is 70.00000% with 3 lines in your changes missing coverage. Please review.

Project coverage is 72.10%. Comparing base (5909e1a) to head (b72b74d).

Files with missing lines Patch % Lines
...org/opensearch/index/search/stats/SearchStats.java 70.00% 2 Missing and 1 partial :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #16569 +/- ## ============================================ + Coverage 72.08% 72.10% +0.02% + Complexity 65099 65079 -20 ============================================ Files 5315 5315 Lines 303572 303581 +9 Branches 43925 43927 +2 ============================================ + Hits 218817 218892 +75 + Misses 66861 66730 -131 - Partials 17894 17959 +65 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

dzane17 commented 2 weeks ago

@jed326 Here is the github issue: https://github.com/opensearch-project/OpenSearch/issues/16598

Actually, current is a request counter. We have not yet seen the time statistic go negative. In either case though it is concerning.

Regarding this code block in SearchService, we will need to check exactly where in the search path it is invoked. The coordinator phase listener hooks (onPhaseStart, onPhaseEnd, onPhaseFailure), where search stats are updated, are in AbstractSearchAsyncAction.

jed326 commented 1 week ago

I am wondering if we should have generic way of ignoring negative values (writeUnsignedLong), especially for stat type of use cases instead of handling it every place separately. Missing any occurrence can result in node drops which is never desirable IMO. @jed326 - Thoughts?

@jainankitk I'm actually a little apprehensive about ignoring the negative values at all. This PR is adding negative checking to both count and timeInMillis, both of which should not be possible to be negative. These numbers being negative is indicative of some other underlying problem, or at the very least indicates we have an incorrect assumption about the code flow somewhere. By ignoring the negative values we wouldn't be able to catch any such issues in the future as there wouldn't be any symptoms for any such issues.

github-actions[bot] commented 1 week ago

:x: Gradle check result for e48ffba1abc996ec376a1c5e42a6af51f8bc54fb: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

dbwiddis commented 1 week ago

These numbers being negative is indicative of some other underlying problem, or at the very least indicates we have an incorrect assumption about the code flow somewhere. By ignoring the negative values we wouldn't be able to catch any such issues in the future as there wouldn't be any symptoms for any such issues.

+1

I suspect a situation where both failure and success code are called (with rare failures) but wasn't able to find it in about an hour of searching.