Open dzane17 opened 2 weeks ago
:x: Gradle check result for e3db94865a88a3f1529340066d76eb83a7a6dfd9: FAILURE
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
Thanks @dzane17 , this looks pretty interesting. Are you tracking the investigation of this in another GitHub issue? I'm curious which search phase specifically you're seeing this one since having negative time values anywhere is quite concerning.
In particular this block of code comes to mind, where for scroll requests fetch phase can be performed before query phase is completed https://github.com/opensearch-project/OpenSearch/blob/456ca97f077929ba51d53dc8578b9b770e3a2417/server/src/main/java/org/opensearch/search/SearchService.java#L836-L872
and since the current
is decremented on phase end there might be an issue where we are decrementing before incrementing.
:white_check_mark: Gradle check result for b72b74d4fd044894ecd67be2692e76512c014160: SUCCESS
Attention: Patch coverage is 70.00000%
with 3 lines
in your changes missing coverage. Please review.
Project coverage is 72.10%. Comparing base (
5909e1a
) to head (b72b74d
).
Files with missing lines | Patch % | Lines |
---|---|---|
...org/opensearch/index/search/stats/SearchStats.java | 70.00% | 2 Missing and 1 partial :warning: |
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
@jed326 Here is the github issue: https://github.com/opensearch-project/OpenSearch/issues/16598
Actually, current
is a request counter. We have not yet seen the time statistic go negative. In either case though it is concerning.
Regarding this code block in SearchService
, we will need to check exactly where in the search path it is invoked. The coordinator phase listener hooks (onPhaseStart, onPhaseEnd, onPhaseFailure), where search stats are updated, are in AbstractSearchAsyncAction
.
I am wondering if we should have generic way of ignoring negative values (writeUnsignedLong), especially for stat type of use cases instead of handling it every place separately. Missing any occurrence can result in node drops which is never desirable IMO. @jed326 - Thoughts?
@jainankitk I'm actually a little apprehensive about ignoring the negative values at all. This PR is adding negative checking to both count
and timeInMillis
, both of which should not be possible to be negative. These numbers being negative is indicative of some other underlying problem, or at the very least indicates we have an incorrect assumption about the code flow somewhere. By ignoring the negative values we wouldn't be able to catch any such issues in the future as there wouldn't be any symptoms for any such issues.
:x: Gradle check result for e48ffba1abc996ec376a1c5e42a6af51f8bc54fb: FAILURE
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
These numbers being negative is indicative of some other underlying problem, or at the very least indicates we have an incorrect assumption about the code flow somewhere. By ignoring the negative values we wouldn't be able to catch any such issues in the future as there wouldn't be any symptoms for any such issues.
+1
I suspect a situation where both failure and success code are called (with rare failures) but wasn't able to find it in about an hour of searching.
Description
We identified a bug where negative search stats are causing an exception when processed using
VLong
in the following line:https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/index/search/stats/SearchStats.java#L89
This issue occurs because
VLong
does not handle negative values, resulting in an exception.Solution
We have implemented a temporary fix that adds a check for negative values before calling
writeVLong
. If a negative value is encountered, a warning is logged, and0
is written instead of the negative value:This fix addresses the immediate problem but does not solve the root cause. We will continue to investigate why search stats are negative in the first place.
Related Issues
https://github.com/opensearch-project/OpenSearch/issues/16598
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check here.