opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0
9.79k stars 1.82k forks source link

[BUG] Negative Search Stats #16598

Open dzane17 opened 1 week ago

dzane17 commented 1 week ago

Describe the bug

We have identified a bug where search statistics are being incorrectly reported as negative, which is causing failures when accessing the nodes/stats API. This issue is specifically related to the handling of negative values in the SearchStats class. The exception occurs when the API attempts to serialize the statistics, causing a java.lang.IllegalStateException due to the use of VLong encoding, which does not support negative values.

Example Response from nodes/stats API:

{
    "_nodes": {
        "total": 6,
        "successful": 5,
        "failed": 1,
        "failures": [
            {
                "type": "failed_node_exception",
                "reason": "Failed node [gxu1I2AFSo2Co-gL3Cq14A]",
                "node_id": "gxu1I2AFSo2Co-gL3Cq14A",
                "caused_by": {
                    "type": "illegal_state_exception",
                    "reason": "Negative longs unsupported, use writeLong or writeZLong for negative numbers [-3]"
                }
            }
        ]
    },
...

Exception in elasticsearch.log:

[2024-10-16T19:55:29,039][WARN ][o.o.t.OutboundHandler    ] [8da3ab95536131cd3252613bab350bd7] send message failed [channel: Netty4TcpChannel{localAddress=/172.16.29.155:9300, remoteAddress=/172.16.29.122:40652}]
java.lang.IllegalStateException: Negative longs unsupported, use writeLong or writeZLong for negative numbers [-140]
    at org.opensearch.common.io.stream.StreamOutput.writeVLong(StreamOutput.java:307)
    at org.opensearch.index.search.stats.SearchStats$PhaseStatsLongHolder.writeTo(SearchStats.java:88)
    at org.opensearch.index.search.stats.SearchStats$Stats.lambda$writeTo$0(SearchStats.java:393)
    at org.opensearch.common.io.stream.StreamOutput.writeMap(StreamOutput.java:637)
    at org.opensearch.index.search.stats.SearchStats$Stats.writeTo(SearchStats.java:390)
    at org.opensearch.index.search.stats.SearchStats.writeTo(SearchStats.java:588)
    at org.opensearch.common.io.stream.StreamOutput.writeOptionalWriteable(StreamOutput.java:959)
    at org.opensearch.action.admin.indices.stats.CommonStats.writeTo(CommonStats.java:273)
    at org.opensearch.indices.NodeIndicesStats.writeTo(NodeIndicesStats.java:207)
    at org.opensearch.action.admin.cluster.node.stats.NodeStats.writeTo(NodeStats.java:429)
    at org.opensearch.transport.OutboundMessage.writeMessage(OutboundMessage.java:104)
    at org.opensearch.transport.OutboundMessage.serialize(OutboundMessage.java:81)
    at org.opensearch.transport.OutboundHandler$MessageSerializer.get(OutboundHandler.java:234)
    at org.opensearch.transport.OutboundHandler$MessageSerializer.get(OutboundHandler.java:220)

Root Cause

The issue occurs due to the use of StreamOutput.writeVLong() in the SearchStats class, which does not support negative long values. The stack trace shows that when a negative value is encountered, it triggers the exception in elasticsearch.log shown above.

The problematic line in the code is:

out.writeVLong(current);

Source: SearchStats.java#L89

Related component

Search

To Reproduce

This bug does not occur consistently across all domains or search queries. We are still investigating the specific conditions or types of searches that trigger this issue.

Expected behavior

Search stats should not be negative.

Additional Details

Plugins Please list all plugins currently enabled.

Screenshots If applicable, add screenshots to help explain your problem.

Host/Environment (please complete the following information):

Additional context Add any other context about the problem here.

dbwiddis commented 6 days ago

Probably introduced in https://github.com/opensearch-project/OpenSearch/pull/8386

Similar issue https://github.com/opensearch-project/OpenSearch/issues/13343

dbwiddis commented 6 days ago

The current stat can be decremented in two places, on phase end (success): https://github.com/opensearch-project/OpenSearch/blob/e68838819710d7040cf2b591590285f1b86f0da0/server/src/main/java/org/opensearch/action/search/SearchRequestStats.java#L82

and on phase failure: https://github.com/opensearch-project/OpenSearch/blob/e68838819710d7040cf2b591590285f1b86f0da0/server/src/main/java/org/opensearch/action/search/SearchRequestStats.java#L89

I looked through and didn't find any cases where they might both be called.

However, I did notice that current stat is stored in an unsynchronized map: https://github.com/opensearch-project/OpenSearch/blob/e68838819710d7040cf2b591590285f1b86f0da0/server/src/main/java/org/opensearch/index/search/stats/SearchStats.java#L120

And the key is just the phase name, which are common strings: https://github.com/opensearch-project/OpenSearch/blob/e68838819710d7040cf2b591590285f1b86f0da0/server/src/main/java/org/opensearch/action/search/SearchPhaseName.java#L19-L25

But digging in to the increments and decrements they use atomic operations so that shouldn't be an issue...