opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0
9.14k stars 1.69k forks

[BUG] Cluster Health API call can get tripped by circuit breaker #631

Open Bukhtawar opened 3 years ago

Bukhtawar commented 3 years ago

Describe the bug: When JVM memory pressure is high, calls to the cluster health API might fail with a 429 response:

[2021-04-05T17:37:46,637][INFO ][c.a.c.e.logger           ] [cc0fd770314ce44c33bedf35605e9c4d] GET /_cluster/health local=true 429 TOO_MANY_REQUESTS 865 1
[2021-04-05T17:37:46,631][INFO ][c.a.c.e.logger           ] [cc0fd770314ce44c33bedf35605e9c4d] GET /_cluster/health local=true 429 TOO_MANY_REQUESTS 865 0
[2021-04-05T17:37:44,838][INFO ][c.a.c.e.logger           ] [cc0fd770314ce44c33bedf35605e9c4d] GET /_cluster/health local=true 429 TOO_MANY_REQUESTS 865 0
[2021-04-05T17:37:44,838][INFO ][c.a.c.e.logger           ] [cc0fd770314ce44c33bedf35605e9c4d] GET /_cluster/health local=true 429 TOO_MANY_REQUESTS 865 0
{
    "error": {
        "root_cause": [
            {
                "type": "circuit_breaking_exception",
                "reason": "[parent] Data too large, data for [<http_request>] would be [2029039272/1.8gb], which is larger than the limit of [2023548518/1.8gb], real usage: [2029039272/1.8gb], new bytes reserved: [0/0b], usages [request=0/0b, fielddata=5285/5.1kb, in_flight_requests=0/0b, accounting=50225284/47.8mb]",
                "bytes_wanted": 2029039272,
                "bytes_limit": 2023548518,
                "durability": "PERMANENT"
            }
        ],
        "type": "circuit_breaking_exception",
        "reason": "[parent] Data too large, data for [<http_request>] would be [2029039272/1.8gb], which is larger than the limit of [2023548518/1.8gb], real usage: [2029039272/1.8gb], new bytes reserved: [0/0b], usages [request=0/0b, fielddata=5285/5.1kb, in_flight_requests=0/0b, accounting=50225284/47.8mb]",
        "bytes_wanted": 2029039272,
        "bytes_limit": 2023548518,
        "durability": "PERMANENT"
    },
    "status": 429
}
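To read the numbers in the response above, here is a minimal Python sketch that parses a trimmed copy of the error body and reports how far over the limit the request was. The helper name `summarize_breaker_error` and the trimmed `SAMPLE_BODY` are illustrative assumptions, not part of OpenSearch. The original `reason` string also reports `new bytes reserved: [0/0b]` with `real usage` equal to `bytes_wanted`, which suggests the heap was already over the parent limit before the request reserved anything.

```python
import json

# Trimmed copy of the error body from this report (the long "reason" strings are omitted).
SAMPLE_BODY = """
{
    "error": {
        "type": "circuit_breaking_exception",
        "bytes_wanted": 2029039272,
        "bytes_limit": 2023548518,
        "durability": "PERMANENT"
    },
    "status": 429
}
"""


def summarize_breaker_error(body):
    """Return the key numbers from a circuit_breaking_exception response body."""
    payload = json.loads(body)
    error = payload["error"]
    wanted = error["bytes_wanted"]
    limit = error["bytes_limit"]
    return {
        "status": payload["status"],
        "type": error["type"],
        "durability": error["durability"],
        # How many bytes the estimated usage exceeds the parent breaker limit by.
        "over_limit_bytes": wanted - limit,
    }


summary = summarize_breaker_error(SAMPLE_BODY)
print(summary["over_limit_bytes"])  # 5490754 bytes, roughly 5.2 MiB over the limit
```

The overshoot is small relative to the 1.8gb limit, so whether a given call trips the breaker may depend on heap usage at that instant.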

Expected behavior: Cluster health calls shouldn't be tripped by the circuit breaker, as they are important and informative and represent the state of the system.

tlfeng commented 3 years ago

Hi @Bukhtawar,

Could you explain more about how to reproduce the issue? It looks like this was fixed in Elasticsearch 5.0 (https://github.com/elastic/elasticsearch/commit/f32b70047241fe319cb37047cc2a47d1b56da6e1). In addition, requests to / are also whitelisted from circuit breaking exceptions in Elasticsearch 6.5 (https://github.com/elastic/elasticsearch/commit/027a22abf9684897a81e6ca2216dd38214fb8021).

During my own testing, I didn't see the "Cluster Health API" call get tripped by the circuit breaker. My steps:

  1. Start OpenSearch beta1 on Ubuntu with default settings.
  2. Set the parent circuit breaker to a low limit: curl -XPUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d '{"persistent" : {"indices.breaker.total.limit" : "5%"}}'
  3. Check the heap usage: curl "localhost:9200/_cat/nodes?h=heap*&v". The response contained a "circuit_breaking_exception".
  4. Check the cluster health: curl "localhost:9200/_cluster/health?pretty". This returned the desired response without error.
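The steps above can be sketched as a small Python script using only the standard library. This is a rough sketch, assuming a local cluster at `http://localhost:9200` without security enabled; the helper names `build_limit_request`, `fetch`, and `run_repro` are hypothetical and only mirror the curl commands listed in the steps.

```python
import json
import urllib.error
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: local node, no TLS or auth


def build_limit_request(limit="5%", base_url=BASE_URL):
    """Build the PUT request that lowers the parent circuit breaker limit (step 2)."""
    body = json.dumps({"persistent": {"indices.breaker.total.limit": limit}})
    return urllib.request.Request(
        f"{base_url}/_cluster/settings",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def fetch(request_or_url):
    """Send a request and return (status, body), including for HTTP error responses."""
    try:
        with urllib.request.urlopen(request_or_url, timeout=10) as resp:
            return resp.status, resp.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        # A tripped breaker is reported as an HTTP error (429 in this report).
        return err.code, err.read().decode("utf-8")


def run_repro(base_url=BASE_URL):
    """Run steps 2 to 4 in order and collect each (status, body) pair."""
    return {
        "set_limit": fetch(build_limit_request(base_url=base_url)),
        "cat_nodes": fetch(f"{base_url}/_cat/nodes?h=heap*&v"),
        "health": fetch(f"{base_url}/_cluster/health?pretty"),
    }
```

Calling `run_repro()` against a running node returns the status code of each call, which makes it easy to compare the `_cat/nodes` result with the `_cluster/health` result. Resetting `indices.breaker.total.limit` afterwards is left out of the sketch.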

anshul291995 commented 3 years ago

Looking into reproducing this issue. Will update.

dblock commented 3 years ago

@anshul291995 @Bukhtawar any updates here, what should we do with this?

Bukhtawar commented 3 years ago

We'll need to try to reproduce this. I'll see if I can pick it up; help from any community member would be greatly appreciated too.

reta commented 2 years ago

@Bukhtawar @dblock would you mind if I try to reproduce and (hopefully) fix it? thanks

reta commented 2 years ago

So far I can confirm @tlfeng's findings: this is not reproducible for /_cluster/health. The health checks are configured to bypass all circuit breakers, and this applies to both REST and transport actions. Certainly more details would help:

dblock commented 2 years ago

> @Bukhtawar @dblock would you mind if I try to reproduce and (hopefully) fix it? thanks

No need to ask for permission! Thank you for contributing.

minalsha commented 2 years ago

@Bukhtawar could you please help with the details that @reta is seeking? Thanks

Bukhtawar commented 2 years ago

I'll see if I can repro.

anasalkouz commented 2 years ago

Closing this issue. @Bukhtawar, please feel free to reopen in case you are able to reproduce it.

rramachand21 commented 2 months ago

Reopening as this is an issue that needs to be fixed.

andrross commented 2 months ago

[Triage - attendees 1 2 3 4] @rramachand21 Do you have any additional information about reproducing this? The findings above suggest that this API should be configured to bypass all circuit breakers.
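The bypass mentioned in the findings above can be pictured as a per-handler flag that the dispatcher consults before checking the parent breaker. The following is a toy Python model, not OpenSearch code: `ParentBreaker`, `Handler`, `dispatch`, and the `can_trip_breaker` flag are all hypothetical names chosen for illustration, and the byte counts are taken from the error in the original report.

```python
from dataclasses import dataclass


class CircuitBreakingError(Exception):
    """Raised by the toy breaker when usage exceeds its limit."""


@dataclass
class ParentBreaker:
    limit_bytes: int
    used_bytes: int

    def check(self, label):
        # Trip when the tracked usage is already above the configured limit.
        if self.used_bytes > self.limit_bytes:
            raise CircuitBreakingError(
                f"[parent] Data too large, data for [{label}] would be "
                f"[{self.used_bytes}], which is larger than the limit of "
                f"[{self.limit_bytes}]"
            )


@dataclass
class Handler:
    path: str
    can_trip_breaker: bool = True  # hypothetical opt-out flag


def dispatch(handler, breaker):
    """Return an HTTP-like status; handlers that opt out skip the breaker check."""
    if handler.can_trip_breaker:
        try:
            breaker.check("<http_request>")
        except CircuitBreakingError:
            return 429
    return 200


breaker = ParentBreaker(limit_bytes=2023548518, used_bytes=2029039272)
health_status = dispatch(Handler("/_cluster/health", can_trip_breaker=False), breaker)
nodes_status = dispatch(Handler("/_cat/nodes"), breaker)
print(health_status, nodes_status)  # 200 429
```

In this model a 429 on `/_cluster/health` could only happen if the request passed through a path that does not consult the flag, which is the kind of detail a reproduction would need to pin down.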