opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/

CircuitBreakingException could provide more information to what's actually too large #1661

Open dblock opened 2 years ago

dblock commented 2 years ago

Is your feature request related to a problem? Please describe.

A cluster was hitting a circuit breaker.

RemoteTransportException[[...][ip:9300][cluster:monitor/nodes/info[n]]]; nested: CircuitBreakingException[[parent] Data too large, data for [cluster:monitor/nodes/info[n]] would be [2061357276/1.9gb], which is larger than the limit of [2023548518/1.8gb], real usage: [2061355280/1.9gb], new bytes reserved: [1996/1.9kb], usages [request=0/0b, fielddata=0/0b, in_flight_requests=1996/1.9kb, accounting=194546028/185.5mb]];
Caused by: CircuitBreakingException[[parent] Data too large, data for [cluster:monitor/nodes/info[n]] would be [2061357276/1.9gb], which is larger than the limit of [2023548518/1.8gb], real usage: [2061355280/1.9gb], new bytes reserved: [1996/1.9kb], usages [request=0/0b, fielddata=0/0b, in_flight_requests=1996/1.9kb, accounting=194546028/185.5mb]]
        at org.opensearch.indices.breaker.HierarchyCircuitBreakerService.checkParentLimit(HierarchyCircuitBreakerService.java:359)
        at org.opensearch.common.breaker.ChildMemoryCircuitBreaker.addEstimateBytesAndMaybeBreak(ChildMemoryCircuitBreaker.java:122)
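For context on why a 1.9kb reservation trips the breaker: the parent check adds the bytes a request wants to reserve to the real heap usage and compares the sum against the parent limit, so any request can be the one that pushes an already-full heap over the line. A minimal sketch of that logic, with illustrative names rather than the actual HierarchyCircuitBreakerService code:

// Simplified, illustrative sketch of the parent breaker check; names and the exception
// type are placeholders, not the real HierarchyCircuitBreakerService implementation.
final class ParentBreakerSketch {
    private final long parentLimitBytes; // e.g. 2023548518 (1.8gb) in the report above

    ParentBreakerSketch(long parentLimitBytes) {
        this.parentLimitBytes = parentLimitBytes;
    }

    void checkParentLimit(long realHeapUsageBytes, long newBytesReserved, String label) {
        long wouldBe = realHeapUsageBytes + newBytesReserved;
        if (wouldBe > parentLimitBytes) {
            // Produces the "[parent] Data too large ..." style message: even a tiny
            // reservation (1996 bytes above) trips when the heap is already near the limit.
            throw new IllegalStateException("[parent] Data too large, data for [" + label
                + "] would be [" + wouldBe + "], which is larger than the limit of ["
                + parentLimitBytes + "]");
        }
    }
}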

After taking heap dumps, the suspect was:

67 instances of org.opensearch.index.shard.IndexShard, loaded by jdk.internal.loader.ClassLoaders$AppClassLoader @ 0x8a660000 occupy 1,263,928,920 (63.97%) bytes.

It seems that the number of segments was huge, and maintaining metadata for all of those segments was consuming too much memory.

Can the error message just tell us that?

Describe the solution you'd like

In this case the error message should say that the number of segments is so large that their metadata no longer fits in memory, and recommend troubleshooting steps.
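As a hypothetical illustration of what such a message could look like, the per-breaker usage figures that already appear in the exception could be turned into a hint that names the dominant child breaker and suggests a next step. The helper below is an assumption for illustration, not existing OpenSearch code:

import java.util.Map;

// Hypothetical helper: build a troubleshooting hint from the per-breaker usage figures
// that are already included in the CircuitBreakingException message. Not OpenSearch code.
final class BreakerHintSketch {
    static String hint(Map<String, Long> childUsageBytes) {
        return childUsageBytes.entrySet().stream()
            .max(Map.Entry.comparingByValue())
            .map(top -> "accounting".equals(top.getKey())
                // The accounting breaker tracks Lucene segment metadata, so a large value
                // here points at segment count; suggest reducing it (e.g. force merge).
                ? "largest contributor is the accounting breaker (" + top.getValue()
                    + " bytes of segment metadata); consider reducing the number of segments"
                : "largest contributor is the [" + top.getKey() + "] breaker at "
                    + top.getValue() + " bytes")
            .orElse("");
    }

    public static void main(String[] args) {
        // Figures (in bytes) taken from the report above.
        System.out.println(hint(Map.of(
            "request", 0L,
            "fielddata", 0L,
            "in_flight_requests", 1996L,
            "accounting", 194546028L)));
    }
}

Note that in this particular report the accounting figure (185.5mb) is far below what the heap dump attributes to segment metadata, which is the discrepancy Bukhtawar calls out below, so any hint of this kind depends on the accounting breaker being accurate.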

Bukhtawar commented 2 years ago

I had opened something similar a while back https://github.com/elastic/elasticsearch/issues/58404

meghasaik commented 2 years ago

I will look into this.

meghasaik commented 2 years ago

As mentioned above, the information provided right now is not very specific: it only says that the data is too large. One approach is to use the token bucket algorithm to track, as a percentage, how often requests trip the breaker. The doubts I had were:

→ Does this trip-rate approach actually tell us that the number of segments is high and segment memory is large?

→ Here, we would change the durability from transient to permanent if the trip-count percentage exceeds the limit; if the breaker trips but the percentage does not exceed the limit, do we just follow the previous approach?

Bukhtawar commented 2 years ago

There are two things here, and feel free to split this into separate issues:

  1. Derived durability of the circuit breaker: TRANSIENT vs PERMANENT, based on the token bucket algorithm as in https://github.com/opensearch-project/OpenSearch/blob/996d33adb22ac6a523962166f3677f8dc6ac9c1f/server/src/main/java/org/opensearch/script/ScriptCache.java#L165-L202 (a rough sketch follows after this list)
  2. The segment memory is captured as part of the accounting circuit breaker (185.5mb in this case). We would need to figure out the discrepancy in the computation of segment memory:
    Caused by: CircuitBreakingException[[parent] Data too large, data for [cluster:monitor/nodes/info[n]] would be [2061357276/1.9gb], which is larger than the limit of [2023548518/1.8gb], real usage: [2061355280/1.9gb], new bytes reserved: [1996/1.9kb], usages [request=0/0b, fielddata=0/0b, in_flight_requests=1996/1.9kb, accounting=194546028/185.5mb]]
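As a rough sketch of point 1, loosely modelled on the ScriptCache rate limiter linked above: keep a token bucket of recent trips and classify the breaker's condition as PERMANENT once trips are sustained rather than occasional. All names, thresholds, and the refill policy below are assumptions for illustration, not OpenSearch code:

// Illustrative only: derive a durability signal from how often the breaker trips,
// using a token bucket loosely modelled on the ScriptCache compilation rate limiter.
final class TripDurabilitySketch {
    enum Durability { TRANSIENT, PERMANENT }

    private final double maxTokens;     // bucket capacity, i.e. trips tolerated per window
    private final double refillPerNano; // tokens restored per nanosecond
    private double tokens;
    private long lastTripNanos;

    TripDurabilitySketch(int tripsPerMinute) {
        this.maxTokens = tripsPerMinute;
        this.refillPerNano = tripsPerMinute / 60_000_000_000.0;
        this.tokens = maxTokens;
        this.lastTripNanos = System.nanoTime();
    }

    // Called on every breaker trip; returns how that trip should be classified.
    synchronized Durability onTrip() {
        long now = System.nanoTime();
        tokens = Math.min(maxTokens, tokens + (now - lastTripNanos) * refillPerNano);
        lastTripNanos = now;
        if (tokens >= 1.0) {
            tokens -= 1.0; // occasional trip: the bucket still has tokens
            return Durability.TRANSIENT;
        }
        // Sustained tripping has drained the bucket: treat the condition as persistent,
        // e.g. segment metadata that will not go away until segments are merged or dropped.
        return Durability.PERMANENT;
    }
}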
meghasaik commented 2 years ago

Thank you Bukhtawar for spelling this out. As mentioned, this can be a meta issue broken into two parts:

  1. Investigating how the segment memory is computed and what is causing the discrepancy in that computation.
  2. Using the token bucket code referenced above, fixing the durability derivation.