Open anshu1106 opened 2 weeks ago
Thanks @anshu1106 for filing this issue. It is an interesting one.
This looks like it is caused by dynamic mapping: the cluster state changes often, so many distinct ClusterState objects are present. One thing we should evaluate is whether, instead of passing the whole cluster state, we can pass a sub-object (indicesLookUp?) to the ConcreteIndices constructor, so that the retained heap stays small until the bulk request is processed.
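The idea above can be sketched roughly as follows. This is a minimal illustration, not OpenSearch code: FakeClusterState, WholeStateHolder and LookupOnlyHolder are hypothetical stand-ins, and the lookup is modelled as a plain sorted map. Whether the real lookup structure is small enough to make this worthwhile would still need to be measured.

```java
import java.util.SortedMap;
import java.util.TreeMap;

/** Hypothetical stand-ins only; none of these are the real OpenSearch classes. */
public class NarrowReferenceSketch {

    /** Stand-in for a cluster state: a small index lookup plus a large unrelated payload. */
    static final class FakeClusterState {
        final long version;
        final SortedMap<String, String> indicesLookup; // index name -> index UUID (illustrative)
        final byte[] everythingElse;                   // stands in for routing table, nodes, etc.

        FakeClusterState(long version, SortedMap<String, String> indicesLookup, int payloadBytes) {
            this.version = version;
            this.indicesLookup = indicesLookup;
            this.everythingElse = new byte[payloadBytes];
        }
    }

    /** Shape described in the report: the holder keeps the whole state reachable. */
    static final class WholeStateHolder {
        private final FakeClusterState state;

        WholeStateHolder(FakeClusterState state) {
            this.state = state;
        }

        String resolve(String index) {
            return state.indicesLookup.get(index);
        }

        /** Bytes of unrelated payload kept alive through this holder. */
        long pinnedPayloadBytes() {
            return state.everythingElse.length;
        }
    }

    /** Shape being proposed: the holder keeps only the lookup it needs. */
    static final class LookupOnlyHolder {
        private final SortedMap<String, String> lookup;

        LookupOnlyHolder(FakeClusterState state) {
            // Only the lookup is stored; the state object itself is not referenced afterwards.
            this.lookup = state.indicesLookup;
        }

        String resolve(String index) {
            return lookup.get(index);
        }
    }

    public static void main(String[] args) {
        SortedMap<String, String> lookup = new TreeMap<>();
        lookup.put("logs-1", "uuid-1");
        FakeClusterState state = new FakeClusterState(1L, lookup, 1024 * 1024);

        WholeStateHolder whole = new WholeStateHolder(state);
        LookupOnlyHolder narrow = new LookupOnlyHolder(state);

        // Both holders resolve the same way; only the first keeps the payload reachable.
        System.out.println(whole.resolve("logs-1").equals(narrow.resolve("logs-1")));
        System.out.println(whole.pinnedPayloadBytes());
    }
}
```

Both holders answer the same resolution question; the difference is only in what stays reachable while the request is pending.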
Describe the bug
While analyzing a heap dump taken on a domain with a large number of nodes and 200k shards, we found that ~14.7 GB of the 16.1 GB heap is retained by TransportResponseHandler objects. The dump is from a data node, and _bulk requests were running in the domain when the heap dump was captured.
Expanding TransportResponseHandler
On expanding a ConcurrentHashMap object, we found that a TransportBulkAction$ConcreteIndices instance retains ~63 MB, most of which is taken by ClusterState:
https://github.com/opensearch-project/OpenSearch/blob/5e72e1df6d14d1eb5e24385a2c9c6bca96066a5d/server/src/main/java/org/opensearch/action/bulk/TransportBulkAction.java#L829
The histogram below shows 215 ClusterState objects taking ~11 GB of heap.
The incoming object reference for most of the ClusterState objects is TransportBulkAction$ConcreteIndices.
There seems to be a bug in the TransportBulkAction path, which is creating new ClusterState objects rather than referencing a single one.
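A histogram like this does not necessarily mean the bulk path allocates the objects itself. The same picture could arise if each pending handler keeps a reference to whichever state was current when its request started. The sketch below illustrates that retention pattern under this assumption. All class and method names are hypothetical, and it is not the actual TransportBulkAction code.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical illustration of pending handlers pinning superseded state versions. */
public class PinnedStateSketch {

    /** Stand-in for an immutable cluster state snapshot. */
    static final class FakeClusterState {
        final long version;

        FakeClusterState(long version) {
            this.version = version;
        }
    }

    /** Stand-in for a response handler that captured the state at request start. */
    static final class PendingHandler {
        final FakeClusterState capturedState;

        PendingHandler(FakeClusterState capturedState) {
            this.capturedState = capturedState;
        }
    }

    /**
     * Registers one handler per request and publishes a new state version
     * every requestsPerStateUpdate requests. No handler is ever removed,
     * which models requests that are all still in flight.
     */
    static Map<Long, PendingHandler> simulate(int requests, int requestsPerStateUpdate) {
        Map<Long, PendingHandler> handlers = new ConcurrentHashMap<>();
        FakeClusterState current = new FakeClusterState(1L);
        for (long requestId = 0; requestId < requests; requestId++) {
            if (requestId > 0 && requestId % requestsPerStateUpdate == 0) {
                // e.g. a dynamic mapping update published a new cluster state
                current = new FakeClusterState(current.version + 1);
            }
            handlers.put(requestId, new PendingHandler(current));
        }
        return handlers;
    }

    /** Counts distinct state objects (by identity) still reachable from the handlers. */
    static int distinctPinnedStates(Map<Long, PendingHandler> handlers) {
        Set<FakeClusterState> seen =
            Collections.newSetFromMap(new IdentityHashMap<FakeClusterState, Boolean>());
        for (PendingHandler handler : handlers.values()) {
            seen.add(handler.capturedState);
        }
        return seen.size();
    }

    public static void main(String[] args) {
        // Frequent state updates: many versions stay reachable at once.
        System.out.println(distinctPinnedStates(simulate(10, 2)));
        // Rare state updates: all handlers share one state object.
        System.out.println(distinctPinnedStates(simulate(10, 100)));
    }
}
```

In this model the number of live state objects grows with the number of state updates that happen while requests are pending, not with the number of requests.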
Related component
Indexing:Performance
To Reproduce
Expected behavior
There should be at most 2 cluster state objects in the domain while updates are going on. TransportBulkAction should not create new ClusterState objects.
Additional Details