Open sam-herman opened 1 year ago
Tagging the meta issue https://github.com/opensearch-project/OpenSearch/issues/8098 for better visibility.
I can work on this
@samuel-oci
Thanks for putting up this thought, this is interesting as it'll help in overall resiliency of this particular flow. Currently we're also trying to reduce the size of cache itself by going deep one level down and removing unnecessary data present in cache like number of DiscoveryNode objects as @Bukhtawar pointed out about meta issue for improving this area.
Quick question
Cache will have limits and will start evict older entries when reaching the limits
We're still clearing the cache after shard is started. And it makes sense to keep ShardEntry in cache to avoid re-fetching the metadata from data nodes. Evicting an entry for which shard initialize is not called yet will require us to do async fetch again for that shard. Want to understand your thoughts on this.
Hey @amkhar that's correct understanding of the proposal. You are correct to point out that without sufficient memory for cache there will be repeated asyncFetch requests to retrieve the same entry. This for sure is a tradeoff proposed in the event where the cache overfill the heap. I believe @gtahhan had done some experimentation and found that eviction penalty of refetching is manageable. Perhaps he can provide more details.
Hi @gtahhan are you still working on this ?
@gtahhan Please let us know in case you are still working on this.
Hi @rwali-aws @amkhar sorry I didnt get the chance to work on it as of now... if its urgent, please you can un assign it from myself... Thank you and sorry again
Describe the bug Currently AsyncShardFetch requests triggered during reroute in the GatewayAllocator have a cache that is not bounded and eventually can cause the ClusterManager to run out of heap memory when clusters with multiple nodes and shards are starting up after a ClusterManager restart.
To Reproduce When provisioning large clusters with multiple nodes and shards you can take a heap dump and observe that the GatewayAllocator retain most of the heap.
Expected behavior