opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0

[Enhancement] Optimize FieldDataCache removal flow #13862

Open sgup432 opened 3 months ago

sgup432 commented 3 months ago

Is your feature request related to a problem? Please describe

FieldDataCache is a node-level cache and uses a composite key (IndexFieldCache + shardId + IndexReader.CacheKey) to uniquely identify an item. IndexFieldCache additionally carries the field name and index-level details.

As of today, any item in the FieldDataCache is removed in a blocking/sync manner. This happens in the below scenarios:

1. Invalidation: During a refresh of any IndexShard, the desired key is immediately invalidated in a sync manner here.
2. Index deletion: During removal of an index, all related IndexShard fields are removed, also in a sync manner (see this). Here we iterate over ALL cache keys, check whether each key belongs to this index, and only then delete it. Highly inefficient.
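To make the cost of the index-deletion path concrete, here is a minimal sketch contrasting the full-scan removal described above with a targeted removal that tracks keys per index. The `Key` record and both maps are illustrative stand-ins, not the actual IndicesFieldDataCache types:

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical shapes: Key stands in for the real composite
// (IndexFieldCache + shardId + IndexReader.CacheKey).
public class FieldDataCacheSketch {
    record Key(String index, int shardId, long readerCacheKey) {}

    final Map<Key, Object> cache = new ConcurrentHashMap<>();
    // Secondary index (index name -> its keys) that would enable
    // targeted removal instead of a full scan.
    final Map<String, Set<Key>> keysByIndex = new ConcurrentHashMap<>();

    void put(Key key, Object value) {
        cache.put(key, value);
        keysByIndex.computeIfAbsent(key.index(), i -> ConcurrentHashMap.newKeySet()).add(key);
    }

    // Current behavior: walk ALL cache keys, O(total cache size),
    // even when only one index's entries must go.
    int clearIndexByFullScan(String index) {
        int removed = 0;
        for (Key k : cache.keySet()) {
            if (k.index().equals(index) && cache.remove(k) != null) removed++;
        }
        return removed;
    }

    // Targeted removal: touch only this index's keys,
    // O(keys belonging to the index).
    int clearIndexTargeted(String index) {
        Set<Key> keys = keysByIndex.remove(index);
        if (keys == null) return 0;
        int removed = 0;
        for (Key k : keys) {
            if (cache.remove(k) != null) removed++;
        }
        return removed;
    }
}
```

With a large cache and a small index, the targeted path avoids almost all of the traversal that shows up in the hot-thread dump below.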

Problem

We have already had issues where, during index removal, a data node dropped out of the cluster because a lot of time/CPU was taken up in clearing the FieldDataCache. Scenario (observed in production): the cluster manager node sends a cluster state update task to a data node on index removal. The data node starts clearing the field data cache on the same cluster state applier (clusterApplierService#updateTask) thread, taking a long time (due to the large cache size and the inefficient traversal of all keys) and failing to acknowledge back to the cluster manager node. This eventually resulted in the data node being removed from the cluster.

Sample hot thread dump observed

100.4% (501.8ms out of 500ms) cpu usage by thread 'opensearch[46a4b13b820a8bcf60ac8f1de15cee14][clusterApplierService#updateTask][T#1]'
     10/10 snapshots sharing following 22 elements
       app//org.opensearch.indices.fielddata.cache.IndicesFieldDataCache$IndexFieldCache.clear(IndicesFieldDataCache.java:219)
       app//org.opensearch.index.fielddata.IndexFieldDataService.clear(IndexFieldDataService.java:107)
       app//org.opensearch.index.fielddata.IndexFieldDataService.close(IndexFieldDataService.java:179)
       app//org.opensearch.core.internal.io.IOUtils.close(IOUtils.java:87)
       app//org.opensearch.core.internal.io.IOUtils.close(IOUtils.java:129)
       app//org.opensearch.core.internal.io.IOUtils.close(IOUtils.java:79)
       app//org.opensearch.index.IndexService.close(IndexService.java:362)
       app//org.opensearch.indices.IndicesService.removeIndex(IndicesService.java:887)
       app//org.opensearch.indices.cluster.IndicesClusterStateService.removeIndices(IndicesClusterStateService.java:410)
       app//org.opensearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:255)
       app//org.opensearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:591)
       app//org.opensearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:578)
       app//org.opensearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:546)
       app//org.opensearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:469)
       app//org.opensearch.cluster.service.ClusterApplierService.access$000(ClusterApplierService.java:81)
       app//org.opensearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:180)

Describe the solution you'd like

As a solution, I suggest we do the following:

1. Cache clear/invalidation logic should be done in an async manner, as we do in RequestCache. On any index removal, we hold all such stale indices in a list and eventually clean them up on a separate thread at some interval X.
2. During any item removal in the cache, multiple removal listeners are invoked in a sync manner to update stats. We should also try doing this asynchronously on a different thread.
3. We should avoid going through all the entries in the cache even though we only need to delete entries for one or a few indices. This can be done either by deleting the specific entries for an index (calling invalidateAll for a set of keys) or by delaying the cleanup until a later point where we delete entries for multiple indices at once.
4. Consider integrating FieldDataCache with TieredCache (heap + disk)/DiskCache, considering the current default on-heap cache size is unlimited and only controlled by the field data circuit breakers. This would free up a lot of heap.
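Point 1 above can be sketched roughly as follows. This is an illustrative shape only (the class and method names are made up, not OpenSearch APIs): the applier thread merely records the stale index, and a background thread drains the set at a fixed interval, swallowing cleanup failures since the index is already gone from the node:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Sketch of async stale-index cleanup, in the spirit of how
// RequestCache handles stale keys. All names here are hypothetical.
public class StaleIndexCleaner {
    private final Set<String> staleIndices = ConcurrentHashMap.newKeySet();
    private final ScheduledExecutorService cleaner =
        Executors.newSingleThreadScheduledExecutor();

    // Called from the cluster applier thread: O(1), never blocks
    // on a cache traversal.
    public void markStale(String index) {
        staleIndices.add(index);
    }

    // Drains the stale set on a background thread every intervalMillis,
    // invoking the supplied per-index clearing logic.
    public void start(long intervalMillis, Consumer<String> clearIndex) {
        cleaner.scheduleAtFixedRate(() -> {
            for (var it = staleIndices.iterator(); it.hasNext(); ) {
                String index = it.next();
                it.remove();
                try {
                    clearIndex.accept(index);
                } catch (RuntimeException e) {
                    // Swallow: the index is already removed from the node;
                    // worst case is stale entries lingering in the cache.
                }
            }
        }, intervalMillis, intervalMillis, TimeUnit.MILLISECONDS);
    }

    public void stop() {
        cleaner.shutdownNow();
    }
}
```

The key property is that `markStale` is the only work done on the applier thread, so acknowledgement of the cluster state is never held up by cache size.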

Related component

Search:Performance

Describe alternatives you've considered

No response

Additional context

No response

jainankitk commented 3 months ago

@sgup432 - Thank you for describing the issue and the potential solution. A few considerations:

jainankitk commented 3 months ago

@sohami @msfroh @Bukhtawar - Is there any specific reason for field data cache cleanup happening on the cluster applier thread?

sgup432 commented 3 months ago

> Currently, the field data cache cleanup is happening on the cluster applier thread. This will prevent the data node from acknowledging the new cluster state version in case of failure. While this does seem a bit extreme and undesirable, I am not sure how we handle failures in case of async cleanup.

Yeah, even I thought about it. I didn't see any reason why we should fail the data node's cluster state update logic during index removal in case the underlying FieldDataCache cleanup fails. Considering the index was already removed from the node, the worst case would be stale entries lying in the cache if we just swallowed the fieldCache cleanup exception.

So doing this in an async manner shouldn't be a problem as such, IMO. But I would like to hear other opinions as well, as I might be wrong in my understanding.

Plus, it seems the field data cache cleanup won't likely throw any exception, considering the internal cache.invalidate is safe even if a key is not present.

Bukhtawar commented 3 months ago

> @sohami @msfroh @Bukhtawar - Is there any specific reason for field data cache cleanup happening on the cluster applier thread?

Any mutating operation on cluster state, including index creation/deletion, needs to be applied to the local data structures of the node applying the state. Generally these appliers are processed sequentially and in a blocking manner to ensure that all local structures are successfully refreshed before a cluster state commit acknowledgement can be sent back. That said, I agree certain operations like field data cache eviction aren't super-critical and don't require the node to hold back the acknowledgement. Also see https://github.com/opensearch-project/OpenSearch/issues/9996, where an index deletion cleanup took long enough that it caused node drops.

I think we need to look at this problem holistically and start with a mechanism like soft-eviction or soft-deletes, which just marks the entry as deleted or stale while the actual clean-up happens in the background.
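The soft-delete idea above could look roughly like the sketch below (a generic illustrative wrapper, not an OpenSearch type): the applier thread only flips a flag, reads treat flagged entries as absent, and a background sweep later does the actual removal (and would fire removal listeners at that point):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical soft-evicting cache wrapper illustrating the
// mark-then-sweep approach.
public class SoftEvictingCache<K, V> {
    static final class Entry<V> {
        final V value;
        final AtomicBoolean deleted = new AtomicBoolean(false);
        Entry(V value) { this.value = value; }
    }

    private final Map<K, Entry<V>> map = new ConcurrentHashMap<>();

    public void put(K key, V value) {
        map.put(key, new Entry<>(value));
    }

    // Fast path for the applier thread: mark, don't remove.
    public void softDelete(K key) {
        Entry<V> e = map.get(key);
        if (e != null) e.deleted.set(true);
    }

    // Reads treat soft-deleted entries as already gone.
    public V get(K key) {
        Entry<V> e = map.get(key);
        return (e == null || e.deleted.get()) ? null : e.value;
    }

    // Background sweep actually frees the entries; this is where
    // removal listeners and stats updates would run.
    public int sweep() {
        int removed = 0;
        for (var it = map.entrySet().iterator(); it.hasNext(); ) {
            if (it.next().getValue().deleted.get()) {
                it.remove();
                removed++;
            }
        }
        return removed;
    }
}
```

The trade-off is that soft-deleted entries keep occupying heap (and count against the circuit breaker) until the sweep runs, so the sweep interval matters.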

peternied commented 3 months ago

[Triage - attendees 1 2 3 4 5] @sgup432 Thanks for creating this issue, looking forward to a pull request to improve OpenSearch.