opensearch-project / index-management

🗃 Automate periodic data operations, such as deleting indices at a certain age or performing a rollover at a certain size
https://opensearch.org/docs/latest/im-plugin/index/
Apache License 2.0
52 stars 107 forks

[FEATURE] Opt to run ISM jobs even on red cluster state #1127

Open aggarwalShivani opened 4 months ago

aggarwalShivani commented 4 months ago

Is your feature request related to a problem?

As mentioned in ISM docs and observed in my experiments,

ISM does not run jobs if the cluster state is red.

We use ISM policies (with the delete action) to regularly delete old indices from the cluster and free up disk space. In spite of regular cleanup, under high data ingestion the cluster can sometimes go red due to low disk availability. The problem in such cases is that, even though we have set up ISM policies, they do not run, and the cluster does not recover unless someone manually calls the DELETE REST API to clean up indices and restore the cluster state.

What solution would you like?

There could be a new optional parameter (e.g. allow_red_cluster), for all actions or at least for the delete action, that users could set if they want the specified ISM policies to run even on a red cluster.

mgodwan commented 3 months ago

@aggarwalShivani I think this sounds like a fair enough use case to allow within ISM policy executions. Would you be willing to raise a PR for this?

skumarp7 commented 2 months ago

Hi @mgodwan,

Can you assign this to me? I would like to work on this :)

mgodwan commented 2 months ago

Awesome @skumarp7 . Assigned to you.

skumarp7 commented 1 month ago

Hi @mgodwan ,

I was trying to add a new optional parameter (e.g. allow_red_cluster) to the delete operation specifically, at the action level, since a few operations such as shrink and force_merge might not make sense to run on a red cluster. But this is currently a problem: if the policy has more states, no transitions will happen on a red cluster, so the index never reaches the state where the delete operation is defined, even though the parameter is set to true.

Should we add the optional parameter at the top level of the policy, or do you have another suggestion?
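To illustrate the problem described above, here is a minimal Python sketch. It is a simplified, hypothetical model of a policy run, not the plugin's actual code: the function `run_once`, the dict-based policy layout, and the `allow_red_cluster` flag on the action are all assumptions made for illustration, and transition conditions are assumed to be already met.

```python
# Hypothetical, simplified model of one ISM job run. Not the plugin's real logic.
def run_once(policy, state_name, cluster_status):
    """Return the state after one run, or None if the index was deleted."""
    state = policy["states"][state_name]
    red = cluster_status == "red"

    for action in state["actions"]:
        # Assumed behaviour: on a red cluster only flagged actions may run.
        if red and not action.get("allow_red_cluster", False):
            return state_name  # run skipped, index stays in this state
        if action["type"] == "delete":
            return None  # index deleted

    if red:
        # Transitions are not evaluated on a red cluster, so the index is stuck.
        return state_name

    for transition in state["transitions"]:
        return transition["state_name"]  # conditions assumed to be met
    return state_name


# Two-state policy: "hot" transitions to "delete", which holds the flagged action.
policy = {
    "states": {
        "hot": {"actions": [], "transitions": [{"state_name": "delete"}]},
        "delete": {
            "actions": [{"type": "delete", "allow_red_cluster": True}],
            "transitions": [],
        },
    }
}

stuck_state = run_once(policy, "hot", "red")          # index never leaves "hot"
deleted_state = run_once(policy, "delete", "red")     # flag only helps once in "delete"
healthy_next = run_once(policy, "hot", "green")       # normal transition
```

In this model an index that is still in "hot" when the cluster turns red never reaches the "delete" state, so the action-level flag has no effect for it.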

aggarwalShivani commented 1 month ago

Hi @mgodwan and other maintainers, any suggestions on the question asked by @skumarp7?

In my view, it may be misleading to add the "allow_red_cluster" parameter at the policy level if not all actions support (or may support) running on a red cluster.

One alternative approach could be to add this parameter at the transitions level in the policy. For example:

"transitions": [
  {
    "state_name": "cold",
    "allow_red_cluster": true,
    "conditions": {
      "min_index_age": "30d"
    }
  }
]

This way, a user would have better control and could knowingly enable this flag only where required, i.e. only when transitioning to a next state that supports running on a red cluster.
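As a rough sketch of how such a transition-level flag might gate evaluation, consider the Python below. It is illustrative only: `next_state` and `parse_days` are hypothetical helpers, `allow_red_cluster` is the proposed parameter and not an existing one, and only day-based values such as "30d" are handled for min_index_age.

```python
# Hypothetical sketch of transition evaluation with the proposed flag.
def parse_days(value):
    """Parse a day-based age such as '30d' (other units are not handled here)."""
    if not value.endswith("d"):
        raise ValueError("only day-based ages are supported in this sketch")
    return int(value[:-1])


def next_state(transitions, cluster_status, index_age_days):
    """Return the first state the index may move to, or None to stay put."""
    for transition in transitions:
        # On a red cluster, only transitions that opted in are considered.
        if cluster_status == "red" and not transition.get("allow_red_cluster", False):
            continue
        min_age = transition.get("conditions", {}).get("min_index_age", "0d")
        if index_age_days >= parse_days(min_age):
            return transition["state_name"]
    return None


# Same shape as the transitions fragment above.
opted_in = [
    {
        "state_name": "cold",
        "allow_red_cluster": True,
        "conditions": {"min_index_age": "30d"},
    }
]
# Identical transition without the flag, i.e. today's default behaviour.
default = [{"state_name": "cold", "conditions": {"min_index_age": "30d"}}]
```

With these inputs, `next_state(opted_in, "red", 45)` yields "cold", while `next_state(default, "red", 45)` yields None because the transition did not opt in.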

dblock commented 2 weeks ago

Catch All Triage - 1 2 3 4 5

bowenlan-amzn commented 1 week ago

@aggarwalShivani I got a notification of the PR for this issue today.

One thing to keep in mind is that ISM relies on the cluster status to function properly (ISM indexes into and searches from an index that sits in the cluster). If the cluster itself is red, ISM assumes it cannot function properly. There is also a concern that running ISM operations on a red cluster could make the situation even worse.

The normal flow for an admin to handle this is to listen for the error notification, fix the red cluster problem, and retry the failed ISM jobs.

Thinking about the cause of your issue: high data ingestion causes higher disk utilization and leads to a red cluster. IMO, it would require some manual investigation anyway, or this could be a common pattern that can be planned for ahead of time by adding more storage. The manual investigation would ask: why did this spike happen and how big is it? How many old indices should I delete to balance it? Since ISM doesn't know this context, it cannot do the auto-recovery work well. So if the cluster is red and ISM somehow knows that the only problem is high disk utilization, then it probably makes sense to allow delete to run.
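The "how many old indices should I delete" question can be sketched as a simple oldest-first selection. The Python below is only an illustration of that manual reasoning: `indices_to_delete` is a hypothetical helper, and the index names, creation times, and sizes are made-up sample values, not data from a real cluster.

```python
# Hypothetical helper: pick the oldest indices until enough space is freed.
def indices_to_delete(indices, bytes_to_free):
    """indices: list of (name, creation_time, size_bytes) tuples."""
    chosen = []
    freed = 0
    # Oldest first, based on creation time.
    for name, _created, size in sorted(indices, key=lambda item: item[1]):
        if freed >= bytes_to_free:
            break
        chosen.append(name)
        freed += size
    return chosen


# Made-up sample data: (name, creation time, size in bytes).
sample = [("logs-3", 300, 50), ("logs-1", 100, 40), ("logs-2", 200, 30)]
to_delete = indices_to_delete(sample, 60)
```

Here the two oldest indices are selected, because the oldest alone (40 bytes) does not cover the 60 bytes requested.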

aggarwalShivani commented 1 week ago

Hi @bowenlan-amzn, we totally understand and agree with the concerns you've mentioned about running ISM on a red cluster.

That is why the proposal is to introduce an optional parameter (which would be false by default and would not affect the default behaviour). It would be the user's decision to enable this feature only for the required use cases, e.g. only deletion to reduce disk utilization. Also, this feature would help us achieve functional parity with Curator, which allows deletion of indices on a red cluster.

We have seen cases where users (admins) are generally not happy to run REST APIs manually to clean up old indices, even during red-cluster situations, and expect housekeeping utilities like Curator/ISM to manage the cleanup. This feature request is intended to provide that flexibility to the user, just in case.