opensearch-project / index-management

🗃 Automate periodic data operations, such as deleting indices at a certain age or performing a rollover at a certain size
https://opensearch.org/docs/latest/im-plugin/index/
Apache License 2.0

Previous action was not able to update IndexMetaData #33

Open adityaj1107 opened 3 years ago

adityaj1107 commented 3 years ago

Issue by arnitolog, Tuesday Nov 26, 2019 at 08:05 GMT. Originally opened as https://github.com/opendistro-for-elasticsearch/index-management/issues/116



Note: Please read this reply to understand the reason for this error.


Hello, I noticed that several indexes have the status Failed with the error: "Previous action was not able to update IndexMetaData". I think it happens after data nodes restart, but I'm not sure. Is there any way to configure an automatic retry for such errors? My policy is below:

{
    "policy": {
        "policy_id": "ingest_policy",
        "description": "Default policy",
        "last_updated_time": 1574686046552,
        "schema_version": 1,
        "error_notification": null,
        "default_state": "ingest",
        "states": [
            {
                "name": "ingest",
                "actions": [],
                "transitions": [
                    {
                        "state_name": "search",
                        "conditions": {
                            "min_index_age": "4d"
                        }
                    }
                ]
            },
            {
                "name": "search",
                "actions": [
                    {
                        "timeout": "2h",
                        "retry": {
                            "count": 5,
                            "backoff": "constant",
                            "delay": "1h"
                        },
                        "force_merge": {
                            "max_num_segments": 1
                        }
                    }
                ],
                "transitions": [
                    {
                        "state_name": "delete",
                        "conditions": {
                            "min_index_age": "30d"
                        }
                    }
                ]
            },
            {
                "name": "delete",
                "actions": [
                    {
                        "timeout": "2h",
                        "retry": {
                            "count": 5,
                            "backoff": "constant",
                            "delay": "1h"
                        },
                        "delete": {}
                    }
                ],
                "transitions": []
            }
        ]
    }
}

adityaj1107 commented 3 years ago

Comment by dbbaughe Tuesday Nov 26, 2019 at 18:26 GMT


Hi @arnitolog,

At which action or step is the error occurring?

That error is from: https://github.com/opendistro-for-elasticsearch/index-management/blob/4eea94fe30627c461f84a86815c30a63e5ab8d20/src/main/kotlin/com/amazon/opendistroforelasticsearch/indexstatemanagement/ManagedIndexRunner.kt#L265

Which basically means that one of the executions attempted to "START" the step being executed but was never able to finish it. This can happen if your data nodes restart in the middle of that execution period.

We currently don't have an automatic retry for this specific part because we don't know whether the step finished or not. If the step is non-idempotent, we don't want to retry it blindly, which is why we turn it over to the user to handle.

With that in mind, we could definitely add automatic retries for things that are idempotent/safe (like checking conditions for transitioning) to eliminate the majority of cases where this can happen.

adityaj1107 commented 3 years ago

Comment by arnitolog Wednesday Nov 27, 2019 at 06:19 GMT


Hi @dbbaughe, this can happen at different steps. I saw this error on the "ingest" step (which is the first one) and on "search" (which is the second).

It would be good to have some retry mechanism for such cases; the less manual work the better.

adityaj1107 commented 3 years ago

Comment by dbbaughe Friday May 08, 2020 at 02:00 GMT


Some improvements that have been added to help with this:

https://github.com/opendistro-for-elasticsearch/index-management/pull/165 https://github.com/opendistro-for-elasticsearch/index-management/pull/209

We have a few further ideas that we will track in: https://github.com/opendistro-for-elasticsearch/index-management/issues/207

adityaj1107 commented 3 years ago

Comment by gittygoo Tuesday Jun 30, 2020 at 15:19 GMT


This is still happening on the opendistro 1.8.0 release. Strangely enough, a lot of them just stay on "running"/"attempting to transition" in ISM.

adityaj1107 commented 3 years ago

Comment by dbbaughe Tuesday Jun 30, 2020 at 15:46 GMT


Hey @gittygoo,

Are you using this plugin independently or using ODFE 1.8? What does your cluster setup look like? Are the "Attempting to transition/Running" statuses stuck even though the conditions are met? If so, what are those conditions? Can you check whether your cluster's pending tasks are backed up: GET /_cluster/pending_tasks
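For example, with curl against a local node (the host, port, and lack of auth here are assumptions; adjust for your cluster):

curl -s 'http://localhost:9200/_cluster/pending_tasks?pretty'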

Thanks

adityaj1107 commented 3 years ago

Comment by gittygoo Tuesday Jun 30, 2020 at 17:50 GMT


@dbbaughe it's an internal cluster with 2 nodes, using Opendistro 1.8.

The policy looks like this; it should rotate them daily until deletion... so yes, the conditions are met:

{
    "policy": {
        "policy_id": "default_ism_policy",
        "description": "Default policy",
        "last_updated_time": 1590706756863,
        "schema_version": 1,
        "error_notification": null,
        "default_state": "hot",
        "states": [
            {
                "name": "hot",
                "actions": [],
                "transitions": [
                    {
                        "state_name": "warm",
                        "conditions": {
                            "min_index_age": "1d"
                        }
                    }
                ]
            },
            {
                "name": "warm",
                "actions": [],
                "transitions": [
                    {
                        "state_name": "cold",
                        "conditions": {
                            "min_index_age": "2d"
                        }
                    }
                ]
            },
            {
                "name": "cold",
                "actions": [],
                "transitions": [
                    {
                        "state_name": "delete",
                        "conditions": {
                            "min_index_age": "3d"
                        }
                    }
                ]
            },
            {
                "name": "delete",
                "actions": [
                    {
                        "delete": {}
                    }
                ],
                "transitions": []
            }
        ]
    }
}

The pending tasks are empty:

{"tasks":[]}

adityaj1107 commented 3 years ago

Comment by dbbaughe Tuesday Jun 30, 2020 at 18:03 GMT


Hi @gittygoo,

A few things to check:

Thanks

adityaj1107 commented 3 years ago

Comment by gittygoo Tuesday Jun 30, 2020 at 18:18 GMT


So here is an example:

Anything else I should check?

adityaj1107 commented 3 years ago

Comment by dbbaughe Tuesday Jun 30, 2020 at 18:47 GMT


@gittygoo, you can try setting the log level to debug and see if any logs pop up. Otherwise we can try to jumpstart the job scheduler and see if it starts working again. The job scheduler plugin reschedules a job when either the job document is updated or the shard moves to a different node and needs to be rescheduled on the new node. So you can either manually move the .opendistro-ism-config index shards to a different node to force it, or manually update the managed_index documents in that index (probably something like changing enabled to false and back to true). Unfortunately, we don't have an API to forcefully reschedule jobs; it can be something we take as an action item to add.
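A rough sketch of the manual-update option as console requests (the document layout of .opendistro-ism-config is an assumption based on my reading; inspect a real document first — the index name and <doc_id> are placeholders, and note this is a system index):

GET .opendistro-ism-config/_search
{
    "query": { "term": { "managed_index.index": "my-index-000001" } }
}

POST .opendistro-ism-config/_update/<doc_id>
{ "doc": { "managed_index": { "enabled": false } } }

POST .opendistro-ism-config/_update/<doc_id>
{ "doc": { "managed_index": { "enabled": true } } }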

adityaj1107 commented 3 years ago

Comment by gittygoo Tuesday Jun 30, 2020 at 20:35 GMT


The way I connect the indexes to the ISM template is via the index templates. So can I assume that removing all the current "Managed Indices" and then waiting 3 more days to see if the rotations went fine would achieve the same as your "jumpstart" idea, since the new indexes would automatically be assigned that policy based on their names? If so, I will proceed to delete them and wait.

adityaj1107 commented 3 years ago

Comment by dbbaughe Tuesday Jun 30, 2020 at 22:44 GMT


If you removed the current policy_ids from the indices, it would delete the internal jobs (Managed Indices). Then you could try re-adding them to those indices and see if it goes through. I'm not sure I followed the "waiting 3 more days" part.
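For reference, a minimal sketch using the ISM remove/add APIs (ODFE paths; the log-* pattern is a placeholder, and the policy_id matches the policy above):

POST _opendistro/_ism/remove/log-*

POST _opendistro/_ism/add/log-*
{
    "policy_id": "default_ism_policy"
}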

adityaj1107 commented 3 years ago

Comment by OrangeTimes Wednesday Jul 01, 2020 at 11:00 GMT


We are experiencing the same issues

adityaj1107 commented 3 years ago

Comment by dbbaughe Wednesday Jul 01, 2020 at 16:00 GMT


Hi @OrangeTimes,

The same issue as in "Previous action not able to update IndexMetaData", or similar to gittygoo, where the jobs don't appear to be running anymore?

Can you also give a bit more information about your cluster setup (ODFE vs. Amazon ES, which version, number of nodes, etc.) and any more details about the issue you're experiencing?

adityaj1107 commented 3 years ago

Comment by OrangeTimes Friday Jul 03, 2020 at 12:53 GMT


@dbbaughe similar to gittygoo. Some indices are in the Active state and some in the Failed state. Our index management page looks pretty much the same.

adityaj1107 commented 3 years ago

Comment by samling Tuesday Jul 07, 2020 at 18:58 GMT


Experiencing the same issue here, though possibly partly of our own doing. We switched to ODFE last night and blanket-applied a policy to our existing indices, then very quickly decided to apply a different policy instead. This morning I checked the indices and about 90% of them show "Previous action was not able to update IndexMetaData", with the last action being Force Merge. I tried retrying the failed step, but that didn't work; now I'm trying to remove the policy altogether and reapply it to try to jog the index.

Edit: This didn't work either, nor did retrying the policy from a specified state. Any more suggestions to debug or jog things are appreciated, as we're now stuck with quite a lot of indices in this failed state.
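For anyone else hitting this, the retry attempts above go through the ISM retry API, roughly like this (ODFE path; the index name and state are placeholders — the first form retries the failed step, the second retries from a specified state):

POST _opendistro/_ism/retry/my-index-000001

POST _opendistro/_ism/retry/my-index-000001
{
    "state": "warm"
}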

Here's a little more info on our setup: ODFE v1.8.0, 7 nodes (6 hot, 1 cold). Our policy transitions indices to the cold node first, in a warm state, after 2 days, then to a cold state after either a week or a month depending on the policy. During the warm phase the indices are force-merged, replicas removed, made read-only, and reallocated, in that order.

Not sure if removing and attaching a different policy before the first one completed is what broke things, but whatever the cause, I've not yet been able to fix them. Happy to provide any additional information.

StefanSa commented 2 years ago

Hi there, we have encountered the same problem with a data stream index.

{
    "policy_id": "filebeat",
    "description": "data stream filebeat",
    "last_updated_time": 1640865313616,
    "schema_version": 12,
    "error_notification": null,
    "default_state": "hot",
    "states": [
        {
            "name": "hot",
            "actions": [
                {
                    "rollover": {
                        "min_size": "50gb",
                        "min_index_age": "30d"
                    }
                }
            ],
            "transitions": [
                {
                    "state_name": "cold"
                }
            ]
        },
        {
            "name": "cold",
            "actions": [
                {
                    "read_only": {}
                },
                {
                    "index_priority": {
                        "priority": 0
                    }
                },
                {
                    "force_merge": {
                        "max_num_segments": 1
                    }
                }
            ],
            "transitions": [
                {
                    "state_name": "delete",
                    "conditions": {
                        "min_index_age": "90d"
                    }
                }
            ]
        },
        {
            "name": "delete",
            "actions": [
                {
                    "delete": {}
                }
            ],
            "transitions": []
        }
    ],
    "ism_template": [
        {
            "index_patterns": [
                "filebeat-*"
            ],
            "priority": 50,
            "last_updated_time": 1639485301867
        }
    ]
}

Any help here?

dbbaughe commented 2 years ago

@StefanSa What version of the plugin and cluster are you running?

StefanSa commented 2 years ago

@dbbaughe Hi, the last official OpenSearch version, 1.2.3. I can't tell you at the moment which plugin version is actually installed there, because I don't have access to the system.

PS: 👍 What a great shepherd, I had one of those once.

dbbaughe commented 2 years ago

@StefanSa Got it, so this error basically means that ISM could not successfully update the metadata after it performed an action (force merge in this case). For most actions, which are idempotent, it's fine if the metadata update failed, because we can just execute the action again. But for force merge, attempting the action again will actually queue up another force merge, which is not idempotent. So if our metadata update failed and we don't know whether the force merge succeeded or failed, we enter this failed-to-update-metadata state and require the user to check it out and retry if that's what they want (or skip to the next state, or something else).

As for why this happens, you would have to check the logs. The version you're on should have the metadata stored in the index, so if you can find the timestamp for when it failed (it should be available on the history document if you have that enabled; I don't remember off the top of my head whether it's in the explain response), you can check the logs around that time and see what the error was (i.e., why the index write failed). Perhaps it's something that can be fixed (is the cluster underscaled, did the index temporarily have a block on it, or did the cluster have a global block for some reason like disk space?), or it was just a transient failure.
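For reference, the explain API is a quick way to see each managed index's current state and step status (OpenSearch path shown; on ODFE the prefix is _opendistro/_ism, and the index pattern is a placeholder matching the ism_template above):

GET _plugins/_ism/explain/filebeat-*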

PS: Thanks! She's half german shepherd, half siberian husky!

Zhangxunmt commented 2 years ago

Is there a final summary or conclusion for this issue? We have customers complaining about the same errors.

rishabhmaurya commented 1 year ago

The same thing happens with the attemptRollover step, which is not idempotent. Ideally it should be retried instead of disabling the managed index config and skipping the step: https://github.com/opensearch-project/index-management/blob/19fc44b62274e6a3f403b7fbc1207a42e2ca76b1/src/main/kotlin/org/opensearch/indexmanagement/indexstatemanagement/step/rollover/AttemptRolloverStep.kt#L272

https://github.com/opensearch-project/index-management/blob/19fc44b62274e6a3f403b7fbc1207a42e2ca76b1/src/main/kotlin/org/opensearch/indexmanagement/indexstatemanagement/ManagedIndexRunner.kt#L368