opensearch-project / index-management

🗃 Automate periodic data operations, such as deleting indices at a certain age or performing a rollover at a certain size
https://opensearch.org/docs/latest/im-plugin/index/
Apache License 2.0
53 stars 108 forks

[BUG] Action timed out and retry not consumed #1087

Open kksaha opened 5 months ago

kksaha commented 5 months ago

What is the bug? Several of my indices have failed with "Action timed out".

Here is my policy:

    "default_state": "hot",
    "states": [
      {
        "name": "hot",
        "actions": [
          {
            "timeout": "4h",
            "retry": { "count": 10, "backoff": "exponential", "delay": "20m" },
            "rollover": {
              "min_size": "90gb",
              "min_index_age": "7d",
              "min_primary_shard_size": "30gb",
              "copy_alias": false
            }
          }
        ],
        "transitions": [
          { "state_name": "snapshot", "conditions": { "min_rollover_age": "14d" } }
        ]
      },
      {
        "name": "snapshot",
        "actions": [
          {
            "timeout": "10h",
            "retry": { "count": 5, "backoff": "exponential", "delay": "2h" },
            "snapshot": { "repository": "KK_data_repository", "snapshot": "{{ctx.index}}" }
          }
        ],
        "transitions": [
          { "state_name": "delete" }
        ]
      },
      {
        "name": "delete",
        "actions": [
          {
            "retry": { "count": 100, "backoff": "exponential", "delay": "10m" },
            "delete": {}
          }
        ],
        "transitions": []
      }
    ],

Despite the timeout and retry configuration, several indices that use this policy failed with "Action timed out", and it looks like the system didn't perform any retries.

Here is the explain output:

    "state": { "name": "hot", "start_time": 1706450388940 },
    "action": {
      "name": "rollover",
      "start_time": 1706450636772,
      "index": 0,
      "failed": true,
      "consumed_retries": 0,
      "last_retry_time": 0
    },
    "step": {
      "name": "attempt_rollover",
      "start_time": 1706450636772,
      "step_status": "condition_not_met"
    },
    "retry_info": { "failed": false, "consumed_retries": 0 },
    "info": { "message": "Action timed out" }
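To find other indices in the same situation, the explain output above can be checked programmatically. The following is a minimal sketch, assuming the explain entry for one index has already been parsed into a Python dict with the fields shown above. The function name `timed_out_without_retry` is hypothetical and not part of the plugin.

```python
def timed_out_without_retry(explain: dict) -> bool:
    """True if the action failed with "Action timed out" and consumed no retries."""
    action = explain.get("action", {})
    info = explain.get("info", {})
    return (
        action.get("failed") is True
        and action.get("consumed_retries", 0) == 0
        and info.get("message") == "Action timed out"
    )


# Explain output copied from the report above.
sample = {
    "state": {"name": "hot", "start_time": 1706450388940},
    "action": {
        "name": "rollover",
        "start_time": 1706450636772,
        "index": 0,
        "failed": True,
        "consumed_retries": 0,
        "last_retry_time": 0,
    },
    "step": {
        "name": "attempt_rollover",
        "start_time": 1706450636772,
        "step_status": "condition_not_met",
    },
    "retry_info": {"failed": False, "consumed_retries": 0},
    "info": {"message": "Action timed out"},
}

print(timed_out_without_retry(sample))  # True
```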

What is your host/environment?

Do you have any screenshots?

[Screenshot 2024-02-01 at 3:04:10 PM]

kksaha commented 5 months ago

Can anyone please advise?

Juliaj commented 3 months ago

We're hitting this intermittently as well. From the output listed above:

"name": "hot", "start_time": 1706450388940 -> Sunday, January 28, 2024 1:59:48.940 PM (UTC)
"name": "attempt_rollover", "start_time": 1706450636772 -> Sunday, January 28, 2024 2:03:56.772 PM (UTC)
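The conversions above can be reproduced with a short script. This is a minimal sketch that treats the `start_time` values as epoch milliseconds in UTC; the helper name `epoch_ms_to_utc` is only for illustration.

```python
from datetime import datetime, timezone


def epoch_ms_to_utc(ms: int) -> datetime:
    # divmod keeps the millisecond part exact instead of going through a float.
    seconds, millis = divmod(ms, 1000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc).replace(
        microsecond=millis * 1000
    )


state_start = epoch_ms_to_utc(1706450388940)  # "hot" state start_time
step_start = epoch_ms_to_utc(1706450636772)   # "attempt_rollover" start_time

print(state_start.isoformat(timespec="milliseconds"))  # 2024-01-28T13:59:48.940+00:00
print(step_start.isoformat(timespec="milliseconds"))   # 2024-01-28T14:03:56.772+00:00
print(step_start - state_start)                        # 0:04:07.832000
```

The rollover step therefore started about four minutes after the index entered the "hot" state.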

"step_status": "condition_not_met" indicates that the rollover conditions had not been met, so the rollover should not have been triggered. But how is this tied to "Action timed out"?
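One way to approach that question is to compare the action's `timeout` with the rollover conditions in the policy. The sketch below assumes, without confirmation from the plugin source, that the timeout clock starts at the action's `start_time` and keeps running while the step reports `condition_not_met`. The helper `parse_duration` is hypothetical and only handles the `m`, `h`, and `d` suffixes used in this policy.

```python
from datetime import timedelta

# Map the suffixes used in the policy to timedelta keyword arguments.
_UNITS = {"m": "minutes", "h": "hours", "d": "days"}


def parse_duration(text: str) -> timedelta:
    """Parse strings such as "4h" or "7d" into a timedelta (illustrative only)."""
    value, unit = int(text[:-1]), text[-1]
    return timedelta(**{_UNITS[unit]: value})


action_timeout = parse_duration("4h")  # "timeout" on the rollover action
min_index_age = parse_duration("7d")   # "min_index_age" rollover condition

# ASSUMPTION: the timeout keeps counting while the rollover condition is unmet.
print(action_timeout < min_index_age)  # True
print(min_index_age - action_timeout)  # 6 days, 20:00:00
```

If that assumption holds, the 4h timeout would expire long before the 7d age condition could be satisfied, unless one of the size conditions is met first. That would be consistent with seeing `condition_not_met` alongside "Action timed out", but it remains a hypothesis to verify against the plugin's behavior.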

Juliaj commented 3 months ago

Found a previous issue related to this: https://github.com/opensearch-project/index-management/issues/315

dblock commented 4 weeks ago

Looks like this is still a problem/bug. Catch All Triage - 1 2 3 4 5