opensearch-project / index-management

🗃 Automate periodic data operations, such as deleting indices at a certain age or performing a rollover at a certain size
https://opensearch.org/docs/latest/im-plugin/index/
Apache License 2.0

[BUG] Timeout on force_merge #1193

Open disaster37 opened 2 weeks ago

disaster37 commented 2 weeks ago

What is the bug? I have an ISM policy for a hot / warm / delete topology and use a data stream to ingest logs. In the warm state, I have a force_merge action with max_num_segments set to 1, but this action always ends in a timeout.

How can I find out why this step gets stuck and ends in a timeout?
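
One place to look is the ISM explain API (`GET _plugins/_ism/explain/<index>`), which reports the current state, action, step, and an info message for each managed index. Below is a minimal sketch for condensing such a response into one line per index. It assumes the response contains the per-index `action`, `step`, and `info` objects described in the ISM documentation. The helper `summarize_explain`, the index name, and the sample payload are hypothetical and were not captured from this cluster.

```python
from __future__ import annotations

from typing import Any


def summarize_explain(response: dict[str, Any]) -> list[str]:
    """Return one summary line per managed index in an ISM explain response."""
    lines = []
    for index_name, meta in response.items():
        # Skip top-level counters such as "total_managed_indices".
        if not isinstance(meta, dict):
            continue
        action = meta.get("action") or {}
        step = meta.get("step") or {}
        info = meta.get("info") or {}
        lines.append(
            f"{index_name}: action={action.get('name')} "
            f"step={step.get('name')} status={step.get('step_status')} "
            f"failed={action.get('failed')} message={info.get('message')}"
        )
    return lines


# Hypothetical payload (assumption), shaped like the documented explain output.
sample = {
    ".ds-logs-log-example-000001": {
        "action": {"name": "force_merge", "failed": True},
        "step": {"name": "wait_for_force_merge", "step_status": "timed_out"},
        "info": {"message": "example message"},
    },
    "total_managed_indices": 1,
}

for line in summarize_explain(sample):
    print(line)
```

The `info.message` field is usually the most useful part, since it carries the text the step wrote when it last ran.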

How can one reproduce the bug?

I have tested it on OpenSearch 2.14.0 (from a Docker container).

Here is my policy:

{
    "id": "policy-log",
    "seqNo": 177028,
    "primaryTerm": 10,
    "policy": {
        "policy_id": "policy-log",
        "description": "Policy for logs index",
        "last_updated_time": 1718199759103,
        "schema_version": 21,
        "error_notification": null,
        "default_state": "hot",
        "states": [
            {
                "name": "hot",
                "actions": [
                    {
                        "retry": {
                            "count": 3,
                            "backoff": "exponential",
                            "delay": "1m"
                        },
                        "rollover": {
                            "min_index_age": "1d",
                            "min_primary_shard_size": "5gb",
                            "copy_alias": false
                        }
                    }
                ],
                "transitions": [
                    {
                        "state_name": "warm",
                        "conditions": {
                            "min_index_age": "0d"
                        }
                    }
                ]
            },
            {
                "name": "warm",
                "actions": [
                    {
                        "retry": {
                            "count": 3,
                            "backoff": "exponential",
                            "delay": "1m"
                        },
                        "allocation": {
                            "require": {
                                "temp": "warm"
                            },
                            "include": {},
                            "exclude": {},
                            "wait_for": false
                        }
                    },
                    {
                        "retry": {
                            "count": 3,
                            "backoff": "exponential",
                            "delay": "1m"
                        },
                        "index_priority": {
                            "priority": 50
                        }
                    },
                    {
                        "timeout": "1d",
                        "retry": {
                            "count": 3,
                            "backoff": "exponential",
                            "delay": "1m"
                        },
                        "force_merge": {
                            "max_num_segments": 1
                        }
                    }
                ],
                "transitions": [
                    {
                        "state_name": "delete",
                        "conditions": {
                            "min_index_age": "1d"
                        }
                    }
                ]
            },
            {
                "name": "delete",
                "actions": [
                    {
                        "retry": {
                            "count": 3,
                            "backoff": "exponential",
                            "delay": "1m"
                        },
                        "delete": {}
                    }
                ],
                "transitions": []
            }
        ],
        "ism_template": [
            {
                "index_patterns": [
                    "logs-log-*"
                ],
                "priority": 100,
                "last_updated_time": 1718199759103
            }
        ]
    }
}
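
To make the timing settings in this policy easier to read, here is a small sketch that lists each action together with its explicit `timeout`, if one is set. The helper `action_timeouts` is hypothetical, and `policy_doc` is a trimmed copy of the policy above, keeping only the fields the helper reads.

```python
def action_timeouts(policy_doc):
    """Return (state, action type, timeout) for every action in an ISM policy."""
    rows = []
    for state in policy_doc["policy"]["states"]:
        for action in state["actions"]:
            # The action type is the key that is neither "retry" nor "timeout".
            kinds = [k for k in action if k not in ("retry", "timeout")]
            rows.append((state["name"], kinds[0], action.get("timeout")))
    return rows


# Trimmed copy of the policy shown above (retry blocks and settings omitted).
policy_doc = {
    "policy": {
        "states": [
            {"name": "hot", "actions": [{"rollover": {}}]},
            {
                "name": "warm",
                "actions": [
                    {"allocation": {}},
                    {"index_priority": {"priority": 50}},
                    {"timeout": "1d", "force_merge": {"max_num_segments": 1}},
                ],
            },
            {"name": "delete", "actions": [{"delete": {}}]},
        ]
    }
}

for state_name, kind, timeout in action_timeouts(policy_doc):
    print(f"{state_name:7} {kind:15} timeout={timeout}")
```

In this policy, only the force_merge action sets an explicit timeout (`1d`).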

What is the expected behavior?

The segments are merged down to 1, instead of the step getting stuck and failing with a timeout.

What is your host/environment?
