opensearch-project / index-management

🗃 Automate periodic data operations, such as deleting indices at a certain age or performing a rollover at a certain size
https://opensearch.org/docs/latest/im-plugin/index/
Apache License 2.0

[BUG] ISM force_merge on datastream index #1255

Open disaster37 opened 2 months ago

disaster37 commented 2 months ago

What is the bug?

On OpenSearch 2.16.0

I created an ISM policy with a force_merge step that should reduce each shard to one segment after the data stream backing index has rolled over and moved to a warm node. The step always ends in a timeout. After setting the ISM log level to DEBUG, I get the following log:

{"type": "json_logger", "timestamp": "2024-09-16T14:04:13,248Z", "level": "DEBUG", "component": "o.o.i.i.s.f.WaitForForceMergeStep", "cluster.name": "logmanagement2-rec", "node.name": "opensearch-data-os-2", "message": "Force merge still running on [.ds-logs-log-default-000617] with [2] shards containing unmerged segments", "cluster.uuid": "ZbghcuYqTtWRmCHMd4tbyw", "node.id": "cYyrcay5QPS7_zi0HxvyJg"  }
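To see which shard copies the step is still waiting on, the segments per shard can be counted from the output of `GET _cat/segments/<index>?format=json`. The following is a minimal Python sketch, assuming each row carries the `index`, `shard`, `prirep`, and `ip` fields; the function name `shards_over_limit` is hypothetical and not part of the plugin.

```python
from collections import Counter


def shards_over_limit(rows, max_num_segments=1):
    """Return shard copies that hold more segments than max_num_segments.

    rows: parsed JSON list from `_cat/segments?format=json`, one dict per segment.
    The field names used below are an assumption about that output.
    """
    # One row per segment, so counting rows per shard copy gives its segment count.
    counts = Counter(
        (row["index"], row["shard"], row["prirep"], row["ip"]) for row in rows
    )
    return {key: n for key, n in counts.items() if n > max_num_segments}
```

Any shard copy returned here corresponds to a shard "containing unmerged segments" in the sense of the log message, and the `ip` in the key points to the node whose logs are worth checking.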

How can one reproduce the bug?

  1. Create a new OpenSearch cluster with hot and warm tiers
  2. Create an index template that enables data streams
    {
      "index_patterns": [
        "logs-*"
      ],
      "priority": "500",
      "data_stream": {
        "timestamp_field": {
          "name": "@timestamp"
        }
      },
      "name": "template_log",
      "template": {}
    }
  3. Create the data stream logs-log-default
  4. Create the ISM policy
    {
    "id": "policy-log",
    "seqNo": 2848481,
    "primaryTerm": 23,
    "policy": {
        "policy_id": "policy-log",
        "description": "Policy for logs index",
        "last_updated_time": 1725961147454,
        "schema_version": 21,
        "error_notification": null,
        "default_state": "hot",
        "states": [
            {
                "name": "hot",
                "actions": [
                    {
                        "retry": {
                            "count": 3,
                            "backoff": "exponential",
                            "delay": "1m"
                        },
                        "rollover": {
                            "min_index_age": "1d",
                            "min_primary_shard_size": "5gb",
                            "copy_alias": false
                        }
                    }
                ],
                "transitions": [
                    {
                        "state_name": "warm",
                        "conditions": {
                            "min_index_age": "1d"
                        }
                    }
                ]
            },
            {
                "name": "warm",
                "actions": [
                    {
                        "retry": {
                            "count": 3,
                            "backoff": "exponential",
                            "delay": "1m"
                        },
                        "read_only": {}
                    },
                    {
                        "retry": {
                            "count": 3,
                            "backoff": "exponential",
                            "delay": "1m"
                        },
                        "allocation": {
                            "require": {
                                "temp": "warm"
                            },
                            "include": {},
                            "exclude": {},
                            "wait_for": false
                        }
                    },
                    {
                        "retry": {
                            "count": 3,
                            "backoff": "exponential",
                            "delay": "1m"
                        },
                        "index_priority": {
                            "priority": 50
                        }
                    },
                    {
                        "timeout": "1d",
                        "retry": {
                            "count": 3,
                            "backoff": "exponential",
                            "delay": "1m"
                        },
                        "force_merge": {
                            "max_num_segments": 1
                        }
                    }
                ],
                "transitions": [
                    {
                        "state_name": "delete",
                        "conditions": {
                            "min_index_age": "2d"
                        }
                    }
                ]
            },
            {
                "name": "delete",
                "actions": [
                    {
                        "retry": {
                            "count": 3,
                            "backoff": "exponential",
                            "delay": "1m"
                        },
                        "delete": {}
                    }
                ],
                "transitions": []
            }
        ],
        "ism_template": [
            {
                "index_patterns": [
                    "logs-log-*"
                ],
                "priority": 100,
                "last_updated_time": 1725961147454
            }
        ]
    }
    }

  5. Wait for the force_merge step to run. It always ends in a timeout.

What is the expected behavior?

The force merge completes successfully and leaves one segment per shard.

What is your host/environment?

OpenSearch 2.16.0

disaster37 commented 2 months ago

I finally found the relevant log on the data node that hosts the last shard with unmerged segments: "Caused by: java.io.IOException: No space left on device"

disaster37 commented 2 months ago

I think the force_merge step should estimate the target size and check whether the node has sufficient disk space. In any case, the step should fail with the "No space left on device" error instead of failing with an action timeout.
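A minimal sketch of what such a pre-check could look like, in Python for illustration only. It assumes that merging a shard down to one segment can temporarily need extra free space of roughly the shard's current size, because the new segment is written before the old ones are deleted. The function name and the `overhead_factor` parameter are hypothetical and are not part of the plugin.

```python
def has_merge_headroom(shard_size_bytes, free_bytes, overhead_factor=1.0):
    """Return True if free disk space covers the assumed merge overhead.

    overhead_factor is an assumption: 1.0 means the merge may temporarily
    need as much extra space as the shard currently occupies.
    """
    if shard_size_bytes < 0 or free_bytes < 0 or overhead_factor < 0:
        raise ValueError("sizes and overhead_factor must be non-negative")
    required_bytes = shard_size_bytes * overhead_factor
    return free_bytes >= required_bytes
```

With the 5gb primary shard size used as the rollover condition in the policy above, a node with 4 GiB free would fail this check under the default factor, and the step could report the shortage immediately rather than waiting for the timeout.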

bharath-techie commented 1 month ago

@disaster37 did you try the explain API to get information about the policy failure?
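For reference, the ISM explain API is called as `GET _plugins/_ism/explain/<index>`. The following Python sketch builds that path and pulls the failure-related fields out of a parsed response. It assumes the response is keyed by index name and contains `action`, `step`, and `info` objects; the helper names are hypothetical, and missing fields are returned as `None`.

```python
from urllib.parse import quote


def explain_path(index_name):
    """Build the request path for the ISM explain API for one index."""
    return "/_plugins/_ism/explain/" + quote(index_name, safe="")


def failure_summary(explain_response, index_name):
    """Extract action, step, and message fields from a parsed explain response.

    The response layout used here is an assumption; absent fields become None.
    """
    entry = explain_response.get(index_name) or {}
    action = entry.get("action") or {}
    step = entry.get("step") or {}
    info = entry.get("info") or {}
    return {
        "action": action.get("name"),
        "failed": bool(action.get("failed", False)),
        "step": step.get("name"),
        "step_status": step.get("step_status"),
        "message": info.get("message"),
    }
```

For the backing index named in the log above, `explain_path(".ds-logs-log-default-000617")` gives the path to request, and the `message` field of the summary is where the reason for the failure would be expected to show up.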

dblock commented 1 month ago

[Catch All Triage - 1, 2, 3, 4]