opensearch-project / index-management

🗃 Automate periodic data operations, such as deleting indices at a certain age or performing a rollover at a certain size
https://opensearch.org/docs/latest/im-plugin/index/
Apache License 2.0

[BUG] Lock acquire/release for snapshot jobs emit warning and error logs #1199

Open spapadop opened 4 months ago

spapadop commented 4 months ago

Describe the bug

I have a simple snapshot management job running daily:

{
    "name": "daily-monit-qa-backup_v2",
    "description": "Daily snapshot policy",
    "schema_version": 19,
    "creation": {
      "schedule": {
        "cron": {
          "expression": "0 12 * * *",
          "timezone": "Europe/Zurich"
        }
      }
    },
    "deletion": {
      "schedule": {
        "cron": {
          "expression": "0 12 * * *",
          "timezone": "Europe/Zurich"
        }
      },
      "condition": {
        "max_age": "3d",
        "min_count": 1
      }
    },
    "snapshot_config": {
      "indices": "monit_qa*",
      "ignore_unavailable": true,
      "repository": "s3-monitqa1-bucket",
      "partial": true
    },
    "schedule": {
      "interval": {
        "start_time": 1718098008729,
        "period": 1,
        "unit": "Minutes"
      }
    },
    "enabled": true,
    "last_updated_time": 1718709061300,
    "enabled_time": 1718098008729
}

Every day it produces around 70 warning logs like:

Cannot acquire lock for snapshot management job daily-monit-qa-backup_v2

followed by 2 error logs:

Could not release lock [.opendistro-ism-config-daily-monit-qa-backup_v2-sm-policy] for daily-monit-qa-backup_v2-sm-policy.

These two error logs trigger two failure notifications if a notification channel is configured for this.

However, the snapshot actually has a "SUCCESS" status, so these logs seem insignificant, or at least not worth digging into, since my snapshot is successful. I'm not sure what should happen here. I guess the severity of these logs should be downgraded to "INFO" or "DEBUG"; they should definitely not be "ERROR", as this makes the failure notification functionality unreliable.

Related component

Storage:Snapshots

To Reproduce

  1. Create a daily snapshot policy to an s3 bucket, like the one I specify above.
  2. Observe the WARN/ERROR logs emitted when the snapshot is created.
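
To put numbers on step 2, something like the following could be used to count the two messages quoted above in a log file. This is only a rough sketch: the function names are made up for illustration, and it assumes the messages appear verbatim, one per line, in a plain-text OpenSearch log.

```python
from collections import Counter

# Message fragments copied from the log lines quoted in this report.
ACQUIRE_WARNING = "Cannot acquire lock for snapshot management job"
RELEASE_ERROR = "Could not release lock"


def count_lock_messages(lines):
    """Count lock-related log lines, keyed by "acquire" or "release"."""
    counts = Counter()
    for line in lines:
        if ACQUIRE_WARNING in line:
            counts["acquire"] += 1
        elif RELEASE_ERROR in line:
            counts["release"] += 1
    return counts


def count_lock_messages_in_file(path):
    """Same count, reading a log file line by line (the path is up to the caller)."""
    with open(path, encoding="utf-8", errors="replace") as log_file:
        return count_lock_messages(log_file)
```

Calling `count_lock_messages_in_file` on one day's log should show whether the counts roughly match the ~70 warnings and 2 errors described above.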

Expected behavior

If the snapshot is successful, it should produce no ERROR logs.

Additional Details

Plugins: All default ones + repository-s3

Host/Environment (please complete the following information):

Additional context: Tested on OpenSearch v2.11.1

dblock commented 3 months ago

Thanks for opening this @spapadop.

[Catch All Triage w/ 1, 2, 3]

jetnet commented 3 days ago

Just in case, repeating my message from the issue mentioned above:

It happens in our environment after upgrading from 2.12 to 2.17.1. Now, the warning appears after every OS restart. No new snapshots get created. Workaround: delete the snapshot policies and create them again (just updating with the same content/config does not help).
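
For anyone who wants to script the delete-and-recreate workaround, here is a minimal sketch. It assumes the Snapshot Management endpoints under `/_plugins/_sm/policies/<name>`, that the GET response wraps the policy in an `sm_policy` key, and that only the fields in `CREATE_FIELDS` should be sent back on create; that field list is my own assumption, not something confirmed in this thread. The helper names are hypothetical, and authentication and TLS settings are left out.

```python
import json
import urllib.request

# Assumption: fields to send when creating a policy. Server-managed fields
# such as name, schema_version, schedule and the timestamps are dropped.
CREATE_FIELDS = ("description", "creation", "deletion",
                 "snapshot_config", "notification", "enabled")


def creation_body(stored_policy):
    """Keep only the fields assumed to be valid in a create request."""
    return {key: stored_policy[key] for key in CREATE_FIELDS if key in stored_policy}


def call(base_url, method, path, body=None):
    """Send one JSON request to the cluster and return the decoded response."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    request = urllib.request.Request(
        base_url + path, data=data, method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def recreate_policy(base_url, name):
    """Fetch the policy, delete it, then create it again from the saved copy."""
    path = f"/_plugins/_sm/policies/{name}"
    stored = call(base_url, "GET", path)["sm_policy"]
    call(base_url, "DELETE", path)
    return call(base_url, "POST", path, creation_body(stored))
```

For example, `recreate_policy("http://localhost:9200", "daily-monit-qa-backup_v2")` would cover the policy from the original report, with the host and port being placeholders. It is probably worth saving the GET output somewhere first, since the policy is gone if the create step fails after the delete.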