opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0

[Snapshot Interop] Shallow copy snapshots failing for closed indices #13805

Open harishbhakuni opened 5 months ago

harishbhakuni commented 5 months ago

Describe the bug

We recently found an issue where shallow copy snapshots fail for closed indices, whereas full copy snapshots of the same indices succeed.

Snapshot shard failed
java.nio.file.NoSuchFileException: Metadata file is not present for given primary term 2 and generation 6
    at org.opensearch.index.store.RemoteSegmentStoreDirectory.getMetadataFileForCommit(RemoteSegmentStoreDirectory.java:527)
    at org.opensearch.index.store.RemoteSegmentStoreDirectory.acquireLock(RemoteSegmentStoreDirectory.java:480)
    at org.opensearch.index.shard.IndexShard.acquireLockOnCommitData(IndexShard.java:1655)
    at org.opensearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:631)
    at org.opensearch.snapshots.SnapshotShardsService$1.doRun(SnapshotShardsService.java:393)
    at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractPrioritizedRunnable.doRun(ThreadContext.java:979)
    at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:840)

For shallow copy snapshots, we reference the latest remote store data and acquire a lock on that data. Because the indices are closed, no new data is written to the remote store, even though the snapshot flush should trigger such a write. The lock acquisition then fails, as shown in the stack trace above, and the snapshot fails with it.

Related component

Storage:Snapshots

To Reproduce

  1. Create a remote store enabled cluster.
  2. Create indices and close them.
  3. Register a snapshot repository and enable shallow copy snapshots, or use the system repository created during cluster creation.
  4. Trigger a snapshot; it fails.
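
The steps above can be sketched as a sequence of REST calls. This is only a rough illustration, not a verified reproduction script: the base URL, index name, repository name, snapshot name, and the `fs` repository location are all hypothetical, and step 1 (a remote store enabled cluster) is assumed to be in place already. The `remote_store_index_shallow_copy` repository setting is the one the OpenSearch documentation describes for enabling shallow copy snapshots.

```python
import json
import urllib.request

# All of these values are illustrative assumptions, not taken from the issue.
BASE_URL = "http://localhost:9200"
INDEX = "closed-index-1"
REPO = "shallow-repo"
SNAPSHOT = "snap-1"


def build_repro_requests(index=INDEX, repo=REPO, snapshot=SNAPSHOT,
                         location="/mnt/snapshots"):
    """Return (method, path, body) tuples mirroring steps 2 to 4."""
    return [
        # Step 2: create an index, then close it.
        ("PUT", f"/{index}", None),
        ("POST", f"/{index}/_close", None),
        # Step 3: register a repository with shallow copy enabled.
        ("PUT", f"/_snapshot/{repo}", {
            "type": "fs",
            "settings": {
                "location": location,
                "remote_store_index_shallow_copy": True,
            },
        }),
        # Step 4: trigger the snapshot and wait for the result.
        ("PUT", f"/_snapshot/{repo}/{snapshot}?wait_for_completion=true", None),
    ]


def send(method, path, body=None, base_url=BASE_URL):
    """Send one request to the cluster and return the parsed JSON response."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(
        base_url + path,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


# Usage against a test cluster (not executed here):
#   for method, path, body in build_repro_requests():
#       print(method, path, send(method, path, body))
```

Keeping the request list separate from the sending code makes it possible to inspect or adapt the sequence without a running cluster.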

Expected behavior

Shallow copy snapshots should succeed for closed indices, as full copy snapshots already do.

peternied commented 5 months ago

[Triage - attendees 1 2 3 4 5 6] @harishbhakuni Thanks for creating this issue. We would welcome a pull request to address this bug.

sachinpkale commented 5 months ago

[Storage Triage - attendees 1 2 3 4 5 6 7 8 9 10]

Added release target 2.16

moritzzimmer commented 3 weeks ago

We face the same issue on AWS OpenSearch service. Unfortunately, this also seems to prevent domain upgrades, since the service tries to take a shallow copy snapshot first.

As a workaround, we had to reopen or delete the closed indices before we could upgrade our domain.
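
A minimal sketch of the reopen part of that workaround, in Python. It assumes the rows come from `GET /_cat/indices?format=json` and that closed indices are reported with a `status` of `close`; the helper names and sample index names are hypothetical, and the code only builds the `_open` requests rather than sending them.

```python
def closed_indices(cat_rows):
    """Return names of indices whose _cat/indices status is 'close' (assumed value)."""
    return [row["index"] for row in cat_rows if row.get("status") == "close"]


def reopen_requests(cat_rows):
    """Build one (method, path) pair per closed index using the open index API."""
    return [("POST", f"/{name}/_open") for name in closed_indices(cat_rows)]


# Hypothetical sample of what the _cat/indices JSON output might contain.
sample_rows = [
    {"index": "logs-2024-01", "status": "close"},
    {"index": "logs-2024-02", "status": "open"},
]
planned = reopen_requests(sample_rows)
```

Reopening every closed index may not be desirable on a production domain, so the list of planned requests is worth reviewing before any of them are sent.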