opensearch-project / cross-cluster-replication

Synchronize your data across multiple clusters for lower latencies and higher availability
https://opensearch.org/docs/latest/replication-plugin/index/
Apache License 2.0

[BUG] Index settings sync from leader fails for Remote Store enabled follower #1300

Closed linuxpi closed 8 months ago

linuxpi commented 9 months ago

What is the bug? While index settings are being synced from the leader domain (OS 1.3) to the follower domain (OS 2.11 with Remote Store enabled), we get the following error:

[2023-12-11T19:03:20,322][INFO ][o.o.r.t.i.IndexReplicationTask] [bba39cc188fdc55b7516a083b0f7b5ad] [aggregate_email_event_automation] Closed the index aggregate_email_event_automation to apply static settings now
[2023-12-11T19:03:20,385][INFO ][o.o.g.r.RemoteClusterStateService] [bba39cc188fdc55b7516a083b0f7b5ad] writing cluster state for version [143] took [60ms]; wrote metadata for [1] indices and skipped [8] unchanged indices, global metadata updated : [false]
[2023-12-11T19:03:20,433][INFO ][o.o.p.PluginsService     ] [bba39cc188fdc55b7516a083b0f7b5ad] PluginService:onIndexModule index:[aggregate_email_event_automation/01ayit5a7KRtaDUCmA4ImcBQ]
[2023-12-11T19:03:20,612][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [bba39cc188fdc55b7516a083b0f7b5ad] Detected cluster change event for destination migration
[2023-12-11T19:03:20,613][ERROR][o.o.r.m.TransportUpdateMetadataAction] [bba39cc188fdc55b7516a083b0f7b5ad] failed to update settings on index aggregate_email_event_automation
java.lang.IllegalArgumentException: final aggregate_email_event_automation setting [index.replication.type], not updateable
        at org.opensearch.common.settings.AbstractScopedSettings.updateSettings(AbstractScopedSettings.java:871)
        at org.opensearch.common.settings.AbstractScopedSettings.updateSettings(AbstractScopedSettings.java:824)
        at org.opensearch.cluster.metadata.MetadataUpdateSettingsService$1.execute(MetadataUpdateSettingsService.java:300)
        at org.opensearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:65)
        at org.opensearch.cluster.service.MasterService.executeTasks(MasterService.java:880)
        at org.opensearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:432)
        at org.opensearch.cluster.service.MasterService.runTasks(MasterService.java:299)
        at org.opensearch.cluster.service.MasterService$Batcher.run(MasterService.java:210)
        at org.opensearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:209)
        at org.opensearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:247)
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:858)
        at [2023-12-11T19:03:20,616][INFO ][o.o.c.m.MetadataIndexStateService] [bba39cc188fdc55b7516a083b0f7b5ad] opening indices [[aggregate_email_event_automation/01ayit5a7KRtaDUCmA4ImcBQ]]

It seems the issue happens due to a setting that is supported on the follower but not on the leader domain. During index settings sync, the value of such a setting may come through as null, since the leader doesn't support it.

Another reason this could happen is if the leader supports a value for a setting that is not supported on the follower domain. In the case of the index.replication.type setting, the leader could allow setting it to null so that it falls back to the default value, but on the follower domain (Remote Store) we don't allow it to be set to null.

Also, when applying these settings fails, we still proceed and open the index.

https://github.com/opensearch-project/cross-cluster-replication/blob/be24bfaa9a795179662a6660633902a7737b717c/src/main/kotlin/org/opensearch/replication/task/index/IndexReplicationTask.kt#L619-L634

What is the expected behavior? Such index settings sync operations should be handled gracefully.
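One way to handle this gracefully could be to filter the leader's settings before applying them on the follower: drop null values (which typically mean the leader doesn't know the setting) and skip settings the follower marks as final, such as index.replication.type on a remote-store-enabled index. The sketch below uses hypothetical names and plain maps, not the plugin's actual API:

```kotlin
// Hypothetical sketch: filter the settings received from the leader before
// applying them on the follower, instead of failing the whole sync.
// Names and types here are illustrative, not the plugin's actual API.
fun filterApplicableSettings(
    leaderSettings: Map<String, String?>,
    followerFinalKeys: Set<String>,
): Map<String, String> {
    val applicable = LinkedHashMap<String, String>()
    for ((key, value) in leaderSettings) {
        when {
            // null usually means the leader doesn't support the setting; skip it
            value == null -> println("skipping $key: null value from leader")
            // final settings (e.g. index.replication.type on remote store) are not updateable
            key in followerFinalKeys -> println("skipping $key: final on follower")
            else -> applicable[key] = value
        }
    }
    return applicable
}

fun main() {
    val fromLeader = mapOf(
        "index.number_of_replicas" to "2",
        "index.replication.type" to null,   // unknown on an OS 1.3 leader
        "index.refresh_interval" to "5s",
    )
    val finalOnFollower = setOf("index.replication.type")
    println(filterApplicableSettings(fromLeader, finalOnFollower))
    // prints: {index.number_of_replicas=2, index.refresh_interval=5s}
}
```

The skipped keys could be logged at WARN level so the divergence between leader and follower settings is visible without failing the replication task.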

linuxpi commented 9 months ago

Another issue encountered in a similar setup:

[2023-12-19T12:08:30,471][INFO ][o.o.c.m.MetadataIndexStateService] [f2b8edd8d1ba00bcc7a9cf8ba945d2e0] closing indices [test-index/017zsErQylTmmbIUQbina1kg]
[2023-12-19T12:08:30,512][INFO ][o.o.g.r.RemoteClusterStateService] [f2b8edd8d1ba00bcc7a9cf8ba945d2e0] Delete stale cluster metadata task is already in progress.
[2023-12-19T12:08:30,512][INFO ][o.o.g.r.RemoteClusterStateService] [f2b8edd8d1ba00bcc7a9cf8ba945d2e0] writing cluster state for version [1492] took [40ms]; wrote metadata for [0] indices and skipped [8] unchanged indices, global metadata updated : [false]
[2023-12-19T12:08:30,556][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [f2b8edd8d1ba00bcc7a9cf8ba945d2e0] Detected cluster change event for destination migration
[2023-12-19T12:08:30,739][INFO ][o.o.c.m.MetadataIndexStateService] [f2b8edd8d1ba00bcc7a9cf8ba945d2e0] completed closing of indices [test-index]
[2023-12-19T12:08:30,877][INFO ][o.o.g.r.RemoteClusterStateService] [f2b8edd8d1ba00bcc7a9cf8ba945d2e0] Delete stale cluster metadata task is already in progress.
[2023-12-19T12:08:30,877][INFO ][o.o.g.r.RemoteClusterStateService] [f2b8edd8d1ba00bcc7a9cf8ba945d2e0] writing cluster state for version [1493] took [135ms]; wrote metadata for [1] indices and skipped [7] unchanged indices, global metadata updated : [false]
[2023-12-19T12:08:30,937][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [f2b8edd8d1ba00bcc7a9cf8ba945d2e0] Detected cluster change event for destination migration
[2023-12-19T12:08:30,938][INFO ][o.o.r.t.i.IndexReplicationTask] [f2b8edd8d1ba00bcc7a9cf8ba945d2e0] [test-index] Closed the index test-index to apply static settings now
[2023-12-19T12:08:31,024][INFO ][o.o.g.r.RemoteClusterStateService] [f2b8edd8d1ba00bcc7a9cf8ba945d2e0] Delete stale cluster metadata task is already in progress.
[2023-12-19T12:08:31,024][INFO ][o.o.g.r.RemoteClusterStateService] [f2b8edd8d1ba00bcc7a9cf8ba945d2e0] writing cluster state for version [1494] took [85ms]; wrote metadata for [1] indices and skipped [7] unchanged indices, global metadata updated : [false]
[2023-12-19T12:08:31,067][INFO ][o.o.p.PluginsService     ] [f2b8edd8d1ba00bcc7a9cf8ba945d2e0] PluginService:onIndexModule index:[test-index/017zsErQylTmmbIUQbina1kg]
[2023-12-19T12:08:31,284][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [f2b8edd8d1ba00bcc7a9cf8ba945d2e0] Detected cluster change event for destination migration
[2023-12-19T12:08:31,285][INFO ][o.o.c.m.MetadataUpdateSettingsService] [f2b8edd8d1ba00bcc7a9cf8ba945d2e0] updating number_of_replicas to [2] for indices [test-index]
[2023-12-19T12:08:31,285][ERROR][o.o.r.m.TransportUpdateMetadataAction] [f2b8edd8d1ba00bcc7a9cf8ba945d2e0] failed to update settings on index test-index
java.lang.IllegalArgumentException: final test-index setting [index.replication.type], not updateable
        at org.opensearch.common.settings.AbstractScopedSettings.updateSettings(AbstractScopedSettings.java:871)
        at org.opensearch.common.settings.AbstractScopedSettings.updateSettings(AbstractScopedSettings.java:824)
        at org.opensearch.cluster.metadata.MetadataUpdateSettingsService$1.execute(MetadataUpdateSettingsService.java:300)
        at org.opensearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:65)
        at org.opensearch.cluster.service.MasterService.executeTasks(MasterService.java:880)
        at org.opensearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:432)
        at org.opensearch.cluster.service.MasterService.runTasks(MasterService.java:299)
        at org.opensearch.cluster.service.MasterService$Batcher.run(MasterService.java:210)
        at org.opensearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:209)
        at org.opensearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:247)
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:858)
        at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedOpenSearchThreadPoolExecutor.java:282)
        at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedOpenSearchThreadPoolExecutor.java:245)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.base/java.lang.Thread.run(Thread.java:833)
[2023-12-19T12:08:31,286][INFO ][o.o.c.m.MetadataIndexStateService] [f2b8edd8d1ba00bcc7a9cf8ba945d2e0] opening indices [[test-index/017zsErQylTmmbIUQbina1kg]]
[2023-12-19T12:08:31,287][INFO ][o.o.p.PluginsService     ] [f2b8edd8d1ba00bcc7a9cf8ba945d2e0] PluginService:onIndexModule index:[test-index/017zsErQylTmmbIUQbina1kg]
[2023-12-19T12:08:31,352][INFO ][o.o.i.s.IndexShard       ] [f2b8edd8d1ba00bcc7a9cf8ba945d2e0] [test-index][1] Downloaded translog and checkpoint files from=306 to=306
[2023-12-19T12:08:31,375][INFO ][o.o.g.r.RemoteClusterStateService] [f2b8edd8d1ba00bcc7a9cf8ba945d2e0] writing cluster state for version [1495] took [83ms]; wrote metadata for [1] indices and skipped [7] unchanged indices, global metadata updated : [false]
[2023-12-19T12:08:31,416][INFO ][o.o.i.t.RemoteFsTranslog ] [f2b8edd8d1ba00bcc7a9cf8ba945d2e0] [test-index][1] Downloaded translog and checkpoint files from=306 to=306
[2023-12-19T12:08:31,416][INFO ][o.o.i.t.RemoteFsTranslog ] [f2b8edd8d1ba00bcc7a9cf8ba945d2e0] [test-index][1] Downloaded data from remote translog till maxSeqNo = -1
[2023-12-19T12:08:31,443][INFO ][o.o.i.s.IndexShard       ] [f2b8edd8d1ba00bcc7a9cf8ba945d2e0] [test-index][4] Downloaded translog and checkpoint files from=306 to=306
[2023-12-19T12:08:31,506][INFO ][o.o.i.t.RemoteFsTranslog ] [f2b8edd8d1ba00bcc7a9cf8ba945d2e0] [test-index][4] Downloaded translog and checkpoint files from=306 to=306
[2023-12-19T12:08:31,507][INFO ][o.o.i.t.RemoteFsTranslog ] [f2b8edd8d1ba00bcc7a9cf8ba945d2e0] [test-index][4] Downloaded data from remote translog till maxSeqNo = -1
[2023-12-19T12:08:31,555][INFO ][o.o.i.s.IndexShard       ] [f2b8edd8d1ba00bcc7a9cf8ba945d2e0] [test-index][3] Downloaded translog and checkpoint files from=301 to=305
[2023-12-19T12:08:31,647][INFO ][c.a.c.e.logger           ] [f2b8edd8d1ba00bcc7a9cf8ba945d2e0] GET /_cat/master h=id 200 OK 23 1
[2023-12-19T12:08:31,648][INFO ][c.a.c.e.logger           ] [f2b8edd8d1ba00bcc7a9cf8ba945d2e0] GET /_nodes/_local/stats/discovery - 200 OK 4122 1
[2023-12-19T12:08:31,733][INFO ][o.o.i.t.RemoteFsTranslog ] [f2b8edd8d1ba00bcc7a9cf8ba945d2e0] [test-index][3] Downloaded translog and checkpoint files from=301 to=305
[2023-12-19T12:08:31,733][INFO ][o.o.i.t.RemoteFsTranslog ] [f2b8edd8d1ba00bcc7a9cf8ba945d2e0] [test-index][3] Downloaded data from remote translog till maxSeqNo = -1
[2023-12-19T12:08:32,335][INFO ][c.a.c.e.logger           ] [f2b8edd8d1ba00bcc7a9cf8ba945d2e0] GET /_cluster/state/nodes filter_path=nodes.*.attributes.di_number 200 OK 69 0
[2023-12-19T12:08:32,337][INFO ][c.a.c.e.logger           ] [f2b8edd8d1ba00bcc7a9cf8ba945d2e0] GET /_nodes/_local/process filter_path=nodes.*.version%2Cnodes.*.http.publish_address%2Cnodes.*.ip 200 OK 77 1
[2023-12-19T12:08:32,779][INFO ][c.a.c.e.logger           ] [f2b8edd8d1ba00bcc7a9cf8ba945d2e0] GET /_cluster/health local=true&timeout=4s 200 OK 469 1
[2023-12-19T12:08:34,133][INFO ][o.o.i.s.IndexShard       ] [f2b8edd8d1ba00bcc7a9cf8ba945d2e0] [test-index][0] Downloaded translog and checkpoint files from=115 to=212
[2023-12-19T12:08:34,836][INFO ][c.a.c.e.logger           ] [f2b8edd8d1ba00bcc7a9cf8ba945d2e0] GET /_cluster/state/nodes filter_path=nodes.*.attributes.di_number 200 OK 69 1
[2023-12-19T12:08:34,837][INFO ][c.a.c.e.logger           ] [f2b8edd8d1ba00bcc7a9cf8ba945d2e0] GET /_nodes/_local/process filter_path=nodes.*.version%2Cnodes.*.http.publish_address%2Cnodes.*.ip 200 OK 77 1
[2023-12-19T12:08:35,892][ERROR][o.o.r.a.s.TransportReplicationStatusAction] [f2b8edd8d1ba00bcc7a9cf8ba945d2e0] got Exception while querying for status 
[test-index/017zsErQylTmmbIUQbina1kg] IndexClosedException[closed]
        at org.opensearch.cluster.metadata.IndexNameExpressionResolver.shouldTrackConcreteIndex(IndexNameExpressionResolver.java:427)
        at org.opensearch.cluster.metadata.IndexNameExpressionResolver.concreteIndices(IndexNameExpressionResolver.java:383)
        at org.opensearch.cluster.metadata.IndexNameExpressionResolver.concreteIndexNames(IndexNameExpressionResolver.java:276)
        at org.opensearch.cluster.metadata.IndexNameExpressionResolver.concreteIndexNames(IndexNameExpressionResolver.java:150)
        at org.opensearch.action.support.broadcast.node.TransportBroadcastByNodeAction.resolveConcreteIndexNames(TransportBroadcastByNodeAction.java:268)
        at org.opensearch.action.support.broadcast.node.TransportBroadcastByNodeAction$AsyncAction.<init>(TransportBroadcastByNodeAction.java:305)
        at org.opensearch.action.support.broadcast.node.TransportBroadcastByNodeAction.doExecute(TransportBroadcastByNodeAction.java:273)
        at org.opensearch.action.support.broadcast.node.TransportBroadcastByNodeAction.doExecute(TransportBroadcastByNodeAction.java:92)
        at org.opensearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:218)
        at org.opensearch.indexmanagement.controlcenter.notification.filter.IndexOperationActionFilter.apply(IndexOperationActionFilter.kt:39)
        at org.opensearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:216)
        at org.opensearch.indexmanagement.rollup.actionfilter.FieldCapsFilter.apply(FieldCapsFilter.kt:118)
        at org.opensearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:216)
        at org.opensearch.performanceanalyzer.action.PerformanceAnalyzerActionFilter.apply(PerformanceAnalyzerActionFilter.java:78)
        at org.opensearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:216)
        at org.opensearch.security.filter.SecurityFilter.apply0(SecurityFilter.java:395)
        at org.opensearch.security.filter.SecurityFilter.apply(SecurityFilter.java:165)
        at org.opensearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:216)
        at org.opensearch.action.support.TransportAction.execute(TransportAction.java:188)
        at org.opensearch.action.support.TransportAction.execute(TransportAction.java:107)
        at org.opensearch.client.node.NodeClient.executeLocally(NodeClient.java:110)
        at org.opensearch.client.node.NodeClient.doExecute(NodeClient.java:97)
        at org.opensearch.client.support.AbstractClient.execute(AbstractClient.java:476)
        at org.opensearch.replication.util.CoroutinesKt$suspendExecute$2.invokeSuspend(Coroutines.kt:80)
        at org.opensearch.replication.util.CoroutinesKt$suspendExecute$2.invoke(Coroutines.kt)
        at org.opensearch.replication.util.CoroutinesKt$suspendExecute$2.invoke(Coroutines.kt)
        at kotlinx.coroutines.intrinsics.UndispatchedKt.startUndispatchedOrReturn(Undispatched.kt:89)
        at kotlinx.coroutines.BuildersKt__Builders_commonKt.withContext(Builders.common.kt:165)
        at kotlinx.coroutines.BuildersKt.withContext(Unknown Source)
        at org.opensearch.replication.util.CoroutinesKt.suspendExecute(Coroutines.kt:79)
        at org.opensearch.replication.util.CoroutinesKt.suspendExecute$default(Coroutines.kt:75)
        at org.opensearch.replication.action.status.TransportReplicationStatusAction$doExecute$1.invokeSuspend(TransportReplicationStatusAction.kt:62)
        at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
        at kotlinx.coroutines.UndispatchedCoroutine.afterResume(CoroutineContext.kt:147)
        at kotlinx.coroutines.AbstractCoroutine.resumeWith(AbstractCoroutine.kt:102)
        at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:46)
        at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:106)
        at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:571)
        at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.executeTask(CoroutineScheduler.kt:750)
        at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.runWorker(CoroutineScheduler.kt:678)
        at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:665)
[2023-12-19T12:08:35,892][WARN ][r.suppressed             ] [f2b8edd8d1ba00bcc7a9cf8ba945d2e0] path: /_plugins/_replication/test-index/_status, params: {pretty=true, index=test-index}
ReplicationException[failed to fetch replication status]
        at org.opensearch.replication.action.status.TransportReplicationStatusAction$doExecute$1.invokeSuspend(TransportReplicationStatusAction.kt:104)
        at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
        at kotlinx.coroutines.UndispatchedCoroutine.afterResume(CoroutineContext.kt:147)
        at kotlinx.coroutines.AbstractCoroutine.resumeWith(AbstractCoroutine.kt:102)
        at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:46)
        at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:106)
        at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:571)
        at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.executeTask(CoroutineScheduler.kt:750)
        at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.runWorker(CoroutineScheduler.kt:678)
        at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:665)
[2023-12-19T12:08:35,892][INFO ][c.a.c.e.logger           ] [f2b8edd8d1ba00bcc7a9cf8ba945d2e0] GET /_plugins/_replication/test-index/_status pretty=true 500 INTERNAL_SERVER_ERROR 272 2
[2023-12-19T12:08:36,803][INFO ][c.a.c.e.logger           ] [f2b8edd8d1ba00bcc7a9cf8ba945d2e0] GET /_nodes/_local/stats filter_path=nodes.*.task_cancellation 200 OK 140 4

After updating any index settings in a setup where the leader is on an older version, the follower starts throwing the above exception when queried for the replication status of the index.
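Since applying static settings temporarily closes the index, a status query during that window hits IndexClosedException; and if the settings update throws, the index must still be reopened regardless. A defensive pattern is to wrap the update in try/finally so a failed update never leaves the index closed. This is a minimal sketch with hypothetical names, not the plugin's actual code:

```kotlin
// Hypothetical sketch: guarantee the index is reopened even when applying
// static settings throws, keeping the closed window as short as possible.
// FakeIndex stands in for the real index state; names are illustrative only.
class FakeIndex(val name: String) {
    var open: Boolean = true
}

fun applyStaticSettings(index: FakeIndex, update: () -> Unit): Result<Unit> {
    index.open = false                    // close the index to apply static settings
    return try {
        update()
        Result.success(Unit)
    } catch (e: IllegalArgumentException) {
        Result.failure(e)                 // surface the failure to the caller
    } finally {
        index.open = true                 // always reopen, on success or failure
    }
}

fun main() {
    val idx = FakeIndex("test-index")
    val result = applyStaticSettings(idx) {
        throw IllegalArgumentException("final setting [index.replication.type], not updateable")
    }
    println("reopened=${idx.open}, failed=${result.isFailure}")
    // prints: reopened=true, failed=true
}
```

Callers of the status API could additionally treat IndexClosedException during this window as transient and retry, rather than surfacing a 500 to the client.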

nisgoel-amazon commented 8 months ago

@monusingh-1 Assign this to me

monusingh-1 commented 8 months ago

Thanks @nisgoel-amazon for closing this.