rancher / opni

Multi Cluster Observability with AIOps
https://opni.io
Apache License 2.0

Opensearch data pod and dashboards stuck unready #1434

Open kralicky opened 1 year ago

kralicky commented 1 year ago

Installed the logging backend with the following settings: [settings screenshot omitted]

The opni-data-0 pod is stuck in an unready state. The following error repeats in the pod logs:

[2023-05-24T01:09:48,327][INFO ][o.o.c.c.JoinHelper       ] [opni-data-0] failed to join {opni-bootstrap-0}{bHZn2zUyQo6UngXRqWpvpQ}{sKaY2RKMTM-zPwHTA3XcDg}{opni-bootstrap-0}{10.0.12.157:9300}{m}{shard_indexing_pressure_enabled=true} with JoinRequest{sourceNode={opni-data-0}{GG_7mvLkQdadVX92BVeV6w}{BkljvOTrQK6KRSr7kwMv9Q}{opni-data-0}{10.0.148.66:9300}{dim}{shard_indexing_pressure_enabled=true}, minimumTerm=2, optionalJoin=Optional.empty}
org.opensearch.transport.RemoteTransportException: [opni-bootstrap-0][10.0.12.157:9300][internal:cluster/coordination/join]
Caused by: java.lang.IllegalStateException: failure when sending a validation request to node
    at org.opensearch.cluster.coordination.Coordinator$2.onFailure(Coordinator.java:634) ~[opensearch-2.4.0.jar:2.4.0]
    at org.opensearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:74) ~[opensearch-2.4.0.jar:2.4.0]
    at org.opensearch.security.transport.SecurityInterceptor$RestoringTransportResponseHandler.handleException(SecurityInterceptor.java:312) ~[?:?]
    at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1379) ~[opensearch-2.4.0.jar:2.4.0]
    at org.opensearch.transport.InboundHandler.lambda$handleException$3(InboundHandler.java:420) ~[opensearch-2.4.0.jar:2.4.0]
    at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:747) ~[opensearch-2.4.0.jar:2.4.0]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
    at java.lang.Thread.run(Thread.java:833) [?:?]
Caused by: org.opensearch.transport.RemoteTransportException: [opni-data-0][10.0.148.66:9300][internal:cluster/coordination/join/validate]
Caused by: org.opensearch.cluster.coordination.CoordinationStateRejectedException: join validation on cluster state with a different cluster uuid MrjDcoJPRQSAtKxTyeH8aQ than local cluster uuid 9C2sVJCdRueVQSnG81W-qQ, rejecting
    at org.opensearch.cluster.coordination.JoinHelper.lambda$new$4(JoinHelper.java:219) ~[opensearch-2.4.0.jar:2.4.0]
    at org.opensearch.security.ssl.transport.SecuritySSLRequestHandler.messageReceivedDecorate(SecuritySSLRequestHandler.java:192) ~[?:?]
    at org.opensearch.security.transport.SecurityRequestHandler.messageReceivedDecorate(SecurityRequestHandler.java:278) ~[?:?]
    at org.opensearch.security.ssl.transport.SecuritySSLRequestHandler.messageReceived(SecuritySSLRequestHandler.java:152) ~[?:?]
    at org.opensearch.security.OpenSearchSecurityPlugin$7$1.messageReceived(OpenSearchSecurityPlugin.java:659) ~[?:?]
    at org.opensearch.indexmanagement.rollup.interceptor.RollupInterceptor$interceptHandler$1.messageReceived(RollupInterceptor.kt:100) ~[?:?]
    at org.opensearch.performanceanalyzer.transport.PerformanceAnalyzerTransportRequestHandler.messageReceived(PerformanceAnalyzerTransportRequestHandler.java:43) ~[?:?]
    at org.opensearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:106) ~[opensearch-2.4.0.jar:2.4.0]
    at org.opensearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:453) ~[opensearch-2.4.0.jar:2.4.0]
    at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:806) ~[opensearch-2.4.0.jar:2.4.0]
    at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[opensearch-2.4.0.jar:2.4.0]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
    at java.lang.Thread.run(Thread.java:833) ~[?:?]
[2023-05-24T01:09:48,600][ERROR][o.o.s.c.ConfigurationLoaderSecurity7] [opni-data-0] Exception while retrieving configuration for [INTERNALUSERS, ACTIONGROUPS, CONFIG, ROLES, ROLESMAPPING, TENANTS, NODESDN, WHITELIST, ALLOWLIST, AUDIT] (index=.opendistro_security)
org.opensearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];
    at org.opensearch.cluster.block.ClusterBlocks.globalBlockedException(ClusterBlocks.java:205) ~[opensearch-2.4.0.jar:2.4.0]
    at org.opensearch.cluster.block.ClusterBlocks.globalBlockedRaiseException(ClusterBlocks.java:191) ~[opensearch-2.4.0.jar:2.4.0]
    at org.opensearch.action.get.TransportMultiGetAction.doExecute(TransportMultiGetAction.java:81) ~[opensearch-2.4.0.jar:2.4.0]
    at org.opensearch.action.get.TransportMultiGetAction.doExecute(TransportMultiGetAction.java:58) ~[opensearch-2.4.0.jar:2.4.0]
    at org.opensearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:218) [opensearch-2.4.0.jar:2.4.0]
    at org.opensearch.indexmanagement.rollup.actionfilter.FieldCapsFilter.apply(FieldCapsFilter.kt:118) [opensearch-index-management-2.4.0.0.jar:2.4.0.0]
    at org.opensearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:216) [opensearch-2.4.0.jar:2.4.0]
    at org.opensearch.security.filter.SecurityFilter.apply0(SecurityFilter.java:232) [opensearch-security-2.4.0.0.jar:2.4.0.0]

Additionally, the dashboards pod is stuck restarting and has the following repeated logs:

...
{"type":"log","@timestamp":"2023-05-24T01:09:12Z","tags":["error","opensearch","data"],"pid":1,"message":"[ResponseError]: Response Error"}
{"type":"log","@timestamp":"2023-05-24T01:09:14Z","tags":["error","opensearch","data"],"pid":1,"message":"[ResponseError]: Response Error"}
{"type":"log","@timestamp":"2023-05-24T01:09:17Z","tags":["error","opensearch","data"],"pid":1,"message":"[ResponseError]: Response Error"}
{"type":"log","@timestamp":"2023-05-24T01:09:19Z","tags":["error","opensearch","data"],"pid":1,"message":"[ResponseError]: Response Error"}
{"type":"log","@timestamp":"2023-05-24T01:09:20Z","tags":["info","plugins-system"],"pid":1,"message":"Stopping all plugins."}
{"type":"log","@timestamp":"2023-05-24T01:09:20Z","tags":["info","savedobjects-service"],"pid":1,"message":"Starting saved objects migrations"}
{"type":"log","@timestamp":"2023-05-24T01:09:20Z","tags":["warning","savedobjects-service"],"pid":1,"message":"Unable to connect to OpenSearch. Error: Given the configuration, the ConnectionPool was not able to find a usable Connection for this request."}
{"type":"log","@timestamp":"2023-05-24T01:14:29Z","tags":["info","plugins-service"],"pid":1,"message":"Plugin \"dataSourceManagement\" has been disabled since the following direct or transitive dependencies are missing or disabled: [dataSource]"}
{"type":"log","@timestamp":"2023-05-24T01:14:29Z","tags":["info","plugins-service"],"pid":1,"message":"Plugin \"dataSource\" is disabled."}
{"type":"log","@timestamp":"2023-05-24T01:14:29Z","tags":["info","plugins-service"],"pid":1,"message":"Plugin \"visTypeXy\" is disabled."}
{"type":"log","@timestamp":"2023-05-24T01:14:29Z","tags":["warning","config","deprecation"],"pid":1,"message":"\"cpu.cgroup.path.override\" is deprecated and has been replaced by \"ops.cGroupOverrides.cpuPath\""}
{"type":"log","@timestamp":"2023-05-24T01:14:29Z","tags":["warning","config","deprecation"],"pid":1,"message":"\"cpuacct.cgroup.path.override\" is deprecated and has been replaced by \"ops.cGroupOverrides.cpuAcctPath\""}
{"type":"log","@timestamp":"2023-05-24T01:14:29Z","tags":["info","plugins-system"],"pid":1,"message":"Setting up [49] plugins: [securityAnalyticsDashboards,alertingDashboards,usageCollection,opensearchDashboardsUsageCollection,opensearchDashboardsLegacy,mapsLegacy,share,opensearchUiShared,legacyExport,embeddable,expressions,data,home,console,apmOss,management,indexPatternManagement,advancedSettings,savedObjects,reportsDashboards,indexManagementDashboards,anomalyDetectionDashboards,dashboard,visualizations,visTypeVega,visTypeTimeline,timeline,visTypeTable,visTypeMarkdown,visBuilder,tileMap,regionMap,customImportMapDashboards,inputControlVis,ganttChartDashboards,visualize,searchRelevanceDashboards,queryWorkbenchDashboards,notificationsDashboards,charts,visTypeTimeseries,visTypeVislib,visTypeTagcloud,visTypeMetric,observabilityDashboards,discover,savedObjectsManagement,securityDashboards,bfetch]"}
{"type":"log","@timestamp":"2023-05-24T01:14:30Z","tags":["info","savedobjects-service"],"pid":1,"message":"Waiting until all OpenSearch nodes are compatible with OpenSearch Dashboards before starting saved objects migrations..."}
{"type":"log","@timestamp":"2023-05-24T01:14:30Z","tags":["error","opensearch","data"],"pid":1,"message":"[ResponseError]: Response Error"}
{"type":"log","@timestamp":"2023-05-24T01:14:30Z","tags":["error","savedobjects-service"],"pid":1,"message":"Unable to retrieve version information from OpenSearch nodes."}
{"type":"log","@timestamp":"2023-05-24T01:14:32Z","tags":["error","opensearch","data"],"pid":1,"message":"[ResponseError]: Response Error"}
{"type":"log","@timestamp":"2023-05-24T01:14:35Z","tags":["error","opensearch","data"],"pid":1,"message":"[ResponseError]: Response Error"}
{"type":"log","@timestamp":"2023-05-24T01:14:37Z","tags":["error","opensearch","data"],"pid":1,"message":"[ResponseError]: Response Error"}
...
dbason commented 1 year ago

@kralicky what actions were taken to put the cluster in this state?

Caused by: org.opensearch.transport.RemoteTransportException: [opni-data-0][10.0.148.66:9300][internal:cluster/coordination/join/validate]
Caused by: org.opensearch.cluster.coordination.CoordinationStateRejectedException: join validation on cluster state with a different cluster uuid MrjDcoJPRQSAtKxTyeH8aQ than local cluster uuid 9C2sVJCdRueVQSnG81W-qQ, rejecting
    at org.opensearch.cluster.coordination.JoinHelper.lambda$new$4(JoinHelper.java:219) ~[opensearch-2.4.0.jar:2.4.0]

This error indicates a split brain has occurred. This can happen when the master data changes. Are you using the local path provisioner as the storage class? That could cause this problem if master nodes restart while they hold a minority of the quorum.
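The UUID mismatch can be confirmed directly by asking each node which cluster it believes it belongs to. This is a minimal sketch; the namespace, pod names, credentials, and TLS flags are assumptions based on the logs above and will likely need adjusting:

```shell
# Hypothetical check: print the cluster UUID each OpenSearch node reports.
# Namespace "opni" and admin:admin credentials are assumptions, not confirmed
# values from this deployment.
for pod in opni-bootstrap-0 opni-data-0; do
  kubectl -n opni exec "$pod" -- \
    curl -sk -u admin:admin https://localhost:9200/ | jq -r '.cluster_uuid'
done
```

If the two commands print different UUIDs, the nodes have formed (or retained) separate clusters, matching the `CoordinationStateRejectedException` above.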

kralicky commented 1 year ago

Looks like uninstalling logging and then reinstalling it reproduces this.

dbason commented 1 year ago

This is currently expected behaviour if the PVCs aren't removed after uninstall. There are a couple of possible approaches: we could provide an option to remove PVCs on uninstall, or we could avoid bootstrapping a new cluster when existing PVCs are present, but the latter is potentially brittle.
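Until the chart handles this, the stale PVCs can be removed by hand before reinstalling. This is a sketch of the manual workaround; the namespace and label selector are guesses, so verify what the actual PVCs are labelled with first:

```shell
# Inspect the PVCs left behind by the previous install; the "opni" namespace
# and the label selector below are assumptions -- confirm them with
# --show-labels before deleting anything.
kubectl get pvc -n opni --show-labels

# Once confirmed stale, delete them so the reinstall bootstraps a fresh
# cluster state instead of mixing old and new cluster UUIDs:
kubectl delete pvc -n opni -l app.kubernetes.io/instance=opni
```

Note that deleting a PVC discards the log data it holds, so this is only appropriate when a clean reinstall is the goal.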

kralicky commented 1 year ago

As a user, I would expect that if I uninstall but don't delete persistent data, I can then reinstall with the existing data. Conversely, if I click the option to delete persistent data, it should actually delete the data.

dbason commented 1 year ago

For capabilities yes. For actually uninstalling a backend I think there's a different set of expectations.

kralicky commented 1 year ago

Would this change if log data was stored in s3 instead of locally?