rancher / opni

Multi Cluster Observability with AIOps
https://opni.io
Apache License 2.0

Logging : auth setup failure #1279

Open alexandreLamarre opened 1 year ago

alexandreLamarre commented 1 year ago

Observed behaviour

The opni-dashboards pods are unable to access OpenSearch because of an invalid auth setup.

[2023-04-04T18:29:25,393][WARN ][o.o.s.a.BackendRegistry  ] [opni-data-1] Authentication finally failed for internalopni from 10.0.5.179:50322

Expected behaviour

The opni-dashboards pods are able to access OpenSearch.

Steps to reproduce

config-primary

dashboard config-primary

Attempted fix

alexandreLamarre commented 1 year ago

Probably related to https://github.com/rancher/opni/issues/1183

dbason commented 1 year ago

Can we please collect logs from the manager pod?

alexandreLamarre commented 1 year ago

There are no relevant error logs from the manager pod

alexandreLamarre commented 1 year ago

All resources are created, even the secret containing the internal opni user, but the setup is erroneous somehow because that user fails to authenticate.

dbason commented 1 year ago

The only other thing I can think of is to try enabling persistent storage. This may be related to that as I tend to always use it.

alexandreLamarre commented 1 year ago

Yeah I tried installing it with persistent storage as well, and the issue persisted

sanjay920 commented 1 year ago

Attaching the bootstrap logs. Looks like there's relevant info in there: opni-bootstrap-0_opensearch.log

alexandreLamarre commented 1 year ago

Found some TLS handshake errors in the cert-manager webhook logs:

I0404 19:13:28.384262       1 logs.go:59] http: TLS handshake error from 10.0.17.245:58904: read tcp 10.0.17.241:10250->10.0.17.245:58904: read: connection reset by peer
I0404 19:13:28.434664       1 logs.go:59] http: TLS handshake error from 10.0.17.245:58916: EOF
I0404 19:13:28.444426       1 logs.go:59] http: TLS handshake error from 10.0.17.245:58932: read tcp 10.0.17.241:10250->10.0.17.245:58932: read: connection reset by peer
I0404 19:13:28.462671       1 logs.go:59] http: TLS handshake error from 10.0.17.245:58946: read tcp 10.0.17.241:10250->10.0.17.245:58946: read: connection reset by peer

These originate from one of the dashboards pods.
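
To double-check which peer the handshake errors come from, the webhook log lines can be tallied by source address. This is a minimal Python sketch that assumes the log format pasted above; the function name `count_handshake_errors` is made up for illustration and is not part of any tool.

```python
import re
from collections import Counter

# Matches the peer address in lines such as
# "http: TLS handshake error from 10.0.17.245:58904: EOF".
HANDSHAKE_RE = re.compile(r"TLS handshake error from (\d+\.\d+\.\d+\.\d+):(\d+)")


def count_handshake_errors(lines):
    """Return a Counter mapping peer IP to the number of handshake errors."""
    counts = Counter()
    for line in lines:
        match = HANDSHAKE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Two of the lines from the webhook log above.
sample = [
    "I0404 19:13:28.434664       1 logs.go:59] http: TLS handshake error "
    "from 10.0.17.245:58916: EOF",
    "I0404 19:13:28.444426       1 logs.go:59] http: TLS handshake error "
    "from 10.0.17.245:58932: read tcp 10.0.17.241:10250->10.0.17.245:58932: "
    "read: connection reset by peer",
]

for ip, count in count_handshake_errors(sample).items():
    print(ip, count)
```

The resulting IP can then be compared against the pod IPs listed by `kubectl get pods -o wide` to identify the pod.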

The cert-manager pod is also logging optimistic-locking conflicts on its Certificate resources:


I0404 23:06:01.731192       1 controller.go:162] cert-manager/certificates-readiness "msg"="re-queuing item due to optimistic locking on resource" "error"="Operation cannot be fulfilled on certificates.cert-manager.io \"opensearch-opni-internalopni\": the object has been modified; please apply your changes to the latest version and try again" "key"="opni/opensearch-opni-internalopni"
I0404 23:06:01.731537       1 conditions.go:192] Found status change for Certificate "opensearch-opni-internalopni" condition "Ready": "False" -> "True"; setting lastTransitionTime to 2023-04-04 23:06:01.731528865 +0000 UTC m=+14652.919641752
I0404 23:06:01.741083       1 controller.go:162] cert-manager/certificates-readiness "msg"="re-queuing item due to optimistic locking on resource" "error"="Operation cannot be fulfilled on certificates.cert-manager.io \"opensearch-opni-admin\": the object has been modified; please apply your changes to the latest version and try again" "key"="opni/opensearch-opni-admin"
I0404 23:06:01.741344       1 conditions.go:192] Found status change for Certificate "opensearch-opni-admin" condition "Ready": "False" -> "True"; setting lastTransitionTime to 2023-04-04 23:06:01.741336694 +0000 UTC m=+14652.929449578
I0404 23:06:01.770689       1 controller.go:162] cert-manager/certificates-readiness "msg"="re-queuing item due to optimistic locking on resource" "error"="Operation cannot be fulfilled on certificates.cert-manager.io \"opensearch-opni-internalopni\": the object has been modified; please apply your changes to the latest version and try again" "key"="opni/opensearch-opni-internalopni"
I0404 23:06:01.770995       1 conditions.go:192] Found status change for Certificate "opensearch-opni-internalopni" condition "Ready": "False" -> "True"; setting lastTransitionTime to 2023-04-04 23:06:01.770987664 +0000 UTC m=+14652.959100552
I0404 23:06:01.806932       1 controller.go:162] cert-manager/certificates-key-manager "msg"="re-queuing item due to optimistic locking on resource" "error"="Operation cannot be fulfilled on certificates.cert-manager.io \"opensearch-opni-opni-indexer\": the object has been modified; please apply your changes to the latest version and try again" "key"="opni/opensearch-opni-opni-indexer"

alexandreLamarre commented 1 year ago

[23:25:09] INFO tracing feature enabled: false {"controller": "multiclusterrolebinding", "controllerGroup": "logging.opni.io", "controllerKind": "MulticlusterRoleBinding", "MulticlusterRoleBinding": {"name":"opni","namespace":"opni"}, "namespace": "opni", "name": "opni", "reconcileID": "6a6a8b28-6046-4ebc-8772-a1dde8656a40"}
[23:25:09] ERROR Reconciler error {"controller": "multiclusterrolebinding", "controllerGroup": "logging.opni.io", "controllerKind": "MulticlusterRoleBinding", "MulticlusterRoleBinding": {"name":"opni","namespace":"opni"}, "namespace": "opni", "name": "opni", "reconcileID": "6a6a8b28-6046-4ebc-8772-a1dde8656a40", "error": "failed to create rolesmapping: [500 Internal Server Error] {\"status\":\"INTERNAL_SERVER_ERROR\",\"message\":\"Security index not initialized\"}"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:329
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:274
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:235
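
The "error" field of that log line wraps a JSON response body, which is easy to miss in the escaped form. The sketch below unpacks it in Python, assuming the line layout shown above (a text prefix followed by one JSON object); `parse_reconciler_error` is a hypothetical helper, and the sample line is abbreviated from the log above.

```python
import json


def parse_reconciler_error(line):
    """Return (fields, body) for a reconciler log line.

    fields is the JSON object that follows the text prefix; body is the
    JSON document embedded in the "error" string, or None if there is none.
    """
    fields = json.loads(line[line.index("{"):])
    error = fields.get("error", "")
    body = None
    start = error.find("{")
    if start != -1:
        try:
            body = json.loads(error[start:])
        except json.JSONDecodeError:
            body = None
    return fields, body


# Abbreviated from the manager log above (some fields dropped for brevity).
line = (
    r'[23:25:09] ERROR Reconciler error {"controller": "multiclusterrolebinding", '
    r'"namespace": "opni", "name": "opni", '
    r'"error": "failed to create rolesmapping: [500 Internal Server Error] '
    r'{\"status\":\"INTERNAL_SERVER_ERROR\",\"message\":\"Security index not initialized\"}"}'
)

fields, body = parse_reconciler_error(line)
print(fields["controller"], body["status"], body["message"])
```

Unpacked this way, the message reads as a response from OpenSearch itself: the rolesmapping request got a reply, and that reply reports that the security index was not initialized.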

The opni MulticlusterRoleBinding (logging.opni.io) resource:

status:
  state: Error
spec:
  opensearch:
    name: opni
    namespace: opni
  opensearchConfig:
    indexRetention: 7d