rancher / opni

Multi Cluster Observability with AIOps
https://opni.io
Apache License 2.0

Cortex is in pending state without starting workloads #1735

Open alexandreLamarre opened 1 year ago

alexandreLamarre commented 1 year ago

Steps to reproduce

Expected

UI

Backend

Not classified

opni-manager logs

[20:46:48] ERROR monitoring failed to reconcile monitoring cluster {"gateway": "opni", "namespace": "opni", "error": "Gateway.core.opni.io \"\" not found"}
github.com/rancher/opni/controllers.(*CoreMonitoringReconciler).Reconcile
    github.com/rancher/opni/controllers/core_monitoring_controller.go:86
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
    sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:118
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:314
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:265
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:226
[20:46:48] ERROR Reconciler error {"controller": "monitoringcluster", "controllerGroup": "core.opni.io", "controllerKind": "MonitoringCluster", "MonitoringCluster": {"name":"opni","namespace":"opni"}, "namespace": "opni", "name": "opni", "reconcileID": "6bd50549-e075-4975-b56c-e912bc095992", "error": "Gateway.core.opni.io \"\" not found"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:324
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:265
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:226

MonitoringCluster CR

spec:
  cortex:
    # .... (this is correct)
    cortexWorkloads:
      targets:
        all:
          replicas: 1
    enabled: true
  gateway: {}
  grafana:
    config: {}
    dashboardContentCacheDuration: 0s
    enabled: true
    hostname: # ... (this is correct)

gateway pod ENV

# ...
GATEWAY_NAME=opni-gateway
POD_NAMESPACE=opni
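The empty-name lookup in the logs above is consistent with the `gateway: {}` spec: the reconciler takes the ref as-is, so the Get runs with an empty object name. A minimal Go sketch of that failure mode (stand-in types, not the actual controller code):

```go
package main

import "fmt"

// gatewayRef mirrors the CR's spec.gateway object reference
// (field names are assumptions based on the CR dump above).
type gatewayRef struct {
	Name      string
	Namespace string
}

// lookupGateway sketches what the reconciler's Get call does with the
// ref taken verbatim from the spec: with `gateway: {}` both fields are
// empty, so the lookup fails with an empty-name NotFound error, which
// matches the log line `Gateway.core.opni.io "" not found`.
func lookupGateway(ref gatewayRef, existing map[string]bool) error {
	key := ref.Namespace + "/" + ref.Name
	if !existing[key] {
		// controller-runtime NotFound errors render as
		// `<Kind>.<group> "<name>" not found`
		return fmt.Errorf("Gateway.core.opni.io %q not found", ref.Name)
	}
	return nil
}

func main() {
	gateways := map[string]bool{"opni/opni-gateway": true}

	// spec.gateway: {} -> empty ref, reproduces the reconcile error
	fmt.Println(lookupGateway(gatewayRef{}, gateways))

	// fully-populated ref resolves fine
	fmt.Println(lookupGateway(gatewayRef{Name: "opni-gateway", Namespace: "opni"}, gateways))
}
```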
alexandreLamarre commented 1 year ago

Deleting the CR and attempting to re-install via the UI results in the following call failing:

http://localhost:12080/opni-api/CortexOps/configuration/default


nats: wrong last sequence: 1: key exists
(error detail type: type.googleapis.com/google.rpc.ErrorInfo)

CONFLICT
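The `wrong last sequence ... key exists` text reads like a JetStream KV create conflict: deleting the CR evidently does not clear the stored default CortexOps config, so the re-install's create call hits the old revision. A simplified Go stand-in for that create-only semantic (not nats.go):

```go
package main

import (
	"errors"
	"fmt"
)

// kv sketches create-only semantics like a NATS JetStream KV Create:
// creating a key that already has a revision fails with a sequence
// conflict instead of overwriting. (Simplified stand-in, not nats.go.)
type kv struct {
	data map[string]string
	rev  map[string]uint64
}

var errKeyExists = errors.New("nats: wrong last sequence: key exists")

func (s *kv) Create(key, value string) (uint64, error) {
	if _, ok := s.rev[key]; ok {
		// the old revision survives even though the CR was deleted,
		// so the re-install's Create call comes back as a CONFLICT
		return 0, errKeyExists
	}
	s.data[key] = value
	s.rev[key] = 1
	return 1, nil
}

func main() {
	store := &kv{data: map[string]string{}, rev: map[string]uint64{}}
	_, _ = store.Create("CortexOps/default", "all-in-one")
	_, err := store.Create("CortexOps/default", "ha") // second create conflicts
	fmt.Println(err)
}
```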
alexandreLamarre commented 1 year ago

There is something weird going on with the gateway ref. After I fixed it manually and got the Cortex all-in-one preset to install, switching to HA preset mode in the UI edited the CR to contain:

gateway:
  name: opni-gateway

but did not include the namespace

alexandreLamarre commented 1 year ago

This also did not switch Cortex to the HA preset:

The status indicates it thinks it should be all-in-one and has deleted the HA workloads:

status:
  cortex:
    version: v1.16.0-opni.8
    workloadStatus:
      alertmanager:
        conditions: StatefulSet has been successfully deleted
        ready: true
      all:
        conditions: All replicas are ready
        ready: true
      compactor:
        conditions: StatefulSet has been successfully deleted
        ready: true
      distributor:
        conditions: Deployment has been successfully deleted
        ready: true
      ingester:
        conditions: StatefulSet has been successfully deleted
        ready: true
      purger:
        conditions: Deployment has been successfully deleted
        ready: true
      querier:
        conditions: StatefulSet has been successfully deleted
        ready: true
      query-frontend:
        conditions: Deployment has been successfully deleted
        ready: true
      ruler:
        conditions: Deployment has been successfully deleted
        ready: true
      store-gateway:
        conditions: StatefulSet has been successfully deleted
        ready: true
    workloadsReady: true
  image: >-
    alex7285/opni@sha256:6180a4e04fe1b310b02c437766fbe20bf3702304e5cebf38a797140647d46435
  imagePullPolicy: IfNotPresent
spec:
  cortex:
    cortexConfig:
      limits:
        compactor_blocks_retention_period: {}
      log_level: debug
      storage:
        backend: s3
        filesystem: {}
        s3:
        # ....
    cortexWorkloads:
      targets:
        all:
          replicas: 1
    enabled: true
  gateway:
    name: opni-gateway
    namespace: opni
  grafana:
    config: {}
    dashboardContentCacheDuration: 0s
    enabled: true
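For contrast, a hypothetical sketch of what the HA preset was presumably expected to write into spec.cortex.cortexWorkloads instead of the single `all` target. The component names are taken from the workloadStatus dump above; the replica counts are illustrative guesses, not the actual preset values:

```yaml
cortexWorkloads:
  targets:
    compactor:
      replicas: 3
    distributor:
      replicas: 3
    ingester:
      replicas: 3
    querier:
      replicas: 3
    query-frontend:
      replicas: 2
    ruler:
      replicas: 3
    store-gateway:
      replicas: 3
```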
alexandreLamarre commented 1 year ago

Trying to then edit the config in the UI results in the following bug: the form rejects the config as invalid.

alexandreLamarre commented 1 year ago

When the HA configuration was accepted by the UI, it was not applied to the backend:

{
    "enabled": true,
    "revision": {
        "revision": "56399193"
    },
    "cortexWorkloads": {
        "targets": {
            "all": {
                "replicas": 1
            }
        }
    },
    "cortexConfig": {
        "limits": {
            "compactorBlocksRetentionPeriod": "0s"
        },
        "storage": {
            "backend": "s3",
            "s3": {
                "endpoint": "s3.us-east-1.amazonaws.com",
                "region": "us-east-1",
                "secretAccessKey": "***",
                "accessKeyId": "AKIARHLSZXXGKCKBHQVX",
                "sse": {},
                "http": {}
            },
            "filesystem": {}
        },
        "logLevel": "debug"
    },
    "grafana": {
        "enabled": true,
        "hostname": "//...
    }
}
alexandreLamarre commented 1 year ago

This issue also tracks the UI failures and the expected UI behavior.