stolostron / multicluster-observability-operator

Operator for Multi-Cluster Monitoring with Thanos.
Apache License 2.0

Internal error occurred: error resolving resource when running on plain k8s #1193

Open david-martin opened 1 year ago

david-martin commented 1 year ago

I see the README is geared toward the hub cluster being OpenShift, however I also see wording that suggests this can work on plain k8s as well:

Note: By default, the API conversion webhook uses the OpenShift service serving certificate feature to manage the certificate; you can replace it with cert-manager if you want to run the multicluster-observability-operator in a Kubernetes cluster.

I'm using local kind clusters with k8s v1.26.0. I've gotten as far as the command below but I'm hitting an error. I'm thinking it could be related to the webhook?

kubectl -n open-cluster-management-observability apply -f operators/multiclusterobservability/config/samples/observability_v1beta2_multiclusterobservability.yaml

Error from server (InternalError): error when retrieving current configuration of:
Resource: "observability.open-cluster-management.io/v1beta2, Resource=multiclusterobservabilities", GroupVersionKind: "observability.open-cluster-management.io/v1beta2, Kind=MultiClusterObservability"
Name: "observability", Namespace: ""
from server for: "operators/multiclusterobservability/config/samples/observability_v1beta2_multiclusterobservability.yaml": Internal error occurred: error resolving resource

Is there a way I can get this add-on to work with plain k8s?

david-martin commented 1 year ago

Looks like a cert issue alright:

kubectl -n open-cluster-management describe po multicluster-observability-operator-77446bdd89-xp4fm | grep -A 6 Events
Events:
  Type     Reason       Age                   From               Message
  ----     ------       ----                  ----               -------
  Normal   Scheduled    26m                   default-scheduler  Successfully assigned open-cluster-management/multicluster-observability-operator-77446bdd89-xp4fm to ocm-cluster-1-control-plane
  Warning  FailedMount  24m                   kubelet            Unable to attach or mount volumes: unmounted volumes=[cert], unattached volumes=[kube-api-access-w8kh6 cert]: timed out waiting for the condition
  Warning  FailedMount  4m1s (x9 over 22m)    kubelet            Unable to attach or mount volumes: unmounted volumes=[cert], unattached volumes=[cert kube-api-access-w8kh6]: timed out waiting for the condition
  Warning  FailedMount  3m56s (x19 over 26m)  kubelet            MountVolume.SetUp failed for volume "cert" : secret "multicluster-observability-operator-webhook-server-cert" not found

And I can see the OpenShift annotation on the service:

kubectl -n open-cluster-management get svc multicluster-observability-webhook-service -o yaml | grep "serving-cert-secret-name"
      {"apiVersion":"v1","kind":"Service","metadata":{"annotations":{"service.beta.openshift.io/serving-cert-secret-name":"multicluster-observability-operator-webhook-server-cert"},"labels":{"name":"multicluster-observability-operator"},"name":"multicluster-observability-webhook-service","namespace":"open-cluster-management"},"spec":{"ports":[{"port":443,"protocol":"TCP","targetPort":9443}],"selector":{"name":"multicluster-observability-operator"}}}
    service.beta.openshift.io/serving-cert-secret-name: multicluster-observability-operator-webhook-server-cert

I wonder what steps are required to get cert-manager to inject the cert in the right place.

david-martin commented 1 year ago

I've gotten a little further with the help of cert-manager. Here are the commands I used.

Install cert-manager

kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.11.0/cert-manager.yaml

Create a Certificate for the webhook service

cat <<EOF | kubectl apply -f - 
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: multicluster-observability-operator-issuer
  namespace: open-cluster-management
spec:
  selfSigned: {}
EOF
cat <<EOF | kubectl apply -f - 
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: multicluster-observability-operator-webhook-server-cert
  namespace: open-cluster-management
spec:
  dnsNames:
    - multicluster-observability-webhook-service.open-cluster-management.svc
  secretName: multicluster-observability-operator-webhook-server-cert
  issuerRef:
    name: multicluster-observability-operator-issuer
EOF
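To confirm the issuance worked before moving on, it may help to check that the Certificate reports Ready and that cert-manager created the secret the operator pod has been waiting to mount (these are standard kubectl/cert-manager checks, not steps from the repo's docs):

```shell
# The Certificate should show READY=True once cert-manager has issued it
kubectl -n open-cluster-management get certificate \
  multicluster-observability-operator-webhook-server-cert

# This is the secret the operator pod's "cert" volume mounts; once it exists,
# the FailedMount events should stop after the kubelet retries
kubectl -n open-cluster-management get secret \
  multicluster-observability-operator-webhook-server-cert
```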

Add the cert-manager inject annotation to the CRD & ValidatingWebhookConfiguration.

kubectl annotate crd multiclusterobservabilities.observability.open-cluster-management.io cert-manager.io/inject-ca-from=open-cluster-management/multicluster-observability-operator-webhook-server-cert
kubectl annotate ValidatingWebhookConfiguration multicluster-observability-operator cert-manager.io/inject-ca-from=open-cluster-management/multicluster-observability-operator-webhook-server-cert

Now I can list and create MultiClusterObservability CRs. When I create the sample from the repo at operators/multiclusterobservability/config/samples/observability_v1beta2_multiclusterobservability.yaml, I'm seeing a new set of problems.

kubectl -n open-cluster-management-observability get pod
NAME                                                    READY   STATUS              RESTARTS   AGE
minio-59b76b4cd-4r6kc                                   0/1     Pending             0          4m22s
observability-alertmanager-0                            0/3     ContainerCreating   0          2m38s
observability-grafana-855b85957d-7fglt                  0/3     ContainerCreating   0          2m40s
observability-grafana-855b85957d-ssh4g                  0/3     ContainerCreating   0          2m40s
observability-observatorium-operator-79f8fc5fc8-5xzzq   0/1     ImagePullBackOff    0          2m40s

The minio problem is:

  Warning  FailedScheduling  86s (x5 over 8m24s)  default-scheduler  0/1 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod..
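An unbound-PVC scheduling failure on kind usually means the claim references a storage class that doesn't exist in the cluster (kind ships a default `standard` class backed by the local-path provisioner). A possible way to diagnose, and a hedged fix assuming the MCO CR's `spec.storageConfig.storageClass` field controls the class used:

```shell
# See which classes exist (on kind, typically just "standard" marked default)
kubectl get storageclass

# See which class the pending PVC is actually asking for
kubectl -n open-cluster-management-observability get pvc

# If the PVC names a nonexistent class, setting the MCO CR's storage class to
# kind's default may resolve it (field path assumed from the v1beta2 API)
#   spec:
#     storageConfig:
#       storageClass: standard
```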

Alertmanager:

  Normal   Scheduled    7m37s                 default-scheduler  Successfully assigned open-cluster-management-observability/observability-alertmanager-0 to ocm-cluster-1-control-plane
  Warning  FailedMount  3m19s                 kubelet            Unable to attach or mount volumes: unmounted volumes=[tls-secret], unattached volumes=[kube-api-access-9src9 tls-secret alertmanager-proxy config-volume alertmanager-db]: timed out waiting for the condition
  Warning  FailedMount  85s (x11 over 7m37s)  kubelet            MountVolume.SetUp failed for volume "tls-secret" : secret "alertmanager-tls" not found
  Warning  FailedMount  64s (x2 over 5m34s)   kubelet            Unable to attach or mount volumes: unmounted volumes=[tls-secret], unattached volumes=[config-volume alertmanager-db kube-api-access-9src9 tls-secret alertmanager-proxy]: timed out waiting for the condition

Grafana 1:

  Normal   Scheduled    8m6s                  default-scheduler  Successfully assigned open-cluster-management-observability/observability-grafana-855b85957d-7fglt to ocm-cluster-1-control-plane
  Warning  FailedMount  7m2s (x8 over 8m5s)   kubelet            MountVolume.SetUp failed for volume "cookie-secret" : secret "rbac-proxy-cookie-secret" not found
  Warning  FailedMount  7m2s (x8 over 8m5s)   kubelet            MountVolume.SetUp failed for volume "tls-secret" : secret "grafana-tls" not found
  Warning  FailedMount  6m3s                  kubelet            Unable to attach or mount volumes: unmounted volumes=[grafana-datasources tls-secret cookie-secret], unattached volumes=[grafana-datasources grafana-config kube-api-access-9m4l8 tls-secret cookie-secret grafana-storage]: timed out waiting for the condition
  Warning  FailedMount  114s (x11 over 8m5s)  kubelet            MountVolume.SetUp failed for volume "grafana-datasources" : secret "grafana-datasources" not found

Grafana 2:

  Normal   Scheduled    8m43s                   default-scheduler  Successfully assigned open-cluster-management-observability/observability-grafana-855b85957d-ssh4g to ocm-cluster-1-control-plane
  Warning  FailedMount  7m39s (x8 over 8m42s)   kubelet            MountVolume.SetUp failed for volume "grafana-datasources" : secret "grafana-datasources" not found
  Warning  FailedMount  7m39s (x8 over 8m42s)   kubelet            MountVolume.SetUp failed for volume "cookie-secret" : secret "rbac-proxy-cookie-secret" not found
  Warning  FailedMount  6m40s                   kubelet            Unable to attach or mount volumes: unmounted volumes=[grafana-datasources tls-secret cookie-secret], unattached volumes=[grafana-datasources grafana-config kube-api-access-2jv7b tls-secret cookie-secret grafana-storage]: timed out waiting for the condition
  Warning  FailedMount  2m31s (x11 over 8m42s)  kubelet            MountVolume.SetUp failed for volume "tls-secret" : secret "grafana-tls" not found

The ImagePullBackOff is due to a missing image tag on quay.io:

  Normal   BackOff    68s (x20 over 6m20s)   kubelet            Back-off pulling image "quay.io/stolostron/observatorium-operator:2.4.0-SNAPSHOT-2021-09-23-07-02-14"
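One way to confirm the tag is genuinely absent (and to find a tag that does exist to substitute) is to list the published tags for the image; this sketch assumes you have `skopeo` installed locally:

```shell
# List all tags published for the observatorium-operator image on quay.io;
# if 2.4.0-SNAPSHOT-2021-09-23-07-02-14 isn't in the output, the pull can never succeed
skopeo list-tags docker://quay.io/stolostron/observatorium-operator
```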