stormshift / support

This repo should serve as a central source for reporting issues with stormshift
GNU General Public License v3.0

PodDisruptionBudgetLimit - open-cluster-management-observability/observability-thanos-receive-default #197

Closed rbo closed 2 months ago

rbo commented 2 months ago
~ % oc describe pdb/observability-thanos-receive-default
Name:             observability-thanos-receive-default
Namespace:        open-cluster-management-observability
Max unavailable:  1
Selector:         app.kubernetes.io/component=database-write-hashring,app.kubernetes.io/instance=observability,app.kubernetes.io/name=thanos-receive,app.kubernetes.io/part-of=observatorium,controller.receive.thanos.io/hashring=default
Status:
    Allowed disruptions:  0
    Current:              1
    Desired:              2
    Total:                3
Events:                   <none>
~ % oc get pods -l app.kubernetes.io/component=database-write-hashring,app.kubernetes.io/instance=observability,app.kubernetes.io/name=thanos-receive,app.kubernetes.io/part-of=observatorium,controller.receive.thanos.io/hashring=default
NAME                                     READY   STATUS              RESTARTS   AGE
observability-thanos-receive-default-0   1/1     Running             1          23d
observability-thanos-receive-default-1   0/1     ContainerCreating   0          2m40s
~ %
~ % oc describe pod observability-thanos-receive-default-1
Name:             observability-thanos-receive-default-1
Namespace:        open-cluster-management-observability
Priority:         0
Service Account:  observability-thanos-receive
Node:             ucs56/10.32.96.56
Start Time:       Thu, 29 Aug 2024 15:23:18 +0200
Labels:           app.kubernetes.io/component=database-write-hashring
                  app.kubernetes.io/instance=observability
                  app.kubernetes.io/name=thanos-receive
                  app.kubernetes.io/part-of=observatorium
                  app.kubernetes.io/version=v0.24.0
                  apps.kubernetes.io/pod-index=1
                  controller-revision-hash=observability-thanos-receive-default-7c8877fc54
                  controller.receive.thanos.io/hashring=default
                  statefulset.kubernetes.io/pod-name=observability-thanos-receive-default-1
Annotations:      k8s.ovn.org/pod-networks:
                    {"default":{"ip_addresses":["10.129.13.209/21"],"mac_address":"0a:58:0a:81:0d:d1","gateway_ips":["10.129.8.1"],"routes":[{"dest":"10.128.0...
                  openshift.io/scc: restricted-v2
                  seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status:           Pending
SeccompProfile:   RuntimeDefault
IP:
IPs:              <none>
Controlled By:    StatefulSet/observability-thanos-receive-default
Containers:
  thanos-receive:
    Container ID:
    Image:         registry.redhat.io/rhacm2/thanos-rhel9@sha256:9f85d747ef8c11a0e5c6612110adc7e8a180750057a33ef41554ae6f1de175b0
    Image ID:
    Ports:         10901/TCP, 10902/TCP, 19291/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP
    Args:
      receive
      --log.level=info
      --log.format=logfmt
      --grpc-address=0.0.0.0:10901
      --http-address=0.0.0.0:10902
      --remote-write.address=0.0.0.0:19291
      --receive.replication-factor=3
      --tsdb.path=/var/thanos/receive
      --tsdb.retention=48h
      --label=replica="$(NAME)"
      --label=receive="true"
      --objstore.config=$(OBJSTORE_CONFIG)
      --receive.local-endpoint=$(NAME).observability-thanos-receive-default.$(NAMESPACE).svc.cluster.local:10901
      --receive.hashrings-file=/var/lib/thanos-receive/hashrings.json
      --tsdb.too-far-in-future.time-window=5m
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     300m
      memory:  2Gi
    Requests:
      cpu:      300m
      memory:   512Mi
    Liveness:   http-get http://:10902/-/healthy delay=0s timeout=1s period=30s #success=1 #failure=8
    Readiness:  http-get http://:10902/-/ready delay=0s timeout=1s period=5s #success=1 #failure=20
    Environment:
      NAME:             observability-thanos-receive-default-1 (v1:metadata.name)
      NAMESPACE:        open-cluster-management-observability (v1:metadata.namespace)
      HOST_IP_ADDRESS:   (v1:status.hostIP)
      OBJSTORE_CONFIG:  <set to the key 'thanos.yaml' in secret 'thanos-object-storage'>  Optional: false
    Mounts:
      /var/lib/thanos-receive from hashring-config (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-6nvrw (ro)
      /var/thanos/receive from data (rw)
Conditions:
  Type                        Status
  PodReadyToStartContainers   False
  Initialized                 True
  Ready                       False
  ContainersReady             False
  PodScheduled                True
Volumes:
  data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  data-observability-thanos-receive-default-1
    ReadOnly:   false
  hashring-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      observability-thanos-receive-controller-tenants-generated
    Optional:  false
  kube-api-access-6nvrw:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   Burstable
Node-Selectors:              kubernetes.io/os=linux
Tolerations:                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason              Age    From                     Message
  ----     ------              ----   ----                     -------
  Normal   Scheduled           2m49s  default-scheduler        Successfully assigned open-cluster-management-observability/observability-thanos-receive-default-1 to ucs56
  Warning  FailedAttachVolume  2m49s  attachdetach-controller  Multi-Attach error for volume "pvc-23ed4021-866d-41c0-9eb6-1c0dc08610a5" Volume is already exclusively attached to one node and can't be attached to another
~ %
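
For reference, the PDB status shown at the start of this comment can be checked by hand. The sketch below assumes the usual rule for a budget defined with "Max unavailable": desired healthy is total minus max unavailable, and allowed disruptions is current healthy minus desired healthy, never below zero. The function names are made up for illustration and are not part of oc.

```shell
#!/bin/sh
# Hypothetical helpers, not part of oc or kubectl.

# desired_healthy TOTAL MAX_UNAVAILABLE
# Number of pods that must stay healthy for a maxUnavailable-style PDB.
desired_healthy() {
  echo $(( $1 - $2 ))
}

# allowed_disruptions CURRENT_HEALTHY DESIRED_HEALTHY
# Number of voluntary evictions the budget would permit right now.
allowed_disruptions() {
  allowed=$(( $1 - $2 ))
  # The budget never reports a negative value.
  if [ "$allowed" -lt 0 ]; then
    allowed=0
  fi
  echo "$allowed"
}

# Values from the describe output: Total 3, Max unavailable 1, Current 1.
# desired_healthy 3 1
# allowed_disruptions 1 2
```

With Total 3 and Max unavailable 1, Desired is 2. Only one pod is healthy, so no further disruption is allowed, which matches "Allowed disruptions: 0" above.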
rbo commented 2 months ago
~ % oc describe pv pvc-23ed4021-866d-41c0-9eb6-1c0dc08610a5
Name:            pvc-23ed4021-866d-41c0-9eb6-1c0dc08610a5
Labels:          <none>
Annotations:     pv.kubernetes.io/provisioned-by: csi.trident.netapp.io
                 volume.kubernetes.io/provisioner-deletion-secret-name:
                 volume.kubernetes.io/provisioner-deletion-secret-namespace:
Finalizers:      [kubernetes.io/pv-protection external-attacher/csi-trident-netapp-io]
StorageClass:    coe-netapp-san
Status:          Bound
Claim:           open-cluster-management-observability/data-observability-thanos-receive-default-1
Reclaim Policy:  Delete
Access Modes:    RWO
VolumeMode:      Filesystem
Capacity:        10Gi
Node Affinity:   <none>
Message:
Source:
    Type:              CSI (a Container Storage Interface (CSI) volume source)
    Driver:            csi.trident.netapp.io
    FSType:            ext4
    VolumeHandle:      pvc-23ed4021-866d-41c0-9eb6-1c0dc08610a5
    ReadOnly:          false
    VolumeAttributes:      backendUUID=cd7b267e-7ff1-42ff-a2b5-617216ba06ea
                           internalName=isar_pvc_23ed4021_866d_41c0_9eb6_1c0dc08610a5
                           name=pvc-23ed4021-866d-41c0-9eb6-1c0dc08610a5
                           protocol=block
                           storage.kubernetes.io/csiProvisionerIdentity=1716842609293-2854-csi.trident.netapp.io
Events:                <none>
~ %

~ % ssh -l admin netapp-mgmt.coe.muc.redhat.com lun mapping show  | grep isar_pvc_23ed4021_866d_41c0_9eb6_1c0dc08610a5
Warning: Permanently added 'netapp-mgmt.coe.muc.redhat.com' (ED25519) to the list of known hosts.
svm_trident /vol/trident_lun_pool_isar_ZTIFFWGFPK/isar_pvc_23ed4021_866d_41c0_9eb6_1c0dc08610a5  ucs-blade-server-3-0f228f1d-6034-47c8-b456-0e13c65e964c  7  iscsi
~ %

The LUN backing the PVC is mapped to the igroup of ucs-blade-server-3
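
The lookup above can be scripted roughly as follows. This is only a sketch: igroup_for_lun is a made-up helper name, and it assumes the column order seen in the output above (vserver, LUN path, igroup, LUN ID, protocol).

```shell
#!/bin/sh
# igroup_for_lun INTERNAL_NAME   (hypothetical helper)
# Reads `lun mapping show` output on stdin and prints the igroup column
# for every LUN whose path ends in INTERNAL_NAME.
igroup_for_lun() {
  awk -v name="$1" '
    { n = split($2, parts, "/") }      # $2 is the LUN path
    parts[n] == name { print $3 }      # $3 is the igroup
  '
}

# Possible usage, with the PV name and NetApp host taken from this issue:
#   pv=pvc-23ed4021-866d-41c0-9eb6-1c0dc08610a5
#   internal=$(oc get pv "$pv" -o jsonpath='{.spec.csi.volumeAttributes.internalName}')
#   ssh -l admin netapp-mgmt.coe.muc.redhat.com lun mapping show | igroup_for_lun "$internal"
```

Matching on the last path component instead of a plain grep avoids false hits when one LUN name is a prefix of another.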

rbo commented 2 months ago

Let's try to drain and reboot ucs-blade-server-3
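
A drain and reboot could look roughly like the sketch below. The exact commands used are not recorded in this issue, so treat this as one common way to do it rather than what was actually run. drain_and_reboot_cmds is a made-up helper that only prints the commands for review, and NODE must be the name shown by `oc get nodes`, which may differ from the igroup name on the NetApp side.

```shell
#!/bin/sh
# drain_and_reboot_cmds NODE   (hypothetical helper)
# Prints the commands instead of running them, so they can be reviewed
# and executed one at a time.
drain_and_reboot_cmds() {
  node=$1
  cat <<EOF
oc adm drain $node --ignore-daemonsets --delete-emptydir-data
oc debug node/$node -- chroot /host systemctl reboot
oc adm uncordon $node
EOF
}

# Run the uncordon step only after the node reports Ready again
# (check with: oc get node NODE).
# Example: drain_and_reboot_cmds worker-x
```

Note that drain honours PodDisruptionBudgets, so evicting a pod covered by a budget with zero allowed disruptions is retried until the budget permits it.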

rbo commented 2 months ago

Fixed, all three thanos-receive pods are Running again:

~ % oc get pods -l app.kubernetes.io/component=database-write-hashring,app.kubernetes.io/instance=observability,app.kubernetes.io/name=thanos-receive,app.kubernetes.io/part-of=observatorium,controller.receive.thanos.io/hashring=default -o wide
NAME                                     READY   STATUS    RESTARTS   AGE   IP              NODE     NOMINATED NODE   READINESS GATES
observability-thanos-receive-default-0   1/1     Running   0          46s   10.130.12.152   ucs57    <none>           <none>
observability-thanos-receive-default-1   1/1     Running   0          11m   10.129.13.209   ucs56    <none>           <none>
observability-thanos-receive-default-2   1/1     Running   0          97s   10.128.24.51    ceph12   <none>           <none>
~ %