stormshift / support

This repo should serve as a central source for reporting issues with stormshift

rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a- CrashLoopBackOff #149

Closed - rbo closed this issue 7 months ago

rbo commented 7 months ago

Timeout in the pod; the readinessProbe fails:

debug 2023-11-17T10:02:23.374+0000 7fad353a57c0 0 deferred set uid:gid to 167:167 (ceph:ceph)
debug 2023-11-17T10:02:23.374+0000 7fad353a57c0 0 ceph version 17.2.6-148.el9cp (badc1d27cb07762bea48f6554ad4f92b9d3fbb6b) quincy (stable), process radosgw, pid 1061
debug 2023-11-17T10:02:23.374+0000 7fad353a57c0 0 framework: beast
debug 2023-11-17T10:02:23.374+0000 7fad353a57c0 0 framework conf key: port, val: 8080
debug 2023-11-17T10:02:23.374+0000 7fad353a57c0 0 framework conf key: ssl_port, val: 443
debug 2023-11-17T10:02:23.374+0000 7fad353a57c0 0 framework conf key: ssl_certificate, val: /etc/ceph/private/rgw-cert.pem
debug 2023-11-17T10:02:23.374+0000 7fad353a57c0 0 framework conf key: ssl_private_key, val: /etc/ceph/private/rgw-key.pem
debug 2023-11-17T10:02:23.374+0000 7fad353a57c0 1 radosgw_Main not setting numa affinity
debug 2023-11-17T10:02:23.376+0000 7fad353a57c0 1 rgw_d3n: rgw_d3n_l1_local_datacache_enabled=0
debug 2023-11-17T10:02:23.376+0000 7fad353a57c0 1 D3N datacache enabled: 0
debug 2023-11-17T10:10:51.372+0000 7f754d57a640 -1 Initialization timeout, failed to initialize
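
To see which probe is timing out and what the kubelet reports, the probe definition and the recent pod events can be checked first; a short sketch (namespace and pod name taken from later in this thread):

oc -n openshift-storage describe pod rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-79f95f69d8tw | grep -i -A 5 readiness
oc -n openshift-storage get events --field-selector involvedObject.name=rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-79f95f69d8tw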

Working environment:

oc rsh rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-57cd8fchmrpg
Defaulted container "rgw" out of: rgw, log-collector, chown-container-data-dir (init)
sh-5.1# curl http://0.0.0.0:8080
<?xml version="1.0" encoding="UTF-8"?><ListAllMyBucketsResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/"><Owner><ID>anonymous</ID><DisplayName></DisplayName></Owner><Buckets></Buckets></ListAllMyBucketsResult>sh-5.1# 

ISAR Cluster

$ oc rsh rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-79f95f69d8tw
Defaulted container "rgw" out of: rgw, log-collector, chown-container-data-dir (init)
sh-5.1# curl http://0.0.0.0:8080
curl: (7) Failed to connect to 0.0.0.0 port 8080: Connection refused
sh-5.1# 
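
The comparison shows radosgw never binds to port 8080 on the ISAR cluster, matching the initialization timeout in the log above. To confirm that nothing is listening inside the failing pod, assuming ss is shipped in the rgw image (otherwise /proc/net/tcp can be read directly):

oc -n openshift-storage rsh rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-79f95f69d8tw ss -tlnp
oc -n openshift-storage rsh rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-79f95f69d8tw cat /proc/net/tcp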
rbo commented 7 months ago

Let's enable the Ceph toolbox - https://access.redhat.com/articles/4628891

oc patch OCSInitialization ocsinit -n openshift-storage --type json --patch  '[{ "op": "replace", "path": "/spec/enableCephTools", "value": true }]'

oc -n openshift-storage get pod -l "app=rook-ceph-tools"
NAME                              READY   STATUS    RESTARTS   AGE
rook-ceph-tools-5bbc55fdf-cv7x2   1/1     Running   0          14s

oc rsh deployment/rook-ceph-tools
..
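
Side note: once debugging is finished, the toolbox can presumably be switched off again with the same patch, just with the value set to false:

oc patch OCSInitialization ocsinit -n openshift-storage --type json --patch  '[{ "op": "replace", "path": "/spec/enableCephTools", "value": false }]'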
rbo commented 7 months ago
sh-5.1$ ceph health
HEALTH_WARN 1 MDSs report slow metadata IOs; Reduced data availability: 137 pgs inactive; Degraded data redundancy: 169 pgs undersized

sh-5.1$ ceph status
  cluster:
    id:     7ec91a51-f0ee-40a6-8d93-5d5c30dc0d67
    health: HEALTH_WARN
            1 MDSs report slow metadata IOs
            Reduced data availability: 137 pgs inactive
            Degraded data redundancy: 169 pgs undersized

  services:
    mon: 3 daemons, quorum a,b,c (age 38h)
    mgr: a(active, since 38h)
    mds: 1/1 daemons up, 1 standby
    osd: 3 osds: 3 up (since 38h), 3 in (since 38h); 32 remapped pgs

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 169 pgs
    objects: 285 objects, 816 MiB
    usage:   1.9 GiB used, 1.3 TiB / 1.3 TiB avail
    pgs:     81.065% pgs not active
             570/855 objects misplaced (66.667%)
             137 undersized+peered
             32  active+undersized+remapped

  io:
    client:   12 KiB/s wr, 0 op/s rd, 0 op/s wr

  progress:
    Global Recovery Event (0s)
      [............................] 

sh-5.1$ ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME       STATUS  REWEIGHT  PRI-AFF
-1         1.30980  root default                             
-3         1.30980      host inf44                           
 0    ssd  0.43660          osd.0       up   1.00000  1.00000
 1    ssd  0.43660          osd.1       up   1.00000  1.00000
 2    ssd  0.43660          osd.2       up   1.00000  1.00000
sh-5.1$ 
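With 137 PGs inactive and 169 undersized, it helps to see exactly which PGs are stuck and how the OSDs are filled; a few standard Ceph commands for that, run from the toolbox pod:

ceph health detail
ceph pg dump_stuck inactive
ceph osd df tree
ceph osd pool ls detail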
rbo commented 7 months ago

Why do I only have 3 OSD pods?

 oc get pods -o wide | grep osd 
rook-ceph-osd-0-7479486666-nzjqh                                  2/2     Running            0                38h   10.128.8.32    inf44                <none>           <none>
rook-ceph-osd-1-7f7b6f4bb5-jlv6l                                  2/2     Running            0                38h   10.128.8.35    inf44                <none>           <none>
rook-ceph-osd-2-5d4d48557-p2kjp                                   2/2     Running            0                38h   10.128.8.36    inf44                <none>           <none>
rook-ceph-osd-prepare-03d5d9dea68b6f8184c6b5545ce68586-vjfww      0/1     Completed          0                38h   10.128.8.29    inf44                <none>           <none>
rook-ceph-osd-prepare-095fc6277dd39c5d577393f1fe09f7ee-fwvcq      0/1     Completed          0                38h   10.128.16.31   inf7                 <none>           <none>
rook-ceph-osd-prepare-1113e506af934f35209a9ba2b63ec098-ffcdz      0/1     Completed          0                38h   10.131.8.30    inf8                 <none>           <none>
rook-ceph-osd-prepare-a607266483fda5b911a3dafbfef670e3-swh82      0/1     Completed          0                38h   10.128.8.30    inf44                <none>           <none>
rook-ceph-osd-prepare-c03093c6d966d9c7f13e419da2a780e9-jdsqq      0/1     Completed          0                38h   10.131.8.28    inf8                 <none>           <none>
rook-ceph-osd-prepare-c736908f8715093a69af797a5b38e6ae-2n6zh      0/1     Completed          0                38h   10.128.8.31    inf44                <none>           <none>
rook-ceph-osd-prepare-daf79088fdd9b1b15d2b2478c45155a7-rj9qr      0/1     Completed          0                38h   10.131.8.29    inf8                 <none>           <none>
rook-ceph-osd-prepare-e51b025211561bab8cb43e1af0f8111e-c6ll2      0/1     Completed          0                38h   10.128.16.30   inf7                 <none>           <none>
rook-ceph-osd-prepare-e834838f3706213e47a95a581231196f-nqhm8      0/1     Completed          0                38h   10.128.16.32   inf7                 <none>           <none>
 oc get -n openshift-storage pods -l app=rook-ceph-osd -o wide
NAME                               READY   STATUS    RESTARTS   AGE   IP            NODE    NOMINATED NODE   READINESS GATES
rook-ceph-osd-0-7479486666-nzjqh   2/2     Running   0          38h   10.128.8.32   inf44   <none>           <none>
rook-ceph-osd-1-7f7b6f4bb5-jlv6l   2/2     Running   0          38h   10.128.8.35   inf44   <none>           <none>
rook-ceph-osd-2-5d4d48557-p2kjp    2/2     Running   0          38h   10.128.8.36   inf44   <none>           <none>
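
All three OSDs ended up on inf44, even though the prepare jobs also ran on inf7 and inf8. With replica-3 pools and a host-level CRUSH failure domain (the usual default for a 3-node internal-mode ODF install), PGs can never reach active+clean when every OSD sits on one host, which matches the undersized/inactive PGs above. This can be verified from the toolbox:

ceph osd crush rule dump      # failure-domain type per rule, expected "host"
ceph osd pool ls detail       # replicated size per pool, expected 3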

=> Let's reinstall ODF

rbo commented 7 months ago

Reinstalled several times.

Solution:

  1. Uninstall the local-storage operator -> clean up all related objects, including storage classes, PVs, PVCs, and the namespace
  2. Uninstall OpenShift Data Foundation -> clean up all related objects, including storage classes, PVs, PVCs, and the namespace
  3. Drain all storage nodes: oc adm drain <node> --ignore-daemonsets --delete-emptydir-data
  4. Wipe all devices on all nodes with dd if=/dev/zero of=/dev/sdX bs=1024 count=1024000 (steps 4-6 are combined in the sketch after this list)
  5. Remove local Rook artifacts with rm -rf /var/lib/rook/ on all storage nodes
  6. Remove local-storage artifacts with rm -rf /mnt/local-storage on all storage nodes
  7. Reboot all storage nodes
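
Steps 4-6 have to be executed on every storage node; a rough per-node sketch using oc debug (alternatively via SSH to the nodes). Node names are taken from the pod listing above, and /dev/sdX is a placeholder for the actual OSD device:

for node in inf44 inf7 inf8; do
  oc debug node/$node -- chroot /host /bin/bash -c '
    dd if=/dev/zero of=/dev/sdX bs=1024 count=1024000   # wipe the OSD device (replace sdX)
    rm -rf /var/lib/rook/                               # leftover Rook metadata
    rm -rf /mnt/local-storage                           # local-storage symlinks
  '
done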

ODF is running!