stormshift / support

This repo should serve as a central source for reporting issues with stormshift

Severe storage issues on OCP5 (ODF and etcd) #117

Closed stefan-bergstein closed 1 year ago

stefan-bergstein commented 1 year ago

Data Foundation is degraded. Many pods are failing:

odf-operator-controller-manager-6f66945ccc-bkkhg                  1/2     CrashLoopBackOff              1359 (66s ago)     5d21h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-6fb76b56bmvx8   0/2     Error                         0                  5d21h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-6fb76b56lhr6v   1/2     CrashLoopBackOff              1461 (4m37s ago)   3d17h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-749f4bd5dfxtv   1/2     CrashLoopBackOff              15 (100s ago)      46m
rook-ceph-mgr-a-7c9b988547-jcc4f                                  1/2     CrashLoopBackOff              2308 (93s ago)     5d21h
rook-ceph-mon-a-84fd9c688b-cgzxz                                  1/2     CrashLoopBackOff              1664 (3m16s ago)   5d21h
rook-ceph-mon-c-6bb997bb59-h8lnv                                  1/2     CrashLoopBackOff              2267 (109s ago)    32d
rook-ceph-osd-0-8c787c959-fg58g                                   1/2     CrashLoopBackOff              2091 (3m50s ago)   32d
rook-ceph-osd-1-c9c6f86f4-lfllv                                   1/2     CrashLoopBackOff              2045 (3m29s ago)   32d
rook-ceph-osd-2-c7cc75fdb-khnlw                                   1/2     CrashLoopBackOff              2050 (60s ago)     17d
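
For reference, a listing like the one above can be pulled with standard oc commands; the openshift-storage namespace below is the ODF default and is an assumption about this cluster:

```
# list ODF / rook-ceph pods that are not running cleanly
oc get pods -n openshift-storage --no-headers | grep -Ev 'Running|Completed'

# restart counts and last termination reason for a specific pod
oc describe pod -n openshift-storage rook-ceph-mon-a-84fd9c688b-cgzxz
```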

etcd on control node 1 is failing:

panic: freepages: failed to get all reachable pages (page 4065394140482466913: out of bounds: 189068)
goroutine 92 [running]:
go.etcd.io/bbolt.(*DB).freepages.func2(0xc0000140c0)
/remote-source/cachito-gomod-with-deps/deps/gomod/pkg/mod/go.etcd.io/bbolt@v1.3.6/db.go:1056 +0xe9
created by go.etcd.io/bbolt.(*DB).freepages
/remote-source/cachito-gomod-with-deps/deps/gomod/pkg/mod/go.etcd.io/bbolt@v1.3.6/db.go:1054 +0x1cd
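
For anyone retracing this, the panic above is the kind of output you see in the etcd container logs on the affected control plane node; the pod name below follows the etcd-&lt;node&gt; naming pattern and the node name is an assumption:

```
# etcd static pods are named after their node
oc get pods -n openshift-etcd -o wide

# logs of the crashing member (node name is an assumption)
oc logs -n openshift-etcd etcd-ocp5-control-1 -c etcd --previous

# overall etcd operator status
oc get co etcd
```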
github-actions[bot] commented 1 year ago

Heads up @cluster/ocp5-admin - the "cluster/ocp5" label was applied to this issue.

DanielFroehlich commented 1 year ago

I can't see the etcd error any more. It may have been caused by a storage migration from NFS on storm3 to NFS on CoeApp, which I did using live migration in that time frame.

OCS: The mons are crashlooping with:

debug 2023-01-09T10:30:09.469+0000 7fb341df1700 -1 rocksdb: Corruption: Bad table magic number: expected 9863518390377041911, found 7233105621517887860 in /var/lib/ceph/mon/ceph-a/store.db/368171.sst
debug 2023-01-09T10:30:09.469+0000 7fb341df1700 -1 error opening mon data directory at '/var/lib/ceph/mon/ceph-a': (22) Invalid argument
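
A rough sketch of how to pull the same information (the container name mon matches the rook-ceph mon pods; the toolbox deployment is only present if it has been created):

```
# previous logs of the crashlooping mon container
oc logs -n openshift-storage rook-ceph-mon-a-84fd9c688b-cgzxz -c mon --previous

# cluster health from the rook-ceph toolbox, if one is deployed
oc rsh -n openshift-storage deploy/rook-ceph-tools ceph status
```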

The OSDs are also crashlooping, the osd container is not starting, and I can't see any logs. Strange. Tried a reboot... no help.

--> Storage SME needed, or re-deploy ODF (don't forget to wipefs the NVMes if you do so, see the sketch below)
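
If we go the re-deploy route, a minimal sketch of the wipe step; node and device names are assumptions and must be verified with lsblk before wiping anything:

```
# check which devices carry the old ceph metadata (node name is an assumption)
oc debug node/ocp5-worker-1 -- chroot /host lsblk -f

# wipe the old signatures so ODF can re-consume the disk (device is an assumption)
oc debug node/ocp5-worker-1 -- chroot /host wipefs --all --force /dev/nvme0n1
```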

DanielFroehlich commented 1 year ago

Storage SME (matm) assessment:

```
OK, at 2022-12-18 08:34 (system time of ocp5-control-2) something happened that caused osd.0 to be killed. After that, the other OSDs had slow requests until they died themselves. mon-c has a corruption in the sst file of its DB around 2022-12-18T10:00:3; mon-a gets killed at 2022-12-18T10:59:13.825+0000.

Mons a and c have corrupt sst files and cannot start with them. As a result, mon-d cannot reach quorum. So the cluster is dead.
```

--> mon-a and mon-c data files manually deleted, mon pods also deleted --> the system recovers from there on its own.
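
For the record, a sketch of that recovery step, assuming the mons are backed by host paths under /var/lib/rook (with PVC-backed mons the data would live on the mon PVC instead):

```
# remove the corrupted mon store on the node hosting mon-a
# (path and node placeholder are assumptions - verify before deleting)
oc debug node/<node-running-mon-a> -- chroot /host rm -rf /var/lib/rook/mon-a

# delete the crashlooping mon pod so the rook operator re-creates it with a fresh store
oc delete pod -n openshift-storage rook-ceph-mon-a-84fd9c688b-cgzxz
```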

The control-1 etcd member is also crashlooping. Logs show a corrupt etcd database. Manually deleted the etcd data filesystem on control-1 and, from there, followed the procedure documented in https://access.redhat.com/solutions/6962106
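
The linked article is behind the customer portal, so as a pointer for others: the general shape of the documented "replace an unhealthy etcd member" flow looks roughly like this (node names are assumptions):

```
# from a healthy etcd member, remove the broken one
oc rsh -n openshift-etcd etcd-ocp5-control-2   # healthy member (name is an assumption)
etcdctl member list -w table
etcdctl member remove <member-id-of-control-1>
exit

# delete the stale secrets of the removed member so they get regenerated
oc delete secret -n openshift-etcd etcd-peer-ocp5-control-1 \
  etcd-serving-ocp5-control-1 etcd-serving-metrics-ocp5-control-1

# force the etcd operator to redeploy, which re-adds the member
oc patch etcd cluster --type=merge \
  -p '{"spec": {"forceRedeploymentReason": "single-member-recovery-'"$(date +%s)"'"}}'
```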

Seems to be happy again (see the attached screenshot).

DanielFroehlich commented 1 year ago

@stefan-bergstein, please check and close the issue if you are also happy again.

DanielFroehlich commented 1 year ago

OCP5 looks healthy, closing this issue