Closed · stefan-bergstein closed this issue 1 year ago
Heads up @cluster/ocp5-admin - the "cluster/ocp5" label was applied to this issue.
Can't see the etcd error any more. Maybe that was due to a storage migration from NFS on storm3 to NFS on CoeApp, which I did using live migration in that time frame.
OCS: Mons are crash-looping with:
```
debug 2023-01-09T10:30:09.469+0000 7fb341df1700 -1 rocksdb: Corruption: Bad table magic number: expected 9863518390377041911, found 7233105621517887860 in /var/lib/ceph/mon/ceph-a/store.db/368171.sst
debug 2023-01-09T10:30:09.469+0000 7fb341df1700 -1 error opening mon data directory at '/var/lib/ceph/mon/ceph-a': (22) Invalid argument
```
OSDs are also crash-looping; the OSDs are not starting and I can't see any logs. Strange. Tried a reboot... no help.
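For reference, a minimal way to inspect the crash-looping daemons (the selectors assume the default openshift-storage namespace and Rook's standard pod labels; the pod suffix is a placeholder):
```
# List the Ceph mon and OSD pods
oc -n openshift-storage get pods -l app=rook-ceph-mon
oc -n openshift-storage get pods -l app=rook-ceph-osd

# For a crash-looping pod, the previous (crashed) container's logs are
# usually more telling than the current attempt
oc -n openshift-storage logs rook-ceph-mon-a-<pod-suffix> --previous
```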
--> Storage SME needed, or re-deploy ODF (don't forget to wipefs the NVMes if you do so)
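If re-deploying, a rough sketch of the wipe step (node and device names are examples only; check the actual LocalVolume config for the real NVMe paths first):
```
# Run on every storage node; /dev/nvme0n1 is a placeholder device
oc debug node/ocp5-control-2 -- chroot /host sh -c \
  'wipefs --all --force /dev/nvme0n1 && sgdisk --zap-all /dev/nvme0n1'
```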
Storage SME (matm) assessment:
```
OK, on 2022-12-18 at 08:34 (system time of ocp5-control-2) something happened that caused osd.0 to be killed. After that, the other OSDs accumulated slow requests until they died themselves. mon-c has a corruption in the sst file of its DB around: 2022-12-18T10:00:3; mon-a gets killed at 2022-12-18T10:59:13.825+0000.
Mons a and c have corrupt sst files and therefore cannot start. As a result, mon-d cannot reach quorum. So the cluster is dead.
```
--> mon-a and mon-c data files manually deleted, mon pods also deleted --> system recovers from there on its own.
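Roughly what that cleanup looks like, assuming Rook's default dataDirHostPath of /var/lib/rook (node names are placeholders; verify the paths on the actual hosts before deleting anything):
```
# Remove the corrupt mon stores on the hosts (Rook default host path assumed)
oc debug node/<node-with-mon-a> -- chroot /host rm -rf /var/lib/rook/mon-a/data
oc debug node/<node-with-mon-c> -- chroot /host rm -rf /var/lib/rook/mon-c/data

# Delete the mon pods so the operator recreates them and they resync
oc -n openshift-storage delete pod rook-ceph-mon-a-<suffix> rook-ceph-mon-c-<suffix>
```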
The control-1 etcd member is also crash-looping. Logs show a corrupt etcd database. Manually deleted the etcd data filesystem on control-1. From there, followed the procedure as documented in https://access.redhat.com/solutions/6962106
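The procedure essentially removes the unhealthy member so the operator re-adds it; roughly (pod name, member ID, and secret names below are placeholders, the KCS article is authoritative):
```
# From a healthy etcd pod, identify and remove the broken member
oc -n openshift-etcd rsh etcd-<healthy-control-node>
etcdctl member list -w table
etcdctl member remove <member-id-of-control-1>
exit

# Delete the stale secrets of the removed member so they are regenerated
oc -n openshift-etcd get secrets | grep control-1
oc -n openshift-etcd delete secret etcd-peer-<node> etcd-serving-<node> etcd-serving-metrics-<node>

# Force a redeployment so the member rejoins cleanly
oc patch etcd cluster --type=merge -p '{"spec": {"forceRedeploymentReason": "recovery-'"$(date +%s)"'"}}'
```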
Seems to be happy again.
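A quick way to double-check, assuming the ODF toolbox deployment is enabled:
```
# All cluster operators should be Available and not Degraded
oc get clusteroperators

# Ceph itself should report HEALTH_OK
oc -n openshift-storage rsh deploy/rook-ceph-tools ceph status
```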
@stefan-bergstein, please check and close the issue if you are also happy again.
OCP5 looks healthy, closing this issue
Data Foundation Degraded. Many pods are failing:
etcd on control node 1 is failing:
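A minimal sketch of how to enumerate the failing pods (namespaces assume a default ODF install):
```
# Pods not in Running phase in the storage and etcd namespaces
oc -n openshift-storage get pods --field-selector=status.phase!=Running
oc -n openshift-etcd get pods
```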