Heads up @cluster/ocp5-admin - the "cluster/ocp5" label was applied to this issue.
@rbo would you mind doing your magic?
ocp5-control-0 and ocp5-control-1 do not respond via SSH.
API is not available:
E0630 17:47:14.570131 662718 memcache.go:238] couldn't get current server API group list: Get "https://api.ocp5.stormshift.coe.muc.redhat.com:6443/api?timeout=32s": dial tcp 10.32.105.45:6443: connect: connection refused
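A quick sanity check here (a sketch; assumes curl is run from a host with network reach to the API VIP) is to probe the API server's health endpoint directly:

# Probe the apiserver health endpoint; a plain "ok" means the apiserver itself is up
curl -k https://api.ocp5.stormshift.coe.muc.redhat.com:6443/healthz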
Console looks good. => Rebooting ocp5-control-0 and ocp5-control-1 via RHEV.
/cc @stefan-bergstein
Doubled CPU & memory of the control plane nodes, rebooted all of them, and removed mastersSchedulable.
NAME                 STATUS   ROLES    AGE    VERSION
ocp5-control-0       Ready    master   370d   v1.25.8+37a9a08
ocp5-control-1       Ready    master   370d   v1.25.8+37a9a08
ocp5-control-2       Ready    master   370d   v1.25.8+37a9a08
ucs-blade-server-5   Ready    worker   37d    v1.25.8+37a9a08
ucs-blade-server-6   Ready    worker   37d    v1.25.8+37a9a08
ucs-blade-server-7   Ready    worker   142d   v1.25.8+37a9a08
ucs-blade-server-8   Ready    worker   143d   v1.25.8+37a9a08
Hope that stabilizes the control plane a bit.
Let's wait a bit... give OpenShift time to recover.
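One way to watch the recovery (just the standard status commands, nothing cluster-specific) is to keep an eye on the cluster operators and nodes:

# All operators should eventually report Available=True and not Degraded
oc get clusteroperators
oc get nodes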
Disk device /dev/nvme0n1 not accessible on host ocp5-control-0. - hm, not good.
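To confirm whether the device is visible from the node itself, something like the following should work (assuming the node is healthy enough to schedule a debug pod):

# Open a debug shell on the node and list the NVMe device from the host namespace
oc debug node/ocp5-control-0 -- chroot /host lsblk /dev/nvme0n1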
Drained all control plane nodes to remove the "worker" workload. Changed the CPU & RAM back; the change will take effect at the next reboot of the VMs.
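For reference, the drain was along these lines (a sketch of the standard invocation, not the exact command used):

# Evict workloads from each control plane node; daemonsets stay in place
oc get nodes | grep ocp5-control- | awk '{print $1}' | while read f; do
  oc adm drain $f --ignore-daemonsets --delete-emptydir-data
done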
Cluster looks good to me.
@stefan-bergstein please take a look and close the ticket.
Removing mastersSchedulable made ODF unschedulable: ODF must run on the control nodes, because the NVMe devices are attached to the control node VMs via PCI passthrough:
oc get nodes | grep ocp5-control- | awk '{print $1}' | while read f; do echo $f; oc label node $f cluster.ocs.openshift.io/openshift-storage=""; done
oc get nodes | grep ocp5-control- | awk '{print $1}' | while read f; do echo $f; oc adm taint node $f node.ocs.openshift.io/storage="true":NoSchedule; done
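To verify that the label and taint landed (assumed spot checks, not from the original commands):

# Nodes carrying the ODF storage label
oc get nodes -l cluster.ocs.openshift.io/openshift-storage=
# Taints on one of the control nodes
oc describe node ocp5-control-0 | grep -A2 Taints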
oc edit schedulers.config.openshift.io cluster
-> mastersSchedulable: true
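The non-interactive equivalent of that edit (same setting, applied via patch instead of an editor) would be:

# Set mastersSchedulable without opening an editor
oc patch schedulers.config.openshift.io cluster --type merge -p '{"spec":{"mastersSchedulable":true}}'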
ODF recovered, but control nodes running ODF should have 40 GB (16+24) of memory and 24 vCPUs (16+8) each.
Anyway, let's still try with limited resources, because there should not be a lot of load on the cluster:
@rbo please have a look at my comments and let me know if this approach is okay.
The taint on the control nodes was not that great. Removed it:
taints:
- key: node.ocs.openshift.io/storage
  value: 'true'
  effect: NoSchedule
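Removing it can be done with the same taint command plus a trailing '-' (this mirrors the loop used to add it):

# The trailing '-' removes the matching taint from each control node
oc get nodes | grep ocp5-control- | awk '{print $1}' | while read f; do
  oc adm taint node $f node.ocs.openshift.io/storage=true:NoSchedule-
done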
OCP5 looks healthy, closing this issue
Console not reachable; 2 control nodes showing 100% memory / CPU in RHEV:
Also saw tons of I/O on the NetApp from OCP5. Moved the disks to storm3 to offload the NetApp. Still an issue. I