stormshift / support

This repo should serve as a central source for reporting issues with stormshift
GNU General Public License v3.0

OCP5 unresponsive, control plane seems to run havoc #139

Closed DanielFroehlich closed 9 months ago

DanielFroehlich commented 12 months ago

Console not reachable; 2 control nodes showing 100% memory / CPU in RHEV:

(screenshot: RHEV showing the control nodes at 100% memory / CPU)

Also saw tons of I/O on the NetApp from OCP5. Moved the disks to storm3 to offload the NetApp. Still an issue.

github-actions[bot] commented 12 months ago

Heads up @cluster/ocp5-admin - the "cluster/ocp5" label was applied to this issue.

DanielFroehlich commented 12 months ago

@rbo would you mind doing your magic?

rbo commented 12 months ago

ocp5-control-0 and ocp5-control-1 do not respond via SSH.

API is not available:

E0630 17:47:14.570131  662718 memcache.go:238] couldn't get current server API group list: Get "https://api.ocp5.stormshift.coe.muc.redhat.com:6443/api?timeout=32s": dial tcp 10.32.105.45:6443: connect: connection refused
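
For reference, a minimal way to confirm whether the API endpoint itself is reachable (a sketch; the hostname is taken from the error above, and /readyz is the standard kube-apiserver health endpoint):

# direct check against the API VIP, skipping cert verification (returns "ok" when healthy)
curl -k https://api.ocp5.stormshift.coe.muc.redhat.com:6443/readyz
# same check through oc, with a short timeout so it fails fast
oc get --raw /readyz --request-timeout=5s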

Console looks good. => Rebooting ocp5-control-0 and ocp5-control-1 via RHEV.

rbo commented 12 months ago

/cc @stefan-bergstein

rbo commented 12 months ago

Doubled CPU & memory of the control plane nodes, rebooted all of them, and removed mastersSchedulable (a sketch of the command follows the node listing below).

NAME                 STATUS   ROLES    AGE    VERSION
ocp5-control-0       Ready    master   370d   v1.25.8+37a9a08
ocp5-control-1       Ready    master   370d   v1.25.8+37a9a08
ocp5-control-2       Ready    master   370d   v1.25.8+37a9a08
ucs-blade-server-5   Ready    worker   37d    v1.25.8+37a9a08
ucs-blade-server-6   Ready    worker   37d    v1.25.8+37a9a08
ucs-blade-server-7   Ready    worker   142d   v1.25.8+37a9a08
ucs-blade-server-8   Ready    worker   143d   v1.25.8+37a9a08
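
For reference, a minimal sketch of how mastersSchedulable is switched off so regular workloads stop landing on the control plane (assuming the default scheduler config object named cluster):

# with mastersSchedulable=false the control plane nodes carry the
# node-role.kubernetes.io/master:NoSchedule taint again
oc patch schedulers.config.openshift.io cluster --type merge -p '{"spec":{"mastersSchedulable":false}}'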

rbo commented 12 months ago

Hope that stabilizes the control plane a bit.

rbo commented 12 months ago

Let's wait a bit... give OpenShift time to recover.

rbo commented 12 months ago

Disk device /dev/nvme0n1 not accessible on host ocp5-control-0. - hm, not good.
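
A quick way to check whether the passthrough NVMe device is actually visible on the node (a sketch; node and device names are taken from the alert above):

# run lsblk against the host filesystem of the affected control node
oc debug node/ocp5-control-0 -- chroot /host lsblk /dev/nvme0n1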

rbo commented 12 months ago

Drained all the control plane nodes to remove the "worker" workload. Changed the CPU & RAM back; it will be applied at the next reboot of the VMs.
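
A minimal sketch of the drain, per node (flag names as in recent oc clients; uncordon afterwards so the node can take pods again):

# evict everything except daemonsets, including pods with emptyDir data
oc adm drain ocp5-control-0 --ignore-daemonsets --delete-emptydir-data
# make the node schedulable again once done
oc adm uncordon ocp5-control-0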

Cluster looks good to me.

(screenshot: cluster status)

rbo commented 12 months ago

@stefan-bergstein please have a look and close the ticket.

stefan-bergstein commented 12 months ago

Removing mastersSchedulable made ODF unschedulable, because ODF must run on the control nodes: the NVMe devices are attached to the control-node VMs via PCI passthrough.
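
For context, a sketch of how ODF is usually pinned to specific nodes: the nodes carry the documented ODF storage label, and with mastersSchedulable off the ODF pods additionally need to tolerate the control-plane NoSchedule taint (node names assumed from the listing above):

# mark the control nodes as ODF storage nodes
oc label node ocp5-control-0 cluster.ocs.openshift.io/openshift-storage=''
oc label node ocp5-control-1 cluster.ocs.openshift.io/openshift-storage=''
oc label node ocp5-control-2 cluster.ocs.openshift.io/openshift-storage=''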

stefan-bergstein commented 12 months ago

ODF recovered, but control nodes running ODF should have 40 GB (16+24) of memory and 24 vCPUs (16+8) each. Anyway, let's still try with limited resources, because there should not be a lot of load on the cluster:

(screenshot: node resources)
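
A quick way to compare those numbers against what a control node currently offers and has already been asked for (a sketch using standard oc output):

# allocatable CPU and memory on one control node
oc get node ocp5-control-0 -o jsonpath='{.status.allocatable.cpu}{" "}{.status.allocatable.memory}{"\n"}'
# summary of requests/limits already scheduled onto it
oc describe node ocp5-control-0 | grep -A 8 'Allocated resources'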

stefan-bergstein commented 12 months ago

@rbo please have a look at my comments and let me know if this approach is okay.

stefan-bergstein commented 12 months ago

The taint on the control nodes was not that great. Removed it:

  taints:
    - key: node.ocs.openshift.io/storage
      value: 'true'
      effect: NoSchedule
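
For reference, the equivalent one-liner for removing that taint (the trailing dash after the effect removes a taint in oc/kubectl; node names assumed from the listing above):

oc adm taint nodes ocp5-control-0 ocp5-control-1 ocp5-control-2 node.ocs.openshift.io/storage=true:NoSchedule-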

DanielFroehlich commented 9 months ago

OCP5 looks healthy, closing this issue.