Javatar81 closed this issue 2 years ago
Heads up @cluster/ocp4-admin - the "cluster/ocp4" label was applied to this issue.
can you ping the nodes from the bastion host?
Storm3 hypervisor host is close to overload:
I am shutting down the ocp1 cluster to free up some resources at the hypervisor level.
compute1+2 are in a strange state - I will reboot them now.
Please be careful with the OCS.
compute0 looks good, so we should be fine. We probably lost quorum already, but I will reboot them sequentially, not in parallel. Thanks for reminding me!
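The "sequential, not parallel" reboot above can be sketched as a small loop: reboot one node, wait until it reports Ready again, then move on to the next. The `reboot` and `is_ready` callbacks are hypothetical placeholders (e.g. wrappers around the RHEV API or `oc get node`), not real APIs from this setup:

```python
import time

def reboot_sequentially(nodes, reboot, is_ready, timeout=900, poll=15):
    """Reboot nodes one at a time, waiting for each to come back Ready.

    reboot(node) and is_ready(node) are caller-supplied callbacks
    (hypothetical here); timeout/poll are seconds.
    """
    for node in nodes:
        reboot(node)
        deadline = time.monotonic() + timeout
        while not is_ready(node):
            if time.monotonic() > deadline:
                raise TimeoutError(f"{node} did not come back Ready in time")
            time.sleep(poll)

# e.g. reboot_sequentially(["compute1", "compute2"], reboot, is_ready)
```

Waiting for Ready between reboots is what keeps OCS from losing more than one replica at a time.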
Something is running amok - can't ssh into the node from the bastion, although ping is working. Graceful shutdown is also not working. Last time I saw this, it was some Elasticsearch workload consuming all memory. RHEV shows CPU 40%, mem 43%.
We have an infra issue here. The worker node VMs have 16 cores, and all nodes are pinned to the storm3 physical host due to OCS. That host has 36 physical cores, so we are overcommitted on RHEV AND OpenShift. Even in "idle" mode, each node consumes around 10 cores, which brings us close to the total available. If something goes wild, we are in trouble and see an unstable cluster.
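The arithmetic above, assuming three 16-vCPU worker VMs pinned to storm3 (compute0-2; all numbers taken from this thread), works out roughly to:

```python
# Rough overcommit arithmetic for storm3 (assumption: three 16-vCPU
# worker VMs pinned to the host; figures quoted in this thread).
PHYS_CORES = 36
WORKERS = 3
VCPUS_PER_WORKER = 16
IDLE_CORES_PER_WORKER = 10

allocated = WORKERS * VCPUS_PER_WORKER        # 48 vCPUs allocated
overcommit = allocated / PHYS_CORES           # 48 / 36 ≈ 1.33x
idle_usage = WORKERS * IDLE_CORES_PER_WORKER  # ~30 of 36 cores busy at "idle"
headroom = PHYS_CORES - idle_usage            # only ~6 cores of slack

print(f"allocated={allocated} vCPUs, overcommit={overcommit:.2f}x, "
      f"idle ~{idle_usage}/{PHYS_CORES} cores, headroom ~{headroom} cores")
```

With only ~6 cores of slack, any single runaway workload (like the Elasticsearch case mentioned above) can push the host over the edge, which is consistent with option b below (shrinking the VMs to 10 cores would cap allocation at 30 of 36 cores).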
Options: a: remove workload. Best candidate: LOGGING - is that really needed?
b: change the VM config from 16 cores to 10 cores, to avoid overloading the hypervisor and also let users know when we are above limits.
c: use NVMe disks from storm2 and/or storm6 to better distribute the worker nodes across physical hosts (the current placement is actually an artefact from a time when we had only one physical host). That should be quite easy, as the disks are just PCI-passthrough devices, and OCS will re-create the data if we move one host at a time.
I'll let @Javatar81 as cluster admin decide what to do; actually, I think all of the above would make sense.
I would like to add d as a fourth option: delete ocp3 and add the freed resources in the form of additional nodes to ocp4. Robert had this idea a few weeks ago, and I think we do not need two permanent clusters up and running. What do you think?
Hm, I think we have a couple of demo / use cases where we do want to have workload spread across several clusters. So we would need two more or less stable clusters in the long term.
The cluster seems to be stable currently, so I am closing this issue for now. Let's restart the discussion when needed.
We seem to have network problems with the compute nodes. This leads to pods stuck terminating indefinitely on those nodes.