stormshift / support

This repo should serve as a central source for reporting issues with stormshift
GNU General Public License v3.0

OCP4 Network Problems #53

Closed: Javatar81 closed this issue 2 years ago

Javatar81 commented 2 years ago

We seem to have network problems with the compute nodes:

Connectivity outage detected: network-check-target-gpu: failed to establish a TCP connection to 10.130.2.12:8080: dial tcp 10.130.2.12:8080: connect: connection refused

Connectivity restored after 9m0.420050088s: network-check-target-gpu: tcp connection to 10.130.2.7:8080 succeeded

This leads to pods stuck terminating indefinitely on the nodes.
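
For reference, a few commands that might help narrow this down (a rough sketch; the pod/namespace placeholders are just examples, and it assumes the default openshift-network-diagnostics setup):

oc get podnetworkconnectivitycheck -n openshift-network-diagnostics   # the checks behind the "Connectivity outage" messages
oc get pods -A -o wide | grep Terminating                             # stuck pods and the nodes they are on
oc describe pod <pod> -n <namespace>                                  # events and finalizers of one stuck pod (placeholder names)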

github-actions[bot] commented 2 years ago

Heads up @cluster/ocp4-admin - the "cluster/ocp4" label was applied to this issue.

DanielFroehlich commented 2 years ago

Can you ping the nodes from the bastion host?
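
Something like this from the bastion would be a quick check (compute0-2 as node names is an assumption, adjust to the real hostnames):

for node in compute0 compute1 compute2; do
  ping -c 3 "$node" >/dev/null && echo "$node: reachable" || echo "$node: NOT reachable"   # 3 pings per node
done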

DanielFroehlich commented 2 years ago

Storm3 hypervisor host is close to overload (see attached screenshot).
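
A rough way to confirm this from the host itself, independent of the RHEV UI (assuming root ssh access to storm3; a sketch, not the exact check behind the screenshot):

ssh root@storm3 'uptime; free -h'       # load average and memory on the hypervisor
ssh root@storm3 'virsh -r list --all'   # VMs on this host, via a read-only libvirt connection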

DanielFroehlich commented 2 years ago

I am shutting down the ocp1 cluster to free up some resources at the hypervisor level.

DanielFroehlich commented 2 years ago

compute1+2 are in a strange state - I will initiate a reboot of them now.
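
For the record, the sequence I have in mind per node (a sketch; compute1 as example, and --delete-emptydir-data may be --delete-local-data on older oc versions):

oc adm cordon compute1                                               # stop new pods from landing on the node
oc adm drain compute1 --ignore-daemonsets --delete-emptydir-data     # evict workloads; may hang if PodDisruptionBudgets block it
ssh core@compute1 'sudo systemctl reboot'                            # reboot the RHCOS node
oc adm uncordon compute1                                             # re-enable scheduling once the node is Ready again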

rbo commented 2 years ago

Please be careful with the OCS.

DanielFroehlich commented 2 years ago

compute0 looks good, so we should be fine. We probably lost quorum already, but I will reboot them sequentially, not in parallel. Thanks for reminding me!
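
To keep an eye on quorum between the reboots, roughly (assuming the rook-ceph toolbox pod is enabled in openshift-storage; labels and names may differ):

TOOLS=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name)
oc -n openshift-storage rsh "$TOOLS" ceph status                          # overall health, including mon quorum
oc -n openshift-storage rsh "$TOOLS" ceph quorum_status -f json-pretty    # which mons are currently in quorum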

Something is running amok - I can ssh into the node from the bastion and ping is working, but graceful shutdown is not working. Last time I saw this it was some Elasticsearch stuff consuming all memory. RHEV shows CPU at 40%, memory at 43%.
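
To find what is consuming the memory this time, a rough starting point (assuming cluster monitoring is healthy enough for oc adm top to return data):

oc adm top nodes                                                          # per-node CPU and memory usage
oc adm top pods --all-namespaces --no-headers | sort -k4 -h | tail -15    # pods sorted by memory (4th column), biggest consumers last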

DanielFroehlich commented 2 years ago

We have an infra issue here. The worker node VMs have 16 cores, and all nodes are pinned to the storm3 phys host due to OCS. That host has 36 phys cores. So we are overcommitted on RHEV AND OpenShift. Even in "idle" mode, each node is consuming around 10 cores, which brings us close to the total available. If something goes wild, we are in trouble and see an unstable cluster.
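
Rough numbers, assuming the three worker VMs (compute0-2) mentioned above:

3 workers x 16 vCPUs = 48 vCPUs pinned to storm3 vs. 36 physical cores -> about 1.3x overcommitted
3 workers x ~10 cores at idle = ~30 cores                              -> over 80% of the host consumed before any load spike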

Options: a: remove workload. Best candidate: LOGGING - is that really needed?

b: change the VM config from 16 cores to 10 cores, to avoid overloading the hypervisor and also let users know when we are above limits.

c: use NVMe disks from storm2 and/or storm6 to better distribute the worker nodes across phys hosts (the current layout is actually an artefact from a time when we had only one phys host). That should actually be quite easy, as they are just PCI passed through, and OCS will re-create the data if we move one host at a time.

I will let @Javatar81 as cluster admin decide what to do; actually, I think all of the above would make sense.

Javatar81 commented 2 years ago

I would like to add d as a fourth option: delete ocp3 and add the freed resources, in the form of additional nodes, to ocp4. Robert had this idea some weeks ago, and I think we do not need two permanent clusters up and running. What do you think?

DanielFroehlich commented 2 years ago

Hm, I think we have a couple of demo / use cases where we do want to have workload spread across several clusters. So we would need two more or less stable clusters in the long term.

DanielFroehlich commented 2 years ago

The cluster seems to be stable currently, so I am closing this issue for now. Let's restart the discussion when needed.