coleshaw opened 2 years ago
Okay, just did another production deploy, and getting Prometheus errors:
[FIRING:1] (Monitored Instance(s) not responding 1 10.0.4.186:3000 etna critical)
It appears that the docker daemon on dsco1 may not be responding? I wonder if our deploy process somehow overloads the docker daemons, because it tries restarting a lot of containers at the same time?
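For next time, a quick way to confirm whether the daemon itself is hung rather than just slow (assuming SSH access to the node and systemd-managed docker):

```bash
# Does the daemon answer at all within 10s?
ssh dsco1 'timeout 10 docker info > /dev/null && echo responsive || echo "not responding"'
# systemd's view of the service, plus the last hour of daemon logs
ssh dsco1 'systemctl status docker --no-pager | head -n 20'
ssh dsco1 'journalctl -u docker --since "1 hour ago" --no-pager | tail -n 50'
```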
Hm, also note that the service agents_pruner has not run correctly on dsco1 since 9/16 -- it failed with shutdown status, attempted to run again on 9/18, and is stuck on starting. I wonder if the system doesn't get cleaned up correctly, and then gets stuck?
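To see the pruner's task history (the failed 9/16 run and the stuck 9/18 one), something like this from a manager node should show per-task state and any error messages:

```bash
# Full task history for the service, without truncating error messages
docker service ps agents_pruner --no-trunc
# Dig into the stuck task's status (task ID taken from the output above; run on a manager)
docker inspect <task-id> --format '{{json .Status}}'
```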
Running the pruner more frequently (every 12h), to see if that helps. Also restarted docker on dsco1 ...
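For the daemon restart, roughly this sequence (assuming we can afford to drain the node first so the scheduler moves its tasks instead of them dying mid-restart):

```bash
# From a manager: stop scheduling onto dsco1 and move its tasks elsewhere
docker node update --availability drain dsco1
# On dsco1 itself (assuming systemd-managed docker)
sudo systemctl restart docker
# From a manager: put the node back into rotation
docker node update --availability active dsco1
```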
Just scanning through the logs of the completed pruner jobs, it seems like some nodes do wind up with a lot of images / containers to clean (anywhere from 0GB to 3.5GB to 8.8GB), even when running every 12 hours. Many seem to be archimedes, archimedes-node, and polyphemus ones. Perhaps all the Airflow run_on_docker jobs, plus the archimedes ones? Maybe if there were too many at the 24hr mark, they would freeze the prune job?
There also seems to be an unbalanced distribution of these ephemeral jobs across nodes, so perhaps that leads to the overloaded nodes hanging?
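For reference, I'm assuming agents_pruner boils down to docker's prune commands; a rough sketch of what a 12h cleanup pass would amount to (the exact flags/filters it actually uses may differ):

```bash
# Remove stopped containers, unused images, and unused networks older than 12h
docker container prune --force --filter "until=12h"
docker image prune --all --force --filter "until=12h"
docker network prune --force --filter "until=12h"
```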
Okay, looks like Magma was down all weekend, and similarly looks like vulcan-node's docker daemon may have frozen since then. The clean-up task on 10/1 got stuck:
Errors on the server for the docker daemon are:
Oct 03 04:58:37 vulcan-node dockerd[30124]: time="2022-10-03T04:58:37.412905793-07:00" level=error msg="Failed to join memberlist [192.168.2.198 192.168.2.159] on retry:
Oct 03 04:59:10 vulcan-node dockerd[30124]: time="2022-10-03T04:59:10.735372427-07:00" level=warning msg="NetworkDB stats vulcan-node(39c1ad69adad) - healthscore:7 (conn
Oct 03 04:59:28 vulcan-node dockerd[30124]: time="2022-10-03T04:59:28.339058994-07:00" level=error msg="Failed to join memberlist [192.168.2.198 192.168.2.159] on retry:
Oct 03 04:59:29 vulcan-node dockerd[30124]: time="2022-10-03T04:59:29.339013826-07:00" level=error msg="Failed to join memberlist [192.168.2.198 192.168.2.159] on retry:
Oct 03 04:59:30 vulcan-node dockerd[30124]: time="2022-10-03T04:59:30.344490524-07:00" level=error msg="Failed to join memberlist [192.168.2.198 192.168.2.159] on retry: 2 errors occurred:\n\t* Failed to join 192.168.2.198: No installed keys could decrypt the message\n\t* Failed to join 192.168.2.159: No installed keys could decrypt th
Note that the IP address given for 159 is medusa, and 198 is monitor / prometheus ... so this node is somehow not able to join the swarm or contact the swarm managers? Will restart the node and investigate the keys issue, maybe I can better pinpoint the root cause ...
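If a plain restart doesn't clear the key mismatch, the heavier option is to have vulcan-node leave and rejoin the swarm so it picks up fresh gossip keys (a sketch only; the join token and manager address come from a healthy manager):

```bash
# On vulcan-node: force-leave, since it still thinks it is a member
docker swarm leave --force
# On a healthy manager: print the current worker join command
docker swarm join-token worker
# Back on vulcan-node: run the join command printed above
docker swarm join --token <worker-token> <manager-ip>:2377
```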
Hm, and it also looks like medusa switched its swarm cert / key on Oct 1:
Oct 1 23:00 swarm-node.crt
Even though it's not currently the leader (prometheus-dsco is). I wonder if it's related to leader rotations? Something like this GH issue?
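To compare when each node's cert was last rotated (the path below is the default swarm certificate location, and the info template field may vary by docker version):

```bash
# On each node: validity window and subject of the node's swarm TLS cert
sudo openssl x509 -noout -dates -subject \
  -in /var/lib/docker/swarm/certificates/swarm-node.crt
# From a manager: the configured node cert expiry (rotation interval)
docker info --format '{{.Swarm.Cluster.Spec.CAConfig.NodeCertExpiry}}'
```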
A little more context around why I want to cleanly separate out our manager and worker nodes, is this section of the swarm docs:
By default manager nodes also act as worker nodes. This means the scheduler can assign tasks to a manager node. For small and non-critical swarms, assigning tasks to managers is relatively low-risk as long as you schedule services using resource constraints for cpu and memory.
However, because manager nodes use the Raft consensus algorithm to replicate data in a consistent way, they are sensitive to resource starvation. You should isolate managers in your swarm from processes that might block swarm operations like swarm heartbeat or leader elections.
To avoid interference with manager node operation, you can drain manager nodes to make them unavailable as worker nodes:
docker node update --availability drain <NODE>
When you drain a node, the scheduler reassigns any tasks running on the node to other available worker nodes in the swarm. It also prevents the scheduler from assigning tasks to the node.
So it seems like we should do our best to prevent resource starvation. I wonder if many jobs get assigned to dsco1 because it has a lot of resources, and that somehow interferes with the manager functionality. :shrug:
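Concretely, the separation would look something like this (which nodes end up as managers is still TBD, so the names below are placeholders):

```bash
# From a manager: stop scheduling regular tasks onto the manager nodes
docker node update --availability drain <manager-node>
# Pin heavy services to worker nodes only
docker service update --constraint-add 'node.role==worker' <service-name>
```

In the stack files the equivalent is a deploy.placement.constraints entry, which is probably the cleaner place to keep it long-term.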
Hm, dsco2's turn to hang?
level=error msg="logs call failed" error="failed getting container logs: No such container: DockerSwarmOperator_d4c024d860a490301552900b68f01acc.1.rwzzo818ktctnbu5wcfdnfxkr
Seeing the same issues on dsco2, which is causing Janus to fail (see Slack) -- janus_app_fe is running on dsco2 right now.
07:00" level=warning msg="memberlist: failed to receive: No installed keys could decrypt the message from=192.168.2.198:47262"
07:00" level=warning msg="memberlist: failed to receive: No installed keys could decrypt the message from=192.168.2.200:60326"
07:00" level=warning msg="memberlist: failed to receive: No installed keys could decrypt the message from=192.168.2.193:54516"
The new manager node is set up; when I am back on Tuesday I will re-assign manager roles, rebalance the nodes, and set up proper constraints on the stacks ... From the above messages, it seems like the node could not connect itself to the managers ... ?
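The Tuesday plan, roughly (node names are placeholders until I decide which boxes keep the manager role; aiming for an odd number of managers so Raft keeps quorum):

```bash
docker node promote <new-manager-node>   # make the new node a manager
docker node demote <old-manager-node>    # demote an overloaded / retired manager
docker node ls                           # confirm MANAGER STATUS and the leader
```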
Hm, so on dsco3, just had a single magma_app_fe container get stuck. Cannot kill the container via docker on the node itself. Could spin up another instance in Portainer, and it came back up on a different node ... forced to restart docker on dsco3 to clear out the stuck container.
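Next time a single task wedges like this, before bouncing the whole daemon it might be worth trying a forced reschedule, which recreates the service's tasks (a sketch; the container ID would come from docker ps on the node):

```bash
# From a manager: force the scheduler to recreate the service's tasks
docker service update --force magma_app_fe
# On the node: what state does docker think the stuck container is in?
docker inspect --format '{{.State.Status}} pid={{.State.Pid}}' <container-id>
```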
dsco3 seemed stuck from Oct 21 to Oct 24 ... had to restart the docker process again ...
Just keeping running notes on what seems to be weird when docker seems to hang on some nodes ...

Incident Sept 16th, 11:12am Pacific
- docker on dsco1 and dsco2
- portainer_agent on prometheus (plus dsco1 and dsco2 via the above)
- Noticed two portainer_agent containers running on prometheus ...
- Lot of overlay2 directories when checking df on dsco2, but didn't seem excessive (see the disk-check sketch below) ...
- Timing-wise, had just done a production deploy (#1039) -- related?
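A quick disk-check recipe for the overlay2 point above (assuming the default /var/lib/docker data root):

```bash
docker system df -v | head -n 40       # what docker thinks is in use / reclaimable
sudo du -sh /var/lib/docker/overlay2   # raw size of the overlay2 layer store
df -h /var/lib/docker                  # free space on the docker data mount
```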