coleshaw opened 2 years ago
Okay, just did another production deploy, and getting Prometheus errors:
[FIRING:1] (Monitored Instance(s) not responding 1 10.0.4.186:3000 etna critical)
It appears that the docker daemon on dsco1 may not be responding? I wonder if our deploy process somehow overloads the docker daemons, because it tries restarting a lot of containers at the same time?
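For next time, a quick way to confirm whether the daemon itself is hung rather than just slow (assuming SSH access to the node and systemd-managed docker):

```bash
# Does the daemon answer at all within 10s?
ssh dsco1 'timeout 10 docker info > /dev/null && echo responsive || echo "not responding"'
# systemd's view of the service, plus the last hour of daemon logs
ssh dsco1 'systemctl status docker --no-pager | head -n 20'
ssh dsco1 'journalctl -u docker --since "1 hour ago" --no-pager | tail -n 50'
```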
Hm, also note that the service agents_pruner has not run correctly on dsco1 since 9/16 -- it failed with shutdown status, attempted to run again on 9/18, and is stuck on starting. I wonder if the system doesn't get cleaned up correctly, and then gets stuck?
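To see the pruner's task history (the failed 9/16 run and the stuck 9/18 one), something like this from a manager node should show per-task state and any error messages:

```bash
# Full task history for the service, without truncating error messages
docker service ps agents_pruner --no-trunc
# Dig into the stuck task's status (task ID taken from the output above; run on a manager)
docker inspect <task-id> --format '{{json .Status}}'
```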
Running the pruner more frequently (every 12h), to see if that helps. Also restarted docker on dsco1 ...
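For the daemon restart, roughly this sequence (assuming we can afford to drain the node first so the scheduler moves its tasks instead of them dying mid-restart):

```bash
# From a manager: stop scheduling onto dsco1 and move its tasks elsewhere
docker node update --availability drain dsco1
# On dsco1 itself (assuming systemd-managed docker)
sudo systemctl restart docker
# From a manager: put the node back into rotation
docker node update --availability active dsco1
```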
Just scanning through the logs of the completed pruner jobs, it seems like some nodes do wind up with a lot of images / containers to clean (anywhere from 0GB to 3.5GB to 8.8GB), even when running every 12 hours. Many seem to be archimedes, archimedes-node, and polyphemus ones. Perhaps all the Airflow run_on_docker jobs, plus the archimedes ones? Maybe if there were too many at the 24hr mark, they would freeze the prune job?
There also seems to be an unbalanced distribution of these ephemeral jobs across nodes, so perhaps that leads to the overloaded nodes hanging?
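For reference, I'm assuming agents_pruner boils down to docker's prune commands; a rough sketch of what a 12h cleanup pass would amount to (the exact flags/filters it actually uses may differ):

```bash
# Remove stopped containers, unused images, and unused networks older than 12h
docker container prune --force --filter "until=12h"
docker image prune --all --force --filter "until=12h"
docker network prune --force --filter "until=12h"
```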
Okay, looks like Magma was down all weekend, and similarly looks like vulcan-node's docker daemon may have frozen since then. The clean-up task on 10/1 got stuck:
Errors on the server for the docker daemon are:
Oct 03 04:58:37 vulcan-node dockerd[30124]: time="2022-10-03T04:58:37.412905793-07:00" level=error msg="Failed to join memberlist [192.168.2.198 192.168.2.159] on retry:
Oct 03 04:59:10 vulcan-node dockerd[30124]: time="2022-10-03T04:59:10.735372427-07:00" level=warning msg="NetworkDB stats vulcan-node(39c1ad69adad) - healthscore:7 (conn
Oct 03 04:59:28 vulcan-node dockerd[30124]: time="2022-10-03T04:59:28.339058994-07:00" level=error msg="Failed to join memberlist [192.168.2.198 192.168.2.159] on retry:
Oct 03 04:59:29 vulcan-node dockerd[30124]: time="2022-10-03T04:59:29.339013826-07:00" level=error msg="Failed to join memberlist [192.168.2.198 192.168.2.159] on retry:
Oct 03 04:59:30 vulcan-node dockerd[30124]: time="2022-10-03T04:59:30.344490524-07:00" level=error msg="Failed to join memberlist [192.168.2.198 192.168.2.159] on retry: 2 errors occurred:\n\t* Failed to join 192.168.2.198: No installed keys could decrypt the message\n\t* Failed to join 192.168.2.159: No installed keys could decrypt th
Note that the IP address given for 159 is medusa, and 198 is monitor / prometheus ... so this node is somehow not able to join the swarm or contact the swarm managers? Will restart the node and investigate the keys issue, maybe I can better pinpoint the root cause ...
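If a plain restart doesn't clear the key mismatch, the heavier option is to have vulcan-node leave and rejoin the swarm so it picks up fresh gossip keys (a sketch only; the join token and manager address come from a healthy manager):

```bash
# On vulcan-node: force-leave, since it still thinks it is a member
docker swarm leave --force
# On a healthy manager: print the current worker join command
docker swarm join-token worker
# Back on vulcan-node: run the join command printed above
docker swarm join --token <worker-token> <manager-ip>:2377
```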
Hm, and it also looks like medusa switched its swarm cert / key on Oct 1:
Oct 1 23:00 swarm-node.crt
Even though it's not currently the leader (prometheus-dsco is). I wonder if it's related to leader rotations? Something like this GH issue?
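To compare when each node's cert was last rotated (the path below is the default swarm certificate location, and the info template field may vary by docker version):

```bash
# On each node: validity window and subject of the node's swarm TLS cert
sudo openssl x509 -noout -dates -subject \
  -in /var/lib/docker/swarm/certificates/swarm-node.crt
# From a manager: the configured node cert expiry (rotation interval)
docker info --format '{{.Swarm.Cluster.Spec.CAConfig.NodeCertExpiry}}'
```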
A little more context around why I want to cleanly separate out our manager and worker nodes, is this section of the swarm docs:
By default manager nodes also act as worker nodes. This means the scheduler can assign tasks to a manager node. For small and non-critical swarms, assigning tasks to managers is relatively low-risk as long as you schedule services using resource constraints for cpu and memory.
However, because manager nodes use the Raft consensus algorithm to replicate data in a consistent way, they are sensitive to resource starvation. You should isolate managers in your swarm from processes that might block swarm operations like swarm heartbeat or leader elections.
To avoid interference with manager node operation, you can drain manager nodes to make them unavailable as worker nodes:
docker node update --availability drain <NODE>
When you drain a node, the scheduler reassigns any tasks running on the node to other available worker nodes in the swarm. It also prevents the scheduler from assigning tasks to the node.
So it seems like we should do our best to prevent resource starvation. I wonder if many jobs get assigned to dsco1 because it has a lot of resources, and that somehow interferes with the manager functionality. :shrug:
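Concretely, the separation would look something like this (which nodes end up as managers is still TBD, so the names below are placeholders):

```bash
# From a manager: stop scheduling regular tasks onto the manager nodes
docker node update --availability drain <manager-node>
# Pin heavy services to worker nodes only
docker service update --constraint-add 'node.role==worker' <service-name>
```

In the stack files the equivalent is a deploy.placement.constraints entry, which is probably the cleaner place to keep it long-term.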
Hm, dsco2's turn to hang?
level=error msg="logs call failed" error="failed getting container logs: No such container: DockerSwarmOperator_d4c024d860a490301552900b68f01acc.1.rwzzo818ktctnbu5wcfdnfxkr
Seeing the same issues on dsco2, which is causing Janus to fail (see Slack) -- janus_app_fe is running on dsco2 right now.
07:00" level=warning msg="memberlist: failed to receive: No installed keys could decrypt the message from=192.168.2.198:47262"
07:00" level=warning msg="memberlist: failed to receive: No installed keys could decrypt the message from=192.168.2.200:60326"
07:00" level=warning msg="memberlist: failed to receive: No installed keys could decrypt the message from=192.168.2.193:54516"
The new manager node is set up; when I am back on Tuesday I will re-assign manager roles, rebalance the nodes, and set up proper constraints on the stacks ... From the above messages, it seems like the node could not connect itself to the managers ... ?
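The Tuesday plan, roughly (node names are placeholders until I decide which boxes keep the manager role; aiming for an odd number of managers so Raft keeps quorum):

```bash
docker node promote <new-manager-node>   # make the new node a manager
docker node demote <old-manager-node>    # demote an overloaded / retired manager
docker node ls                           # confirm MANAGER STATUS and the leader
```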
Hm, so on dsco3, just had a single magma_app_fe container get stuck. Cannot kill the container via docker on the node itself. Could spin up another instance in Portainer, and it came back up on a different node ... forced to restart docker on dsco3 to clear out the stuck container.
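Next time a single task wedges like this, before bouncing the whole daemon it might be worth trying a forced reschedule, which recreates the service's tasks (a sketch; the container ID would come from docker ps on the node):

```bash
# From a manager: force the scheduler to recreate the service's tasks
docker service update --force magma_app_fe
# On the node: what state does docker think the stuck container is in?
docker inspect --format '{{.State.Status}} pid={{.State.Pid}}' <container-id>
```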
dsco3 seemed stuck from Oct 21 to Oct 24 ... had to restart the docker process again ...
Just keeping running notes on what seems to be weird when docker seems to hang on some nodes ...

Incident Sept 16th, 11:12am Pacific
- docker on dsco1 and dsco2
- portainer_agent on prometheus (plus dsco1 and dsco2 via the above)
- Noticed two portainer_agent containers running on prometheus ...
- Lot of overlay2 directories when checking df on dsco2, but didn't seem excessive (see the disk-check sketch below) ...
- Timing-wise, had just done a production deploy (#1039) -- related?
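A quick disk-check recipe for the overlay2 point above (assuming the default /var/lib/docker data root):

```bash
docker system df -v | head -n 40       # what docker thinks is in use / reclaimable
sudo du -sh /var/lib/docker/overlay2   # raw size of the overlay2 layer store
df -h /var/lib/docker                  # free space on the docker data mount
```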