regular outages and missing logging

merklecounty / rget

download URLs and verify the contents against a publicly recorded cryptographic log

https://merklecounty.com

Apache License 2.0

205 stars, 17 forks

regular outages and missing logging #30

Closed: philips closed this issue 5 years ago

philips commented 5 years ago

The nodes powering the backend keep dying every few hours, causing an outage of 1-5 minutes one to three times a day. This needs to be investigated, including a few steps:

[x] Figure out why log aggregation stopped working August 1st
[x] Dig through the Prometheus metrics to see if it is an OOM or something else in the code
[ ] Wait for the next reboot and hopefully have tracebacks or something
[ ] Put limits on the containers so they can't take down a whole node
[ ] Fix it.
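
For the "put limits on the containers" step, here is a minimal sketch of what that could look like, written as a Python helper that patches a pod spec dict before it is serialized to a manifest. The `resources.requests` and `resources.limits` fields are standard Kubernetes container fields; the helper name `with_resource_limits`, the container name, the image, and the CPU/memory values are all assumptions for illustration, not what the rget backend actually uses.

```python
import copy


def with_resource_limits(pod_spec, cpu="500m", memory="256Mi"):
    """Return a copy of pod_spec where every container has requests and limits.

    Containers that already declare requests or limits keep their own values.
    The default cpu/memory figures are placeholders, not measured numbers.
    """
    patched = copy.deepcopy(pod_spec)
    for container in patched.get("containers", []):
        resources = container.setdefault("resources", {})
        resources.setdefault("requests", {"cpu": cpu, "memory": memory})
        resources.setdefault("limits", {"cpu": cpu, "memory": memory})
    return patched


# Hypothetical pod spec: the container name and image are made up.
spec = {"containers": [{"name": "backend", "image": "example/backend:latest"}]}
patched = with_resource_limits(spec)
```

With a memory limit set, a runaway container gets OOM-killed on its own instead of starving everything else on the node.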

philips commented 5 years ago

Alright, figured out why the logs are missing. The new Stackdriver setup for GKE calls the logs "Kubernetes Container" logs, not "GKE Container" logs...
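
To make the rename concrete: as far as I know, "Kubernetes Container" corresponds to the `k8s_container` monitored resource type and the legacy "GKE Container" to `container`, so a filter written against the old type silently matches nothing. A small sketch of building the filter string for either one follows; the function name and the cluster name are hypothetical.

```python
def container_log_filter(cluster_name, legacy=False):
    """Build a Stackdriver Logging filter for container logs in one cluster.

    legacy=True targets the old "GKE Container" resource type,
    legacy=False targets the newer "Kubernetes Container" type.
    """
    resource_type = "container" if legacy else "k8s_container"
    return (
        f'resource.type="{resource_type}" AND '
        f'resource.labels.cluster_name="{cluster_name}"'
    )


# "my-cluster" is a placeholder cluster name.
new_filter = container_log_filter("my-cluster")
old_filter = container_log_filter("my-cluster", legacy=True)
```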

philips commented 5 years ago

Nothing really jumped out in the Prometheus metrics, but I made some graphs in Stackdriver to hopefully get a better view into what is happening on the next outage.

philips commented 5 years ago

There were a lot of piled-up oncall-issue-filer processes on the cluster. Reduced the frequency of the cronjob by 5x: https://github.com/philips/oncall-issue-filer/issues/8
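
A rough sketch of that kind of change, as a Python helper that patches a CronJob manifest dict. The actual schedules are not recorded in this issue, so the `*/1` and `*/5` values below are assumptions that only illustrate a 5x reduction. Setting `concurrencyPolicy: Forbid` is an extra guard I am adding here, not something the linked change is known to include; it is a standard CronJob field that skips a run while the previous one is still going, which is one way to stop jobs piling up.

```python
import copy


def slow_down_cronjob(cronjob, schedule):
    """Return a copy of a CronJob manifest with a new schedule.

    Also sets concurrencyPolicy to "Forbid" so a new run is skipped
    while the previous one has not finished.
    """
    patched = copy.deepcopy(cronjob)
    spec = patched.setdefault("spec", {})
    spec["schedule"] = schedule
    spec["concurrencyPolicy"] = "Forbid"
    return patched


# Hypothetical manifest: the original schedule is assumed, not known.
cronjob = {
    "kind": "CronJob",
    "metadata": {"name": "oncall-issue-filer"},
    "spec": {"schedule": "*/1 * * * *"},
}
slower = slow_down_cronjob(cronjob, "*/5 * * * *")
```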

philips commented 5 years ago

Gah, I think it was because I made the machines preemptible VMs... testing this out.
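
One way to check for this, sketched under the assumption that the cluster runs on GKE: GKE labels preemptible nodes with `cloud.google.com/gke-preemptible=true`, so the parsed output of `kubectl get nodes -o json` can be filtered on that label. The function name and the node names in the sample data are made up.

```python
def preemptible_nodes(node_list):
    """Return the names of nodes carrying the GKE preemptible label.

    node_list is the parsed JSON from `kubectl get nodes -o json`.
    """
    names = []
    for node in node_list.get("items", []):
        metadata = node.get("metadata", {})
        labels = metadata.get("labels", {})
        if labels.get("cloud.google.com/gke-preemptible") == "true":
            names.append(metadata.get("name"))
    return names


# Hypothetical sample shaped like kubectl's JSON output.
sample = {
    "items": [
        {"metadata": {"name": "node-a",
                      "labels": {"cloud.google.com/gke-preemptible": "true"}}},
        {"metadata": {"name": "node-b", "labels": {}}},
    ]
}
found = preemptible_nodes(sample)
```

Preemptible VMs can be reclaimed at any time and last at most 24 hours, which would fit nodes disappearing a few times a day.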

philips commented 5 years ago

Alright, no more outages. I am just silly and used preemptible VMs.