merklecounty / rget

download URLs and verify the contents against a publicly recorded cryptographic log
https://merklecounty.com
Apache License 2.0

regular outages and missing logging #30

Closed philips closed 5 years ago

philips commented 5 years ago

The nodes powering the backend keep dying every few hours, causing an outage of 1-5 minutes 1 to 3 times a day. This needs to be investigated, including a few steps:

philips commented 5 years ago

Alright, figured out why the logs are missing. The new Stackdriver setup for GKE calls the logs "Kubernetes Container" logs, not "GKE Container" logs...
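For anyone hitting the same thing, here is a minimal sketch of how the two names map to log filters. It assumes the display names "Kubernetes Container" and "GKE Container" correspond to the Cloud Logging resource types `k8s_container` and `gke_container`. The helper `containerLogFilter` and the cluster name `example-cluster` are hypothetical, for illustration only.

```go
package main

import "fmt"

// Resource type identifiers as used in Cloud Logging filters.
// Assumption: these match the display names mentioned in this thread.
const (
	legacyGKEContainer  = "gke_container" // displayed as "GKE Container"
	kubernetesContainer = "k8s_container" // displayed as "Kubernetes Container"
)

// containerLogFilter is a hypothetical helper that builds a log filter
// string for one resource type and one cluster name.
func containerLogFilter(resourceType, clusterName string) string {
	return fmt.Sprintf("resource.type=%q AND resource.labels.cluster_name=%q",
		resourceType, clusterName)
}

func main() {
	// A saved query written for the legacy type matches nothing once the
	// cluster reports under the newer type, so the logs look "missing".
	fmt.Println(containerLogFilter(legacyGKEContainer, "example-cluster"))
	fmt.Println(containerLogFilter(kubernetesContainer, "example-cluster"))
}
```

Switching the filter to the second form is what should make the container logs show up again in the viewer.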

philips commented 5 years ago

Nothing really jumped out in the Prometheus logs. But I made some graphs in Stackdriver to hopefully get a better view into what is happening on the next outage.

philips commented 5 years ago

There were a lot of piled-up oncall-issue-filer processes on the cluster. Reduced the frequency of the cronjob by 5x: https://github.com/philips/oncall-issue-filer/issues/8

philips commented 5 years ago

gah, I think it was because I made the machines preemptible VMs... testing this out.
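One way to confirm this from inside a node is to ask the Compute Engine metadata server. This is a rough sketch, not what was actually run here: it assumes the `instance/scheduling/preemptible` metadata key answers `TRUE` or `FALSE`, and the helper names `parsePreemptible` and `isPreemptible` are made up for illustration.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
	"time"
)

// Base URL of the metadata server, reachable only from inside a GCE VM.
const metadataBase = "http://metadata.google.internal/computeMetadata/v1"

// parsePreemptible interprets the metadata response body.
// Assumption: the server answers "TRUE" or "FALSE".
func parsePreemptible(body string) bool {
	return strings.EqualFold(strings.TrimSpace(body), "TRUE")
}

// isPreemptible is a hypothetical helper that queries the scheduling
// metadata of the VM it runs on.
func isPreemptible(client *http.Client, base string) (bool, error) {
	req, err := http.NewRequest(http.MethodGet, base+"/instance/scheduling/preemptible", nil)
	if err != nil {
		return false, err
	}
	// The metadata server rejects requests without this header.
	req.Header.Set("Metadata-Flavor", "Google")
	resp, err := client.Do(req)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return false, fmt.Errorf("metadata server returned %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return false, err
	}
	return parsePreemptible(string(body)), nil
}

func main() {
	client := &http.Client{Timeout: 2 * time.Second}
	preemptible, err := isPreemptible(client, metadataBase)
	if err != nil {
		fmt.Println("could not query metadata server:", err)
		return
	}
	fmt.Println("preemptible:", preemptible)
}
```

If this reports `true` on the backend nodes, periodic node loss is expected behavior rather than a crash, since preemptible VMs can be reclaimed at any time.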

philips commented 5 years ago

Alright, no more outages. I am just silly and used preemptible VMs.