philips closed this issue 5 years ago
Alright, figured out why the logs are missing. The new Stackdriver setup for GKE calls the logs "Kubernetes Container" logs, not "GKE Container" logs...
Nothing really jumped out in the Prometheus logs. But I made some graphs in Stackdriver to hopefully get a better view into what is happening during the next outage.
There were a lot of piled up oncall-issue-filer processes on the cluster. Reduced the frequency of the cronjob by 5x: https://github.com/philips/oncall-issue-filer/issues/8
Gah, I think it was because I made the machines preemptible VMs... testing this out.
Alright, no more outages. I am just silly and used preemptible VMs.
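For anyone hitting the same thing, here is a rough sketch of how one might check whether a cluster's nodes are preemptible. It assumes GKE has put the `cloud.google.com/gke-preemptible` label on those nodes and that `kubectl` is pointed at the right cluster; the function names are made up for illustration and this is not what was actually run on this cluster.

```python
import json
import subprocess
from typing import List

# Label that GKE is assumed to set on preemptible nodes in this cluster.
PREEMPTIBLE_LABEL = "cloud.google.com/gke-preemptible"


def preemptible_nodes(node_list_json: str) -> List[str]:
    """Return the names of nodes whose preemptible label is "true".

    Expects the JSON printed by `kubectl get nodes -o json`.
    """
    items = json.loads(node_list_json).get("items", [])
    return [
        node["metadata"]["name"]
        for node in items
        if node["metadata"].get("labels", {}).get(PREEMPTIBLE_LABEL) == "true"
    ]


def fetch_nodes_json() -> str:
    """Hypothetical helper: shell out to kubectl and return the node list as JSON."""
    result = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

Calling `preemptible_nodes(fetch_nodes_json())` should list any preemptible nodes. If the backend's nodes show up there, periodic node deaths are expected behavior rather than a bug.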
The nodes powering the backend keep dying every few hours, causing an outage of 1-5 minutes 1 to 3 times a day. This needs to be investigated, including a few steps: