hammerlab / secotrec

Setup Coclobas/Ketrew Clusters
Apache License 2.0

Workflows start failing when relatively unhealthier/old nodes don't pull new images #70

Open armish opened 7 years ago

armish commented 7 years ago

Perks of making heavy use of multiple secotrec setups non-stop for almost a week: edge cases.

tl;dr: nodes in the cluster sometimes stop updating their images and start causing problems, since they keep trying to run an older version of the biokepi-runner. Should we start versioning the images or think about modifying the imagePullPolicy setting on the cluster?

I was sometimes seeing a fraction of my jobs failing on certain nodes but not on others (e.g. the bam2fq step always failing on gke-problematic-node). Although this is a clear and isolated problem, it is amazingly hard to put a finger on from the UI side, since there it is just another failing step with no node/pod specifics (unless you go and check them all).
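One way to make this kind of node-specific pattern visible is to group failures by the node they ran on. The following is only a minimal sketch: the `(step, node, status)` record format, the status strings, and the `gke-healthy-node` name are assumptions made up for illustration, not something secotrec or Ketrew exposes in this form.

```python
from collections import Counter


def failures_by_node(records):
    """Count failed steps per node.

    `records` is an iterable of (step, node, status) tuples; this shape
    is a hypothetical stand-in for whatever the job logs actually provide.
    """
    return Counter(node for _step, node, status in records if status == "failed")


# Hypothetical records, only to show the shape of the input.
records = [
    ("bam2fq", "gke-problematic-node", "failed"),
    ("bam2fq", "gke-healthy-node", "succeeded"),
    ("bam2fq", "gke-problematic-node", "failed"),
    ("gatk", "gke-healthy-node", "succeeded"),
]

# Nodes with the most failures come first.
print(failures_by_node(records).most_common())
```

A node that collects most of the failures for a step that succeeds elsewhere is a good candidate for a stale image.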

As I was going down the list of failed parts and fixing issues one by one, I saw that one of them failed because GATK was complaining about incompatible Java versions. That was weird because we had both updated GATK and frozen the version mutect uses; this job was trying to run the new GATK against the old image. Some poking around revealed that the node was >14 days old and was still using the biokepi image dating back to its creation.

Removing the node solved the problem and I moved on since this is not something we always see but I just came across this:

https://kukulinski.com/10-most-common-reasons-kubernetes-deployments-fail-part-2#10containerimagenotupdating

links to the official kube doc:

Two things: 1) So this does keep happening to people, unlike what I thought at first, so we should address it before someone else starts pulling their hair out debugging this. 2) Using the latest tag does make things more low-maintenance and I don't think we have to address it right away, but I do agree with Kube's suggestion from a debugging point of view. Sometimes it really helps to have the previous image easily accessible and compare it to the new one, but right now the only way to do that is to roll back the keredofi repo and rebuild the image (which is not ideal).
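For concreteness, here is a rough sketch of what a pod spec with a pinned image tag and an explicit imagePullPolicy could look like, built as a plain Python dict. The `pod_manifest` helper, the `example/biokepi-runner` image name, and the `2017-05-01` tag are all hypothetical; how Coclobas actually generates its pod specs may differ. The field names and the three policy values (`Always`, `IfNotPresent`, `Never`) are the standard Kubernetes ones.

```python
import json

VALID_PULL_POLICIES = ("Always", "IfNotPresent", "Never")


def pod_manifest(name, image, tag, pull_policy="Always"):
    """Build a minimal single-container Pod manifest as a dict.

    Hypothetical helper for illustration: pins the image to an explicit
    tag and sets imagePullPolicy explicitly instead of relying on defaults.
    """
    if pull_policy not in VALID_PULL_POLICIES:
        raise ValueError(f"unknown imagePullPolicy: {pull_policy!r}")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": name,
                    "image": f"{image}:{tag}",
                    "imagePullPolicy": pull_policy,
                }
            ],
        },
    }


# Image name and tag are placeholders, not the real biokepi image.
manifest = pod_manifest("bam2fq-job", "example/biokepi-runner", "2017-05-01")
print(json.dumps(manifest, indent=2))
```

With a versioned tag, an old node can no longer silently run a stale image under the same name, and the previous image stays available for comparison; `Always` trades some pull overhead for the guarantee that each pod start checks the registry.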