giantswarm / kvm-operator

Handles Kubernetes clusters running on a Kubernetes cluster with workers and masters in KVMs on bare metal
https://godoc.org/github.com/giantswarm/kvm-operator
Apache License 2.0

kvm-operator: recover from failures or just die #165

Closed calvix closed 7 years ago

calvix commented 7 years ago

After etcd broke on lycan and k8s started throwing errors, kvm-operator got stuck. Even after the etcd state was recovered, it was unable to recover and had to be restarted (by killing the pod).

I would expect that it can recover when all other components are fine.

The problem is that kvm-operator was still running and nothing looked wrong from the outside: it failed silently and stopped doing any work.

Last error message:

{"caller":"github.com/giantswarm/kvm-operator/service/operator/service.go:107","error":"[{/go/src/github.com/giantswarm/kvm-operator/vendor/github.com/giantswarm/operatorkit/tpr/tpr.go:207: creating TPR kvm.cluster.giantswarm.io} {/go/src/github.com/giantswarm/kvm-operator/vendor/github.com/giantswarm/operatorkit/tpr/tpr.go:271: creating TPR kvm.cluster.giantswarm.io} {etcdserver: mvcc: database space exceeded}]","time":"17-09-13 10:10:56.616"}

xh3b4sd commented 7 years ago

I'll take this. It is also relevant for our other operators and microservices when they boot.
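
For illustration, the "recover or just die" behaviour requested in the title could look roughly like the sketch below. It is not the kvm-operator or operatorkit implementation. `bootWithRetry`, `errTransient` and the retry parameters are hypothetical names chosen for this example. The idea is to retry the boot step (for instance the TPR creation that failed above) with a growing wait, and to exit with a non-zero status if it never succeeds, so that Kubernetes restarts the pod instead of leaving a silently idle process.

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"os"
	"time"
)

// errTransient simulates a temporary dependency failure, such as etcd being
// unavailable. It is hypothetical and exists only for this demo.
var errTransient = errors.New("simulated transient failure")

// bootWithRetry calls boot until it succeeds or maxAttempts is reached,
// doubling the wait between attempts. On failure it returns the last error,
// wrapped with the number of attempts made.
func bootWithRetry(maxAttempts int, initialWait time.Duration, boot func() error) error {
	if maxAttempts < 1 {
		maxAttempts = 1 // always try at least once
	}
	var err error
	wait := initialWait
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = boot(); err == nil {
			return nil
		}
		log.Printf("boot attempt %d/%d failed: %v", attempt, maxAttempts, err)
		if attempt < maxAttempts {
			time.Sleep(wait)
			wait *= 2
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}

func main() {
	// Stand-in for the real boot step; it fails twice, then succeeds.
	calls := 0
	boot := func() error {
		calls++
		if calls < 3 {
			return errTransient
		}
		return nil
	}

	if err := bootWithRetry(5, 10*time.Millisecond, boot); err != nil {
		// Die loudly so the pod is restarted rather than hanging silently.
		log.Printf("fatal: %v", err)
		os.Exit(1)
	}
	log.Println("boot succeeded")
}
```

Exiting on persistent failure relies on the pod's restart policy to bring the operator back. The retry loop covers the case where the dependency recovers on its own within a short time.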