kvm-operator: recover from failures or just die

giantswarm / kvm-operator

Handles Kubernetes clusters running on a Kubernetes cluster with workers and masters in KVMs on bare metal

Apache License 2.0

After etcd broke on lycan and Kubernetes started throwing errors, kvm-operator got stuck. Even after the etcd state was recovered, kvm-operator did not recover and had to be restarted (by killing the pod).

I would expect it to recover once all other components are fine again.

The problem is that kvm-operator kept running and nothing indicated a failure: it failed silently and stopped doing any work.

Last error message:

{"caller":"github.com/giantswarm/kvm-operator/service/operator/service.go:107","error":"[{/go/src/github.com/giantswarm/kvm-operator/vendor/github.com/giantswarm/operatorkit/tpr/tpr.go:207: creating TPR kvm.cluster.giantswarm.io} {/go/src/github.com/giantswarm/kvm-operator/vendor/github.com/giantswarm/operatorkit/tpr/tpr.go:271: creating TPR kvm.cluster.giantswarm.io} {etcdserver: mvcc: database space exceeded}]","time":"17-09-13 10:10:56.616"}

kvm-operator: recover from failures or just die #165