piraeusdatastore / piraeus-ha-controller

High Availability Controller for stateful workloads using storage provisioned by Piraeus
Apache License 2.0

HA Controller restarting too many times #2

Closed immanuelfodor closed 3 years ago

immanuelfodor commented 3 years ago

After the piraeus-operator 1.3.0 release (https://github.com/piraeusdatastore/piraeus-operator/releases/tag/v1.3.0), I enabled the HA Controller in Helm as described. Since then, the HA Controller has restarted 150 times, which seems high. A Nagios monitoring script alerts after a few tens of pod restarts, so the alert would eventually fire again even if the pod were deleted to reset the restart counter. Please check whether this is a bug and whether it can be fixed.

$ k get po
NAME                                         READY   STATUS    RESTARTS   AGE
piraeus-op-cs-controller-5db495d656-gnkv5    1/1     Running   4          5d
piraeus-op-csi-controller-6ccd9fbc44-cgw49   6/6     Running   0          3d22h
piraeus-op-csi-node-5j2f2                    3/3     Running   0          3d22h
piraeus-op-csi-node-7rbsr                    3/3     Running   0          3d22h
piraeus-op-csi-node-hdqhx                    3/3     Running   0          3d22h
piraeus-op-etcd-0                            1/1     Running   3          5d
piraeus-op-etcd-1                            1/1     Running   3          5d
piraeus-op-etcd-2                            1/1     Running   3          5d
piraeus-op-ha-controller-df776887b-j59ms     1/1     Running   150        3d22h
piraeus-op-ns-node-4whrm                     1/1     Running   1          4d18h
piraeus-op-ns-node-b4zpq                     1/1     Running   1          4d18h
piraeus-op-ns-node-pv429                     1/1     Running   1          4d18h
piraeus-op-operator-7466ddd49c-h776t         1/1     Running   4          5d

Here are the logs from the most recent restarts; I hope they help:

time="2020-12-30T09:31:28Z" level=info msg="starting piraeus-ha-controller" version=v0.1.1                                                                                                                                                
I1230 09:31:28.799096       1 leaderelection.go:243] attempting to acquire leader lease  piraeus/piraeus-ha-controller...                                                                                                                 
I1230 09:31:28.870384       1 leaderelection.go:253] successfully acquired lease piraeus/piraeus-ha-controller                                                                                                                            
time="2020-12-30T09:31:28Z" level=info msg="new leader" leader=piraeus-op-ha-controller-df776887b-j59ms                                                                                                                                   
time="2020-12-30T09:31:28Z" level=info msg="gained leader status"                                                                                                                                                                         
time="2020-12-30T10:03:12Z" level=fatal msg="failed to run HA Controller" error="pvc updates closed unexpectedly"                                                                                                                         

time="2020-12-30T10:03:12Z" level=info msg="starting piraeus-ha-controller" version=v0.1.1                                                                                                                                                
I1230 10:03:12.539253       1 leaderelection.go:243] attempting to acquire leader lease  piraeus/piraeus-ha-controller...                                                                                                                 
I1230 10:03:12.578347       1 leaderelection.go:253] successfully acquired lease piraeus/piraeus-ha-controller                                                                                                                            
time="2020-12-30T10:03:12Z" level=info msg="gained leader status"                                                                                                                                                                         
time="2020-12-30T10:03:12Z" level=info msg="new leader" leader=piraeus-op-ha-controller-df776887b-j59ms                                                                                                                                   
time="2020-12-30T10:40:24Z" level=fatal msg="failed to run HA Controller" error="pvc updates closed unexpectedly"                                                                                                                         

time="2020-12-30T10:40:25Z" level=info msg="starting piraeus-ha-controller" version=v0.1.1                                                                                                                                                
I1230 10:40:25.837975       1 leaderelection.go:243] attempting to acquire leader lease  piraeus/piraeus-ha-controller...                                                                                                                 
I1230 10:40:25.870270       1 leaderelection.go:253] successfully acquired lease piraeus/piraeus-ha-controller                                                                                                                            
time="2020-12-30T10:40:25Z" level=info msg="new leader" leader=piraeus-op-ha-controller-df776887b-j59ms                                                                                                                                   
time="2020-12-30T10:40:25Z" level=info msg="gained leader status"

Related Helm values:

haController:
  enabled: true
  image: quay.io/piraeusdatastore/piraeus-ha-controller:v0.1.1
  affinity: {}
  tolerations: []
  resources:
    limits:
      cpu: "0.2"
      memory: "250Mi"
    requests:
      cpu: "0.1"
      memory: "100Mi"
  replicas: 1

immanuelfodor commented 3 years ago

A few moments ago, I increased the replica count to 3 to see if it helps (the original pod is currently at 182 restarts). Three replicas also seem better suited to an HA deployment :)

pod/piraeus-op-ha-controller-df776887b-bsptk     1/1     Running   0          4m24s
pod/piraeus-op-ha-controller-df776887b-j59ms     1/1     Running   182        4d19h
pod/piraeus-op-ha-controller-df776887b-zltkd     1/1     Running   0          4m24s

immanuelfodor commented 3 years ago

Well, it didn't help :grinning:

piraeus-op-ha-controller-df776887b-bsptk     1/1     Running   42         27h
piraeus-op-ha-controller-df776887b-j59ms     1/1     Running   223        5d22h
piraeus-op-ha-controller-df776887b-zltkd     1/1     Running   42         27h

WanzenBug commented 3 years ago

Yeah, sorry about that. There seems to be a timeout on Kubernetes watches. I looked at similar projects, and they seem to restart their watches every 15 minutes to work around this. I guess we will take the same approach.
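
For readers following along, the idea of re-opening a watch instead of exiting when it ends can be sketched in plain Go. This is a stdlib-only simulation, not the controller's actual code: the `event` type, `consume`, and `runDemo` are hypothetical names, and the fake `open` function stands in for whatever call starts a real Kubernetes watch. A periodic restart, such as the 15 minutes mentioned above, could be added as a further timer case in the `select`.

```go
package main

import (
	"context"
	"fmt"
)

// event is a hypothetical stand-in for a Kubernetes watch event.
type event struct {
	ResourceVersion int
}

// consume handles events until ctx is cancelled. When the update channel
// closes, it opens a fresh watch instead of treating the close as fatal.
// It returns how many times the watch had to be reopened.
func consume(ctx context.Context, open func() <-chan event, handle func(event)) int {
	restarts := 0
	for {
		updates := open()
	session:
		for {
			select {
			case <-ctx.Done():
				return restarts
			case ev, ok := <-updates:
				if !ok {
					break session // the server ended this watch
				}
				handle(ev)
			}
		}
		if ctx.Err() != nil {
			return restarts
		}
		restarts++
	}
}

// runDemo simulates two watch sessions that the server closes after two
// events each, followed by a shutdown on the third attempt.
func runDemo() (handled []int, restarts int) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	calls, next := 0, 0
	open := func() <-chan event {
		calls++
		ch := make(chan event, 2)
		if calls > 2 {
			cancel() // simulate the controller shutting down
		} else {
			for i := 0; i < 2; i++ {
				next++
				ch <- event{ResourceVersion: next}
			}
		}
		close(ch) // simulate the server-side watch timeout
		return ch
	}

	restarts = consume(ctx, open, func(ev event) {
		handled = append(handled, ev.ResourceVersion)
	})
	return handled, restarts
}

func main() {
	handled, restarts := runDemo()
	fmt.Println("handled resource versions:", handled)
	fmt.Println("watch reopened:", restarts, "times")
}
```

The point of the sketch is only that a closed update channel becomes a reason to loop rather than a fatal error like the `pvc updates closed unexpectedly` message in the logs above.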

immanuelfodor commented 3 years ago

I'll try manually updating the chart value here to v0.1.2 until it is patched: https://github.com/piraeusdatastore/piraeus-operator/blob/master/charts/piraeus/values.yaml#L87

immanuelfodor commented 3 years ago

It seems to be solved: instead of 60+ restarts in the past 41h, only one has occurred since the deploy.

Last log before the restart (all three pods show a similar one):

time="2021-01-13T07:42:15Z" level=fatal msg="failed to run HA Controller" error="error processing event: watch error: &Status{ListMeta:ListMeta{SelfLink:,ResourceVersion:,Continue:,RemainingItemCount:nil,},Status:Failure,Message:too old resource version: 134128587 (134334168),Reason:Expired,Details:nil,Code:410,}"

It might help to know which event or resource caused it.

WanzenBug commented 3 years ago

Looks like I didn't create enough load on Pods/PVCs/VolumeAttachments in my tests, so I didn't notice that I never actually received an updated ResourceVersion from Kubernetes. If the watch was restarted after enough time had passed, Kubernetes would complain that the given resource version is too old. That's the error you are receiving above. #6 should take care of that issue, too.
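
A rough illustration of the failure mode described above, again as a stdlib-only Go simulation rather than the actual fix in #6: a client that never advances its last-seen resource version eventually asks to resume from a version the server no longer retains. The `fakeServer` type, `errExpired`, and `resume` are all hypothetical names, and the small version numbers are made up for the example. The sketch assumes one common recovery pattern, which is to record the version of every event and, on an "expired" error, re-list to obtain a fresh version before watching again.

```go
package main

import (
	"errors"
	"fmt"
)

// errExpired is a hypothetical stand-in for the 410 "Expired" watch error.
var errExpired = errors.New("too old resource version")

// fakeServer simulates a server that only retains a window of versions.
type fakeServer struct {
	oldest int // oldest resource version a watch may still resume from
	latest int // newest resource version
}

// watch returns the versions newer than from, or errExpired if from
// is older than what the server still retains.
func (s *fakeServer) watch(from int) ([]int, error) {
	if from < s.oldest {
		return nil, errExpired
	}
	var versions []int
	for v := from + 1; v <= s.latest; v++ {
		versions = append(versions, v)
	}
	return versions, nil
}

// list returns the current resource version, as a fresh list call would.
func (s *fakeServer) list() int { return s.latest }

// resume restarts a watch from lastSeen. If that version has expired, it
// re-lists to get a fresh starting point. It returns the updated last-seen
// version and whether a re-list was needed.
func resume(s *fakeServer, lastSeen int) (int, bool, error) {
	relisted := false
	versions, err := s.watch(lastSeen)
	if errors.Is(err, errExpired) {
		lastSeen = s.list()
		relisted = true
		versions, err = s.watch(lastSeen)
	}
	if err != nil {
		return lastSeen, relisted, err
	}
	// Record the version of every event so the next restart resumes
	// from a recent point instead of the original starting version.
	for _, v := range versions {
		lastSeen = v
	}
	return lastSeen, relisted, nil
}

func main() {
	s := &fakeServer{oldest: 100, latest: 105}

	last, relisted, err := resume(s, 102)
	fmt.Println("fresh client:", last, relisted, err)

	last, relisted, err = resume(s, 50)
	fmt.Println("stale client:", last, relisted, err)
}
```

In this model the stale client corresponds to a watcher that kept restarting from its initial version, which matches the `Reason:Expired` and `Code:410` fields in the log above.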

immanuelfodor commented 3 years ago

Thank you, I've just upgraded to the new version :)