Closed: immanuelfodor closed this issue 3 years ago
A few moments ago I increased the replica count to 3 to see if it helps (currently at 182 restarts); it also seems better for an HA deployment :)
pod/piraeus-op-ha-controller-df776887b-bsptk 1/1 Running 0 4m24s
pod/piraeus-op-ha-controller-df776887b-j59ms 1/1 Running 182 4d19h
pod/piraeus-op-ha-controller-df776887b-zltkd 1/1 Running 0 4m24s
Well, it didn't help :grinning:
piraeus-op-ha-controller-df776887b-bsptk 1/1 Running 42 27h
piraeus-op-ha-controller-df776887b-j59ms 1/1 Running 223 5d22h
piraeus-op-ha-controller-df776887b-zltkd 1/1 Running 42 27h
Yeah, sorry about that. There seems to be a timeout on Kubernetes watches. I looked at similar projects, and they seem to restart their watches every 15 minutes to work around this. I guess we will take the same approach.
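For illustration only, a watch loop that is torn down and re-established on a fixed interval might look roughly like this minimal client-go sketch. The 15-minute timeout, the pod watch, and all names are assumptions, not the actual HA Controller code:

```go
// Hypothetical sketch of the workaround described above: instead of keeping
// one watch open forever, the watch is re-established on a fixed interval.
package main

import (
	"context"
	"log"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatalf("failed to load in-cluster config: %v", err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatalf("failed to create clientset: %v", err)
	}

	for {
		// Ask the API server to close the watch after ~15 minutes, so the
		// loop re-establishes it regularly instead of relying on one
		// long-lived connection.
		timeout := int64((15 * time.Minute).Seconds())
		w, err := client.CoreV1().Pods("").Watch(context.Background(), metav1.ListOptions{
			TimeoutSeconds: &timeout,
		})
		if err != nil {
			log.Printf("watch failed, retrying: %v", err)
			time.Sleep(5 * time.Second)
			continue
		}
		for ev := range w.ResultChan() {
			// Hand the event to the normal processing logic here.
			log.Printf("got event: %s", ev.Type)
		}
		// The channel closed (timeout or server-side reset): loop around and
		// re-establish the watch instead of exiting fatally.
	}
}
```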
I'll try to manually update the chart value here to v0.1.2 until it is patched: https://github.com/piraeusdatastore/piraeus-operator/blob/master/charts/piraeus/values.yaml#L87
It seems to be solved: instead of 60+ restarts in the previous 41h, only one has occurred since the deploy.
Last log before the restart (all three pods show similar output):
time="2021-01-13T07:42:15Z" level=fatal msg="failed to run HA Controller" error="
error processing event: watch error: &Status{ListMeta:ListMeta{SelfLink:,Resource
Version:,Continue:,RemainingItemCount:nil,},Status:Failure,Message:too old resour
ce version: 134128587 (134334168),Reason:Expired,Details:nil,Code:410,}"
It might be helpful to know which event or resource caused it.
It looks like my tests didn't create enough load on Pods/PVCs/VolumeAttachments, so I didn't notice that I never actually received an updated ResourceVersion from Kubernetes. When the watch was restarted after enough time had passed, Kubernetes complained that the given resource version was too old. That's the error you are seeing above. #6 should take care of that issue, too.
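For illustration, the usual recovery for that error looks like the following minimal client-go sketch: when the remembered resourceVersion has expired (HTTP 410, the &Status{...Code:410,...} in the log above), the controller re-lists to obtain a fresh version and restarts the watch instead of exiting fatally. All names here are hypothetical; this is not the actual change from #6:

```go
// Hypothetical sketch of recovering from a 410 "Expired" watch error by
// re-listing for a fresh resourceVersion rather than terminating.
package main

import (
	"context"
	"log"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func runWatch(ctx context.Context, client kubernetes.Interface) error {
	rv := "" // an empty resourceVersion forces an initial list

	for {
		if rv == "" {
			// (Re-)list to obtain a consistent, current resourceVersion.
			list, err := client.CoreV1().Pods("").List(ctx, metav1.ListOptions{})
			if err != nil {
				return err
			}
			rv = list.ResourceVersion
		}

		w, err := client.CoreV1().Pods("").Watch(ctx, metav1.ListOptions{ResourceVersion: rv})
		if err != nil {
			if apierrors.IsResourceExpired(err) || apierrors.IsGone(err) {
				rv = "" // stale version: re-list on the next iteration
				continue
			}
			return err
		}

		for ev := range w.ResultChan() {
			if ev.Type == watch.Error {
				// The server can also deliver the expiry as an error event,
				// which is the &Status{...Code:410,...} shown in the log above.
				statusErr := apierrors.FromObject(ev.Object)
				if apierrors.IsResourceExpired(statusErr) || apierrors.IsGone(statusErr) {
					rv = ""
				}
				break
			}
			// Track the newest resourceVersion so a restarted watch resumes
			// close to where the previous one stopped.
			if obj, ok := ev.Object.(metav1.Object); ok {
				rv = obj.GetResourceVersion()
			}
		}
		w.Stop()
	}
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	if err := runWatch(context.Background(), kubernetes.NewForConfigOrDie(cfg)); err != nil {
		log.Fatalf("failed to run watch loop: %v", err)
	}
}
```

client-go informers bundle this list/watch/re-list cycle, so using a SharedInformer is another common way to avoid this class of error entirely.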
Thank you, I've just upgraded to the new version :)
After the piraeus-operator 1.3.0 release (https://github.com/piraeusdatastore/piraeus-operator/releases/tag/v1.3.0), I enabled the HA Controller in Helm as described. Since then the HA Controller has been restarted 150 times, which seems excessively high. A Nagios monitoring script alerts after a few tens of pod restarts, so this alert would always fire again after some time, even if the pod is deleted to reset the restart counter. Please look into whether this is a bug and whether it can be fixed.
Here are the logs of the most recent restarts; I hope they help:
Related Helm values: