Open baznikin opened 3 years ago
@baznikin since in our tests a full k8s API restart always worked, can you provide a simple reproducer of this issue so we can debug it and add a related test case?
Hmm, it's hard; this failure was unintentional. The only way to recreate it would be to run similar virtual machines at a smaller scale on a workstation and abort them. It was not a restart, it was a hard crash.
Is it possible to extract some information from the configmap and correlate it with the sentinel and keeper complaints?
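One rough way to do that correlation offline is to query the clusterdata JSON for keepers the sentinel marks unhealthy and compare against the keeper logs. This is only a sketch: the keeper names and `healthy` flags below are illustrative sample data, not taken from this cluster.

```shell
# Abridged sample clusterdata, modeled loosely on stolon's format
# (illustrative values only).
clusterdata='{
  "cluster": {"status": {"phase": "normal", "master": "db0"}},
  "keepers": {
    "keeper0": {"status": {"healthy": true}},
    "keeper1": {"status": {"healthy": false}}
  }
}'
# List keepers the sentinel currently considers unhealthy.
echo "$clusterdata" | jq -r '
  .keepers | to_entries[]
  | select(.value.status.healthy == false)
  | .key'
# → keeper1
```

Any keeper printed here would be the first place to look in the keeper logs for matching complaints.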
I have encountered this problem too, but only by accident. I have been trying to reproduce it, but I have not managed to; it happened when my servers rebooted.
What happened: I have a new bare-metal k8s cluster on my desk. Due to a power failure it rebooted twice. I didn't check stolon status after the first reboot (if it is important, I can extract logs from Elastic). After the second reboot the stolon cluster didn't come up.
I was able to bring it up by deleting the sentinel pods one by one. This behaviour is strange and should be addressed.
Honestly, I have no idea why it broke or why it recovered. Since I have a few production installations of stolon, I'm very interested in tracking down the cause so you can fix it.
Helm vars
Image
127.0.0.1:5000/stolon
built with Dockerfile:

Failure state
kubectl -n db get pod
stolon-keeper-0
stolon-keeper-1
For the sentinels I got only the last 3 log lines on my terminal, but if it's important I can extract the startup messages from Elastic.
kubectl -n db logs --tail=3 stolon-sentinel-85969d666d-bjpt7
kubectl -n db logs --tail=3 stolon-sentinel-85969d666d-s2k68
kubectl -n db logs --tail=3 stolon-proxy-756cb878f-j6v9d
root@stolon-sentinel-85969d666d-s2k68:/#
stolonctl --cluster-name stolon --store-backend kubernetes --kube-resource-kind configmap status
root@stolon-sentinel-85969d666d-s2k68:/#
stolonctl --cluster-name stolon --store-backend kubernetes --kube-resource-kind configmap clusterdata read
kubectl -n db get configmaps stolon-cluster-stolon -o yaml
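For offline diffing, the clusterdata can also be pulled straight out of the configmap rather than via stolonctl. The sketch below simulates the `-o json` output with an inline sample so it runs standalone; the `stolon-clusterdata` annotation name is assumed from stolon's kubernetes store backend, and the payload here is made up.

```shell
# Simulated `kubectl -n db get configmap stolon-cluster-stolon -o json`
# output (structure assumed; payload is a placeholder).
cm='{
  "metadata": {
    "name": "stolon-cluster-stolon",
    "annotations": {
      "stolon-clusterdata": "{\"cluster\":{\"status\":{\"phase\":\"normal\"}}}"
    }
  }
}'
# Extract the clusterdata JSON embedded in the annotation,
# then read a field from it.
echo "$cm" \
  | jq -r '.metadata.annotations["stolon-clusterdata"]' \
  | jq -r '.cluster.status.phase'
# → normal
```

Against a real cluster, the first command would be the actual `kubectl ... -o json` call, and the extracted JSON can be saved before and after a recovery attempt to see exactly what changed.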
(blind) Recovery steps
1) delete one sentinel
kubectl -n db delete pod stolon-sentinel-85969d666d-bjpt7
kubectl -n db logs stolon-sentinel-85969d666d-dtqc5
kubectl -n db logs stolon-sentinel-85969d666d-s2k68 --tail
(no changes)

2) delete second sentinel
kubectl -n db delete pod stolon-sentinel-85969d666d-s2k68
kubectl -n db logs stolon-sentinel-85969d666d-pd2vx
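The blind recovery above can be sketched as a loop that restarts the sentinel pods one at a time. `kubectl` is stubbed here so the sketch runs standalone; the pod names are the ones from this report, and in a real cluster you would wait for the replacement pod and re-check `stolonctl ... status` between deletions.

```shell
# Stub kubectl for illustration only; remove this line to run for real.
kubectl() { echo "kubectl $*"; }

for pod in stolon-sentinel-85969d666d-bjpt7 stolon-sentinel-85969d666d-s2k68; do
  kubectl -n db delete pod "$pod"
  # Real-world step between deletions (not stubbed here):
  # wait for the replacement sentinel to come up, then check
  # `stolonctl --cluster-name stolon --store-backend kubernetes \
  #   --kube-resource-kind configmap status` before continuing.
done
```

Deleting the sentinels one by one (rather than all at once) preserves at least one running sentinel throughout, which matches the recovery sequence that worked above.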
Partial logs of stolon-keeper-0
kubectl -n db get configmaps stolon-cluster-stolon -o yaml
What you expected to happen: The cluster should survive a sudden reboot without losses.
How to reproduce it (as minimally and precisely as possible): Play around with the power source ¯\_(ツ)_/¯
Anything else we need to know?:
Environment: