Closed: rgaudin closed this issue 1 year ago.
Another occurrence; we should consider that it could be a side effect of the new node setup, as it started soon after. Maybe a node election issue, for instance.
@rgaudin could it be that we had a new occurrence one hour ago?
So I finally had the opportunity to diagnose a bit during an outage:

- DNS resolutions do not work on the `stats` node (either from `matomo-app-deployment` or from a special `debug-matomo-network-tool` pod I started for debugging); or more exactly, they are very unstable (i.e. they work "sometimes"). The error is `communications error to 10.32.0.10#53: timed out`.
- DNS resolutions work on the `services` and `storage` nodes, with other special `debug-matomo-network-tool` pods I started for debugging.
- The `system` node started yesterday (I don't remember if I did it yesterday or not).

I recreated the bastion / `system` node from scratch yesterday; the operation was not successful: the IP change on the new VM (done to reuse the "bastion" IP and not change DNS) was not reflected properly in k8s.
Just to recap: we had 14 occurrences in less than 24 hours following the new bastion setup. ATM the last occurrence was at 07:39 UTC. The longest time between occurrences was ~3h.
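For the record, gap figures like these can be recomputed from the monitoring alert times in a few lines; the timestamps below are illustrative placeholders, not the real alert list:

```python
from datetime import datetime

# Illustrative alert timestamps (UTC); the real list came from our monitoring.
occurrences = [
    "2023-09-05 17:39", "2023-09-05 19:02", "2023-09-05 22:05",
    "2023-09-06 01:11", "2023-09-06 04:20", "2023-09-06 07:39",
]
times = [datetime.strptime(t, "%Y-%m-%d %H:%M") for t in occurrences]
# hours elapsed between each consecutive pair of alerts
gaps = [(b - a).total_seconds() / 3600 for a, b in zip(times, times[1:])]
print(f"occurrences: {len(times)}, longest gap: {max(gaps):.1f}h")
```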
No more issues until now; this seems to be closely linked to the bastion node issues.
That being said, the link between the two issues is not fully understood (especially why it did not have any impact on other pods, at least not an observable one).
We continue to monitor the situation and will close the issue if there is no news by Monday.
Closing since the problem did not happen again.
@Popolechien @kelson42 FYI, this outage caused an issue with Matomo statistics for "download.kiwix.org" and "library.kiwix.org" (at least; there is probably some more minor impact on other statistics). Recorded visits are lower than reality since we are missing some data due to the outage. We discussed it with @rgaudin and consider it better to keep things "as-is": reconstructing the missed data would be complex and could cause duplicate visits, which we consider worse than missed visits (especially since we only miss a single day).
Today (2023-08-31), uptime monitoring reported a 503 on stats.kiwix.org three times. Actually it was twice when I started writing this ticket, and another one arrived since.
I could test the service at that very moment and saw the issue: a k8s-internal DNS resolution failure from the app (PHP) to the SQL service (sts).
It doesn't last long and resolves on its own.
The coredns service is running (a single pod on the `services` node) and shows no Pod Errors nor Pod Restarts.
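That check can be scripted against the plain-text output of `kubectl get pods`. A small sketch, assuming the usual column layout (the sample row and pod name below are illustrative, not from our cluster):

```python
# Sample of what `kubectl -n kube-system get pods -l k8s-app=kube-dns`
# might print; in practice this text would come from running kubectl.
sample = """\
NAME                       READY   STATUS    RESTARTS   AGE
coredns-5d78c9869d-abcde   1/1     Running   0          12d
"""

def coredns_healthy(kubectl_output: str) -> bool:
    """True if a coredns pod is listed as Running with zero restarts."""
    for line in kubectl_output.splitlines()[1:]:  # skip the header row
        name, ready, status, restarts, *_ = line.split()
        if name.startswith("coredns"):
            return status == "Running" and restarts == "0"
    return False  # no coredns pod found at all

print(coredns_healthy(sample))  # prints: True
```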
Considering this might be a network issue similar to #111, I restarted the app pod at 17:07.