kiwix / operations

Kiwix Kubernetes Cluster

http://charts.k8s.kiwix.org/

k8s dns issue #113

Closed rgaudin closed 1 year ago

rgaudin commented 1 year ago

Today (2023-08-31), three times, uptime reported a 503 on stats.kiwix.org. Actually it was twice when I started writing this ticket and another one arrived 😔

I could test the service at the very same moment and saw the issue: a k8s-internal DNS resolution issue from the app (PHP) to the SQL service (sts).

100.64.7.228 -  31/Aug/2023:17:07:15 +0000 "POST /matomo.php" 200
NOTICE: PHP message: [stats.kiwix.org] Error in Matomo: Could not connect to the database:  SQLSTATE[HY000] [2002] php_network_getaddresses: getaddrinfo for matomo-db-service failed: Temporary failure in name resolution  This may be a tempora
100.64.7.228 -  31/Aug/2023:17:06:57 +0000 "GET /index.php" 500

It doesn't last long and resolves by itself.

The coredns service is running (single pod, on services) and shows no Pod Error or Pod Restart.

Considering this might be a network issue similar to #111, I restarted the app pod at 17:07.

rgaudin commented 1 year ago

Another occurrence; we should consider that it could be a side effect of the new node setup, as it started soon after. Maybe a node election issue, for instance.

kelson42 commented 1 year ago

@rgaudin might it be that we had a new occurrence one hour ago?

benoit74 commented 1 year ago

So I finally had the opportunity to diagnose a bit during an outage:

dig/nslookup of matomo-db-service fail on the stats node, both from matomo-app-deployment and from a special debug-matomo-network-tool pod I started for debugging; more exactly, they are very unstable (i.e. they work "sometimes"). The error is: communications error to 10.32.0.10#53: timed out
dig/nslookup of matomo-db-service are OK on the services and storage nodes, from other special debug-matomo-network-tool pods I started for debugging
I don't know whether it is linked or not, but this morning it is impossible to get a shell on a pod of the new system node started yesterday (I don't remember whether I tried it yesterday or not)
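
For reference, the "works sometimes" behaviour can be quantified with a small probe run from inside a pod on each node. This is only a minimal sketch, not what was actually run during the outage: the function name probe_dns and its parameters are made up for illustration, and it goes through the system resolver (the same getaddrinfo path PHP uses) rather than dig/nslookup.

```python
import socket
import time
from typing import Callable, List, Tuple


def probe_dns(
    hostname: str,
    attempts: int = 20,
    interval: float = 0.5,
    # Injectable so the probe can be exercised without real DNS traffic.
    resolver: Callable[[str], object] = lambda h: socket.getaddrinfo(h, None),
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, List[str]]:
    """Resolve `hostname` repeatedly; return (failure_count, error_messages)."""
    failures: List[str] = []
    for i in range(attempts):
        try:
            resolver(hostname)
        except socket.gaierror as exc:
            # Covers "Temporary failure in name resolution" (EAI_AGAIN)
            # as well as other resolver errors.
            failures.append(f"attempt {i + 1}: {exc}")
        if i < attempts - 1:
            sleep(interval)
    return len(failures), failures
```

Calling probe_dns("matomo-db-service") from a pod on the stats node and from a pod on the services or storage node would give a failure count per node that can be compared, instead of relying on a handful of manual dig runs.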

benoit74 commented 1 year ago

I recreated the bastion / system node from scratch. Yesterday's operation was not successful: the IP change on the new VM (to reuse the "bastion" IP and not change DNS) was not reflected properly in k8s.

rgaudin commented 1 year ago

Just to recap, we had 14 occurrences in less than 24h following the new bastion setup. ATM, the last occurrence was at 07:39 UTC. The longest time between occurrences was ~3h.

benoit74 commented 1 year ago

No more issues so far; this seems to be closely linked to the bastion node issues. That being said, the link between both issues is not fully understood (especially why it did not have any impact on other pods, at least none observable). We continue to monitor the situation and will close the issue if there is no news on Monday.

benoit74 commented 1 year ago

Closing since the problem did not happen again.

benoit74 commented 1 year ago

@Popolechien @kelson42 FYI, this outage affected Matomo statistics for "download.kiwix.org" and "library.kiwix.org" (at least; there is probably some more minor impact on other statistics). Recorded visits are lower than reality since we are missing some data due to the outage. We discussed it with @rgaudin and consider it better to keep it "as-is": reworking the missed data would be complex and could create duplicate visits, which we consider worse than missed visits (especially since we are missing only a single day).