getaddrinfo ENOTFOUND occasionally

sudoexec commented 3 months ago

📑 I have found these related issues/pull requests

🛡️ Security Policy

[X] I agree to have read this project Security Policy

Description

I occasionally get getaddrinfo ENOTFOUND errors (0-3 errors per day).

Uptime Kuma is running in k8s. The upstream DNS is k8s's coredns, and coredns doesn't have any error logs. I used while true; do nslookup example.com && sleep 1; done to test DNS resolution and saw no errors.

The error occurs randomly and I can't reproduce it. Is there any way to get more details about this error?

👟 Reproduction steps

Can't reproduce.

👀 Expected behavior

No getaddrinfo ENOTFOUND errors.

😓 Actual Behavior

getaddrinfo ENOTFOUND

🐻 Uptime-Kuma Version

1.23.11

💻 Operating System and Arch

k8s

🌐 Browser

125.0.6422.112 (Official Build) Arch Linux (64-bit)

🖥️ Deployment Environment

Runtime: k8s v1.18.1
Database: sqlite
Filesystem used to store the database on: local storage via hostpath
number of monitors: 52

📝 Relevant log output

Failing: getaddrinfo ENOTFOUND

CommanderStorm commented 3 months ago

Same steps as in https://github.com/louislam/uptime-kuma/issues/4765

getaddrinfo ENOTFOUND test.xyz

What is the TTL of the domains you are using?
Do you have DNS caching enabled in the settings?

Most commonly, this issue is caused by a DNS resolver that does not like the volume of DNS requests it is getting. => your DNS server is dropping SOME requests => have you enabled NSCD in the settings to lower the number of DNS requests to one per TTL (instead of one on every request)?

sudoexec commented 3 months ago

Same steps as in #4765

getaddrinfo ENOTFOUND test.xyz

What is the TTL of the domains you are using?

Do you have DNS caching enabled in the settings?

Most commonly, this issue is caused by a DNS resolver that does not like the volume of DNS requests it is getting. => your DNS server is dropping SOME requests => have you enabled NSCD in the settings to lower the number of DNS requests to one per TTL (instead of one on every request)?

TTL is 600
DNS caching is enabled

CommanderStorm commented 3 months ago

I have no clue what could be causing this.

Let's rule out the stupid causes first:

could you look in the log if NSCD has been successfully started? (possible cause: using a custom UUID/GUID)
have you verified that the TTL is actually 600?
coredns don't have any error logs

Just to make sure: you have activated https://coredns.io/plugins/errors/ and/or https://coredns.io/plugins/log/? What are the logs?

sudoexec commented 3 months ago

could you look in the log if NSCD has been successfully started? (possible cause: using a custom UUID/GUID)

ps aux shows that NSCD is running

have you verified that the TTL is actually 600?

I'm sure TTL is 600

coredns don't have any error logs

Just to make sure: you have activated https://coredns.io/plugins/errors/ and/or https://coredns.io/plugins/log/? What are the logs?

I have the errors plugin enabled but not the log plugin. I'll enable the log plugin to get more details.

thielj commented 3 months ago

@sudoexec Alpine or other musl based Linux? Can you post a copy of your host's and the running container's /etc/resolv.conf?

I have seen similar issues in the past, including with Kubernetes, usually involving multiple DNS servers or related to search domains. The musl resolver would send out multiple parallel queries and ignore all replies but the first one. If that response was an error, this is what you would get. If the "good" lookup would usually win the race, you wouldn't see this error often.

Also, a regular nslookup or dig (or the DNS monitors in Kuma) do name service lookups differently than for example curl or http requests in Node which use the resolver (getaddrinfo) provided by the C library. Just had a quick google and these might give some background:

https://jvns.ca/blog/2022/02/23/getaddrinfo-is-kind-of-weird/
https://medium.com/@hsahu24/understanding-dns-resolution-and-resolv-conf-d17d1d64471c
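To test the same code path that fails, rather than the one nslookup uses, a probe can call getaddrinfo directly. The following is a minimal sketch, assuming Python is available in a container with the same C library and /etc/resolv.conf as the Uptime Kuma pod. Python's socket.getaddrinfo goes through the C library resolver; the probe and watch helpers are hypothetical names, not part of Uptime Kuma.

```python
import socket
import time
from datetime import datetime, timezone


def probe(host, resolver=socket.getaddrinfo):
    """Resolve host once via getaddrinfo; return (ok, detail, elapsed_seconds)."""
    start = time.monotonic()
    try:
        infos = resolver(host, None)
        # Each entry is (family, type, proto, canonname, sockaddr);
        # sockaddr[0] is the resolved address.
        detail = sorted({info[4][0] for info in infos})
        ok = True
    except socket.gaierror as exc:
        # gaierror carries the resolver's error code and message.
        detail = f"{exc.errno}: {exc.strerror}"
        ok = False
    return ok, detail, time.monotonic() - start


def watch(host, interval=1.0, count=None):
    """Probe repeatedly and print only the failures, with a UTC timestamp."""
    done = 0
    while count is None or done < count:
        ok, detail, elapsed = probe(host)
        if not ok:
            stamp = datetime.now(timezone.utc).isoformat()
            print(f"{stamp} FAIL {host} after {elapsed:.3f}s ({detail})")
        time.sleep(interval)
        done += 1
```

Calling watch("example.com") and leaving it running for a day should give timestamps that can be compared against the coredns logs. The elapsed time may also hint at whether a failure was a fast negative answer or a timeout.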

(this is just a personal opinion, but I wouldn't touch nscd with a barge pole)

sudoexec commented 3 months ago

@thielj The host machine is Ubuntu 18.04. Here are the resolv.conf files:

# Host
nameserver 119.29.29.29

# Container
nameserver 10.96.0.10                 # k8s coredns
search namespace.svc.cluster.local svc.cluster.local cluster.local
options ndots:5

Thanks for the info you provided, I've learned more about DNS internals from it.

Additionally, I've added another nameserver to the Uptime Kuma pod, and there have been no errors in the past 2 days.

thielj commented 3 months ago

If you get more getaddrinfo-related errors: those resolv.conf settings and the internal DNS they lead to are the rabbit hole you need to dig into, all the way from the container/pod down your stack.

https://coredns.io/2017/06/08/how-queries-are-processed-in-coredns/
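As an illustration of why those settings matter, here is a rough sketch of how a search list combined with ndots:5 can multiply the lookups behind a single hostname. It models the ordering documented for glibc-style resolvers; musl and other implementations may differ, and candidate_names is a hypothetical helper written only for this example.

```python
def candidate_names(name, search, ndots=1):
    """Return the query names a glibc-style resolver would try, in order."""
    if name.endswith("."):
        # A trailing dot marks the name as fully qualified: no search list.
        return [name]
    expanded = [f"{name}.{domain}." for domain in search]
    absolute = name + "."
    if name.count(".") >= ndots:
        # Enough dots: try the name as-is first, then the search domains.
        return [absolute] + expanded
    # Too few dots: walk the search list first, then try the name as-is.
    return expanded + [absolute]


# Values taken from the container's resolv.conf posted above.
search = ["namespace.svc.cluster.local", "svc.cluster.local", "cluster.local"]
for candidate in candidate_names("example.com", search, ndots=5):
    print(candidate)
```

Under this model, example.com has only one dot, so the three search-domain variants are tried before example.com itself. Each of those is a separate query that coredns has to answer (typically once for A and once for AAAA), and a slow or failed answer on any of them is a candidate cause for an intermittent error.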

CommanderStorm commented 3 months ago

We should likely document this here https://github.com/louislam/uptime-kuma/wiki/Troubleshooting

What is your second nameserver? (How did you find its IP? Do you have multiple coredns instances running?)

(Not a kubernetes/dns wizard 😅)

sudoexec commented 3 months ago

@thielj Thanks again for your help. I'll try it

@CommanderStorm

Additionally, I've added another nameserver to the Uptime Kuma pod, and there have been no errors in the past 2 days.

In fact, the "other nameserver" is 1.1.1.1. I added it in case the problem is caused by coredns.

thielj commented 3 months ago

@sudoexec This probably doesn't do what you expect, and if it does, you're relying on specific implementation behaviour of POSIX getaddrinfo. There are at least four different major implementations, and most of them can be further configured, see nsswitch.conf for an example.

The two most common, and their default behaviour with regard to the DNS resolver, are:

glibc, which will query the first server, and if it replies saying that it can't resolve your name, that's the final result. Only if the first server doesn't reply at all within the timeout, glibc would move on. For the purpose of monitoring, this can effectively mask problems in your Kubernetes DNS setup. Unless you monitor to show off "all green" to your boss or a client, it's probably not what you want.
musl, which will query both servers in parallel, and the first to reply wins. If 1.1.1.1 is faster than coredns and says the name is unresolvable, then that's the final result. This usually ends in your internal DNS winning the race 99.99% of the time. Instead of logging that your coredns is sometimes slow, you will log lookup failures (without knowing that they actually came from 1.1.1.1).

So: If you specify more than one server in resolv.conf, BOTH should be able to resolve ALL your hosts. If you want to implement fallbacks, query routing and such, configure a coredns or dnsmasq instance appropriately and point your resolv.conf to that. If you still want two DNS entries in your resolv.conf, configure two identically redundant instances.
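The difference between the two strategies can be sketched with a toy model. This is a deliberate simplification of the behaviour described above, not the actual glibc or musl code: it ignores retries, timeouts per attempt, and how specific error codes are treated. The Server class, both functions, and the 10.0.0.7 address are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Server:
    name: str
    latency_ms: Optional[float]  # None means no reply before the timeout
    answer: Optional[str]        # an IP address, or None for "name not found"


def sequential_lookup(servers):
    """glibc-style as described above: the first server that replies decides."""
    for server in servers:
        if server.latency_ms is not None:
            return server.name, server.answer
    return None, None  # nobody replied


def parallel_lookup(servers):
    """musl-style as described above: all are queried, the fastest reply decides."""
    replied = [s for s in servers if s.latency_ms is not None]
    if not replied:
        return None, None
    winner = min(replied, key=lambda s: s.latency_ms)
    return winner.name, winner.answer


# A name only the internal DNS knows; 1.1.1.1 answers "not found" in 10 ms.
public = Server("1.1.1.1", 10.0, None)
for coredns_latency in (2.0, 50.0):
    coredns = Server("coredns", coredns_latency, "10.0.0.7")
    servers = [coredns, public]
    print(coredns_latency, sequential_lookup(servers), parallel_lookup(servers))
```

In this model, a fast coredns gives the same result under both strategies. Once coredns is slower than 1.1.1.1, the parallel strategy returns the "not found" answer from 1.1.1.1, while the sequential strategy still waits for coredns. If coredns does not reply at all, both end up with the answer from 1.1.1.1.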

Also, if you run frequent probes, you will eventually see failures. That's pretty normal. With a 99.99% reliability, a < 0.01% failure rate would be acceptable. Configure your probes to allow for one retry maybe?
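For a sense of scale, the numbers reported in this issue can be plugged into a back-of-the-envelope calculation. The 60-second check interval is an assumption (the thread does not state it), as is the idea that every check triggers one lookup; DNS caching would lower the real lookup count. The retry estimate also assumes failures are independent, which is optimistic because DNS failures tend to come in bursts.

```python
def daily_checks(monitors, interval_s):
    """Number of checks per day for a given monitor count and interval."""
    return monitors * (86_400 // interval_s)


def failure_rate(errors, checks):
    """Fraction of checks that failed."""
    return errors / checks


def rate_with_retries(p, retries):
    """Chance that the first attempt and every retry fail (independence assumed)."""
    return p ** (retries + 1)


# 52 monitors and at most 3 errors per day are from the report above;
# the 60-second interval is an assumed value.
checks = daily_checks(monitors=52, interval_s=60)  # 52 * 1440 = 74880
rate = failure_rate(errors=3, checks=checks)
print(f"{checks} checks/day, failure rate {rate:.4%}")
print(f"with one retry: {rate_with_retries(rate, 1):.2e}")
```

Under these assumptions, 3 errors per day works out to roughly 0.004% of checks, which is below the 0.01% threshold mentioned above, and a single retry would make a reported failure far rarer still.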

Alpine/Musl

skrue commented 2 months ago

I started seeing this behavior after setting up AdGuard Home. In my previous setup I only had Unbound DNS running on my OPNsense router/firewall. Now, AdGuard will relay all requests that it doesn't decide to block to Unbound, so AdGuard is the primary DNS. My entire home network is whitelisted in AdGuard as is the Uptime Kuma IP, so no blocking should be happening there. I am running Uptime Kuma as an LXC container on my Proxmox host. getaddrinfo ENOTFOUND errors pop up roughly once a day for each monitor that I have configured. I have now increased the retry value from 0 to 2, let's see if that helps.

sudoexec commented 2 months ago

A few weeks ago, I changed my upstream DNS (which was provided by the cloud service and managed by systemd-resolved) to two other public DNS servers. There have been no getaddrinfo ENOTFOUND errors since.
