louislam / uptime-kuma

A fancy self-hosted monitoring tool
https://uptime.kuma.pet
MIT License
57.47k stars · 5.19k forks

[dns] query A fails #860

Closed: arch1v1st closed this issue 7 months ago

arch1v1st commented 2 years ago

👟 Reproduction steps

Set up a DNS monitor using the default Cloudflare resolver, 1.1.1.1.

👍 Expected behavior

The monitor shouldn't regularly trigger as DOWN when the domain's DNS is resolving just fine.

To better diagnose the underlying problem I set up a nearly identical UK DNS monitor using Google DNS (8.8.8.8/8.8.4.4), and no UK incidents have been seen since! An added bonus: Google DNS seems to support 'ANY/ALL' DNS queries whereas Cloudflare does not, meaning we have a way to gather most of the DNS record types for a domain.
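For anyone who wants to check the ANY behavior themselves, Node exposes it as resolveAny() on a Resolver. A minimal sketch (domain.com here is just the placeholder name from this report; Cloudflare intentionally answers ANY queries minimally per RFC 8482, which is likely why it appears "unsupported"):

```js
// any-query.js -- compare ANY query support between resolvers.
const { Resolver } = require("dns").promises;

const resolver = new Resolver();
resolver.setServers(["8.8.8.8"]); // swap in "1.1.1.1" to compare

resolver
    .resolveAny("domain.com") // placeholder domain from this report
    .then((records) => console.table(records))
    .catch((err) => console.error("ANY query failed:", err.code));
```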

👎 Actual Behavior

UK frequently detects the domain's DNS A record as DOWN with the message:

queryA ESERVFAIL domain.com

We have many A record DNS monitors in place for multiple domain names and have experienced this across all of them.
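For reference, the failing check boils down to an A query sent to the configured resolver rather than the OS default. A minimal reproduction sketch outside Uptime Kuma (domain.com and the resolver IP are placeholders; this approximates, rather than reproduces, the monitor's internal code):

```js
// repro-query-a.js -- roughly what the DNS monitor does under the hood:
// an A-record query against a specific resolver, not the OS default.
const { Resolver } = require("dns").promises;

const resolver = new Resolver();
resolver.setServers(["1.1.1.1"]); // the resolver under suspicion

resolver
    .resolve4("domain.com") // placeholder domain from the logs above
    .then((records) => console.log("Up - Records:", records.join(", ")))
    .catch((err) => console.error("Down -", err.code, err.hostname)); // e.g. ESERVFAIL
```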

🐻 Uptime-Kuma version

1.9.1

💻 Operating System

Ubuntu 20.04

🌐 Browser

Any

🐋 Docker

N/A

🏷️ Docker Image Tag

N/A

🟩 NodeJS Version

14.8.1

📝 Relevant log output

Up  2021-10-31 01:16:24 Records: 123.123.123.123
Down    2021-10-31 01:15:01 queryA ESERVFAIL domain.com
Up  2021-10-30 19:24:56 Records: 123.123.123.123
Down    2021-10-30 19:23:32 queryA ESERVFAIL domain.com
Up  2021-10-30 15:42:27 Records: 123.123.123.123
Down    2021-10-30 15:41:04 queryA ESERVFAIL domain.com
Up  2021-10-30 12:49:59 Records: 123.123.123.123
Down    2021-10-30 12:48:35 queryA ESERVFAIL domain.com

⚠️ Please verify that this bug has NOT been raised before.

🛡️ Security Policy

louislam commented 2 years ago

I cannot reproduce with 1.1.1.1

using Google DNS (8.8.8.8/8.8.4.4), and no UK incidents have been seen since!

Sounds like it is a network issue between you and 1.1.1.1.

arch1v1st commented 2 years ago

@louislam - I appreciate your looking at this further so quickly. I also found it strange that one of the world's largest DNS providers (Cloudflare) had this sort of recurring issue (a simple A record lookup!), and I'm still scratching my head as to why switching the UK dns_resolver setting to Google DNS had such a positive impact. I had both running as their own UK monitors every minute for days, and was getting random yet daily DOWN notifications only for the Cloudflare-based monitors. One finer detail: I am running UK on a medium-sized AWS EC2 instance - maybe the fact that it's on Amazon plays a role here.

ALL - if you have experienced similar issues, please chime in here!

chakflying commented 2 years ago

If you are running a large number of DNS monitors, did you test what happens if you switch all of them to 8.8.8.8? In theory dns.resolve() should not be overloaded so easily because it's async, but there might be something in the networking stack that's reusing the connection, or maybe it's Cloudflare that's implementing a rate limit.
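One way to probe the rate-limit theory (a rough sketch only; the burst size and hostname are arbitrary) is to fire a batch of parallel queries at each resolver and tally failures:

```js
// burst-probe.js -- fire a burst of parallel A queries at one resolver
// and count failures, to see whether 1.1.1.1 throttles us but 8.8.8.8 doesn't.
const { Resolver } = require("dns").promises;

async function probe(server, hostname, count = 50) {
    const resolver = new Resolver();
    resolver.setServers([server]);
    // Start all queries at once; allSettled collects successes and failures.
    const results = await Promise.allSettled(
        Array.from({ length: count }, () => resolver.resolve4(hostname))
    );
    const failed = results.filter((r) => r.status === "rejected");
    console.log(`${server}: ${failed.length}/${count} failed`,
        failed[0] ? `(first error: ${failed[0].reason.code})` : "");
}

(async () => {
    await probe("1.1.1.1", "domain.com"); // placeholder domain
    await probe("8.8.8.8", "domain.com");
})();
```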

arch1v1st commented 2 years ago

@chakflying - I'm running only a handful of DNS monitors overall, and all have been reconfigured to use 8.8.8.8. I started to notice the resolution problems with only 2 at the time against 1.1.1.1.

kingforaday commented 2 years ago

I have experienced the same issue. I also thought it was something with 1.1.1.1, so I switched all my DNS monitors (2 of them) to 8.8.8.8 as the resolver. The problem went away.

I'm not discounting potential networking issues; the Uptime Kuma server is hosted on a dedicated machine in DigitalOcean.

[screenshot]

SteveD70 commented 2 years ago

Same issue. I checked, and all my DNS servers are live.

christopherpickering commented 2 years ago

I started getting this after release 1.18. The only change in the monitor code was the DNS cache. I'm using an internal DNS server with a ton of monitors, but only three specific monitors for Apache Solr are failing. Other sites monitored on the same server resolve properly.

I wonder if it's because of the port or something? The failing URLs are like http://server:8983/solr/ and the passing URLs are like http://dns-on-same-server.

I tried adding the server name to the hosts file, but no luck.

Any other ideas?

https://github.com/louislam/uptime-kuma/commit/2073f0c28476bb46fb953ecefb9622273e8819d9

christopherpickering commented 2 years ago

@louislam for my special case, I changed from the server name to the IP address and it works. I suppose it's because my server name is not the A record in the DNS. I wonder if something changed in Node's dns resolve function to make this happen, because the changes in v18 do not seem to be related to how the DNS is resolved.

louislam commented 2 years ago

@louislam for my special case, I changed from the server name to the IP address and it works. I suppose it's because my server name is not the A record in the DNS. I wonder if something changed in Node's dns resolve function to make this happen, because the changes in v18 do not seem to be related to how the DNS is resolved.

You mean Node.js v18 or Uptime Kuma 1.18.0?

My custom DNS with port is working fine. I may need more info. [screenshot]

christopherpickering commented 2 years ago

Yeah, it's odd. I tried it on my server and it doesn't work, but from my laptop, no problem. I tried adding the server name to the Uptime Kuma server's hosts file, but still no luck. I was referring to Kuma 1.18, but I don't see how any changes in Kuma would have changed my server lookup... and only for the one server. I reference TCP pings by server name and they all work. Maybe it's a fluke.

christopherpickering commented 2 years ago

I had a few other monitors like this one that started failing with queryA ESERVFAIL after the server rebooted. I left them, and after a day the failures went away. There must be some other cache/matching happening elsewhere that causes it for me... I did reset the server's DNS cache (which is also probably what happened when the server rebooted).

ljurk commented 2 years ago

I have the same issue. Starting with Kuma version 1.18, I get queryA ESERVFAIL for all hostnames that aren't on public DNS servers but only on our own Windows DNS server. I tried the 1.17 image, and in that version it works; Kuma can resolve all hostnames. The problem started a week ago and never healed itself.

christopherpickering commented 2 years ago

Do you have a mix of public/non-public sites? I wonder if it's because the cached lookup key is based on the options (maxCachedSessions: 0) and could maybe be based on something more unique to the monitor? From the new cache code, it looks like the agent is now shared among all the monitors, whereas before it was unique to a monitor. Maybe the monitor ID could be added to the cache key?

Here's the code that changed in the last release. Not much changed, but I'm wondering if it's because the agent is now shared whereas before it was not? I'm not a subject matter expert though. https://github.com/louislam/uptime-kuma/commit/2073f0c28476bb46fb953ecefb9622273e8819d9

What do you think @louislam ?

ljurk commented 2 years ago

Yes, I have a mix of public and non-public sites. Public sites worked all the time; non-public ones didn't work in 1.18. But I just tested another thing in 1.18 with non-public hosts: I added the Windows domain name to the URL and now Kuma can resolve the hostname. So http://web1 is not working, but http://web1.mydomain.example.com is working. In my case it's enough knowing this; I don't need to resolve the hostname without the domain, and I'm OK with adding the domain to all my hosts.
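If it helps debugging: one plausible explanation (I haven't traced cacheable-lookup's internals, so treat this as a hypothesis) is that Node's dns.lookup() goes through getaddrinfo and therefore honors /etc/hosts and the resolv.conf search domains, while the dns.resolve*() family sends the literal name straight to the DNS server. If the 1.18 cache sits on the resolve path, http://web1 would only work once a search domain expands it to the FQDN - which would also explain why editing the hosts file didn't help earlier in the thread. A quick way to see the two paths diverge ("web1" is a placeholder for an unqualified internal host name):

```js
// lookup-vs-resolve.js -- compare the two resolution paths for a short name.
const dns = require("dns").promises;

(async () => {
    // dns.lookup() uses getaddrinfo: /etc/hosts and search domains apply,
    // so "web1" may expand to web1.mydomain.example.com and succeed.
    try {
        const { address } = await dns.lookup("web1");
        console.log("lookup:", address);
    } catch (err) {
        console.error("lookup failed:", err.code);
    }

    // dns.resolve4() sends the literal name "web1" to the DNS server,
    // ignoring /etc/hosts and search domains -- this is the path that
    // would return ESERVFAIL/ENOTFOUND for an unqualified name.
    try {
        console.log("resolve4:", await dns.resolve4("web1"));
    } catch (err) {
        console.error("resolve4 failed:", err.code);
    }
})();
```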

louislam commented 2 years ago

Do you have a mix of public/non-public sites? I wonder if it's because the cached lookup key is based on the options (maxCachedSessions: 0) and could maybe be based on something more unique to the monitor? From the new cache code, it looks like the agent is now shared among all the monitors, whereas before it was unique to a monitor. Maybe the monitor ID could be added to the cache key?

Here's the code that changed in the last release. Not much changed, but I'm wondering if it's because the agent is now shared whereas before it was not? I'm not a subject matter expert though. 2073f0c

What do you think @louislam ?

I added cacheable-lookup into Uptime Kuma, so it will cache DNS records.

Windows-DNS-server

@ljurk Do you mean the DNS Server role that can be installed on Windows Server? I may need proper steps in order to reproduce the issue.

christopherpickering commented 2 years ago

I'm just wondering if the problem with the short names is that the cached DNS record is shared across every monitor using the same connection options? Should that key be more complex (include the ID of the monitor, for example)?

[screenshot]

louislam commented 2 years ago

I'm just wondering if the problem with the short names is that the cached DNS record is shared across every monitor using the same connection options? Should that key be more complex (include the ID of the monitor, for example)?

[screenshot]

I don't think so, because under the same agent options the HTTP agent is reusable. An HTTP agent is not tied to only one domain.

You can see the example at https://github.com/szmarczak/cacheable-lookup#attaching-cacheablelookup-to-an-agent

And so far I have not received a large number of similar bug reports, so I assume it is a very specific issue; as @ljurk said, he is using a Windows DNS server.
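For reference, the attach pattern from that README looks roughly like this (adapted from the cacheable-lookup docs; the exact API can differ between versions, and newer releases are ESM-only):

```js
// agent-with-cacheable-lookup.js -- the shared-agent pattern from the
// cacheable-lookup README: one agent, DNS answers cached per hostname.
const http = require("http");
const CacheableLookup = require("cacheable-lookup");

const cacheable = new CacheableLookup();
const agent = new http.Agent({ keepAlive: true });
cacheable.install(agent); // requests through this agent now use the DNS cache

http.get("http://example.com", { agent }, (res) => {
    console.log("status:", res.statusCode);
});
```

As far as I can tell, the cache inside CacheableLookup is keyed by hostname, so sharing one agent across monitors for different hosts should not, by itself, cross-contaminate lookups.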

ljurk commented 2 years ago

@louislam Yeah, I'm inside a Windows domain. The domain controller is used for DNS and runs Windows Server. My Docker host is running Ubuntu; it gets the DNS IP via DHCP, and I didn't change any DNS-related settings.

dnldpavlik commented 1 year ago

I have a similar issue, if not the same one. My current setup has a Pi-hole operating as a DNS server where I have defined DNS entries; my Raspberry Pi has its DNS configured to go to the Pi for all DNS inquiries. This works fine in all cases to resolve a locally defined address.

I can ping the address in the Uptime Kuma container and it resolves fine, but when using the name in Uptime Kuma, it gives me a "queryAaaa ESERVFAIL" error. When I use the static IP address instead, the status monitoring of my HTTP site works fine.

kevin7s-io commented 1 year ago

I'm seeing the same behavior.

Monitors resolving against public DNS (Cloudflare) are working fine. Monitors resolving against a private zone in my local network are all failing. The name is resolvable from inside the container via ping (using the same DNS server as configured in the monitor). Pi-hole is being used internally.

dnldpavlik commented 1 year ago

I have noticed something: when my internal names have an even number of labels, e.g. aaa.bbb.ccc.ddd or aaa.bbb, they do not resolve, but when the URI has an odd number of labels, e.g. aaa.bbb.ccc.ddd.eee or aaa.bbb.ccc, it tends to resolve. This is not 100% consistent, but it has helped me get some items registered by DNS entry instead of IP, which I prefer.

louislam commented 1 year ago

Cacheable-lookup is not working properly in some cases. With 1.19.x, the DNS cache can now be disabled in Settings.

PacmanForever commented 1 year ago

I'm seeing the same behavior.

Monitors resolving against public DNS (Cloudflare) are working fine. Monitors resolving against a private zone in my local network are all failing. The name is resolvable from inside the container via ping (using the same DNS server as configured in the monitor). Pi-hole is being used internally.

Hi, I see that my Uptime Kuma can't monitor my Pi-hole. Both are containers on the same machine. I guess the same thing happens to me as to you, but I don't understand what you did to solve it. Can you explain it to me? Thank you.

louislam commented 1 year ago

I'm seeing the same behavior. Monitors resolving against public DNS (Cloudflare) are working fine. Monitors resolving against a private zone in my local network are all failing. The name is resolvable from inside the container via ping (using the same DNS server as configured in the monitor). Pi-hole is being used internally.

Hi, I see that my Uptime Kuma can't monitor my Pi-hole. Both are containers on the same machine. I guess the same thing happens to me as to you, but I don't understand what you did to solve it. Can you explain it to me? Thank you.

Are they in the same Docker network?

dnldpavlik commented 1 year ago

For me they are on different machines, and I am using Pi-hole for DNS. In order to get it to resolve, I originally had ui.app.domain.com; this didn't resolve for Uptime Kuma, but when I changed the name to app.domain.com it worked. logs.api.app.domain.com would also work because it has five parts: ["logs","api","app","domain","com"].

PacmanForever commented 1 year ago

I'm seeing the same behavior. Monitors resolving against public DNS (Cloudflare) are working fine. Monitors resolving against a private zone in my local network are all failing. The name is resolvable from inside the container via ping (using the same DNS server as configured in the monitor). Pi-hole is being used internally.

Hi, I see that my Uptime Kuma can't monitor my Pi-hole. Both are containers on the same machine. I guess the same thing happens to me as to you, but I don't understand what you did to solve it. Can you explain it to me? Thank you.

Are they in the same Docker network?

No. Pi-hole: 172.18.0.7 (Docker IP); Uptime Kuma: 172.16.0.4 (Docker IP).

But I use the IP of the host (192.168.1.2), in the same way that monitors for other containers use the same host IP for pings, etc.

PacmanForever commented 1 year ago

For me they are on different machines, and I am using Pi-hole for DNS. In order to get it to resolve, I originally had ui.app.domain.com; this didn't resolve for Uptime Kuma, but when I changed the name to app.domain.com it worked. logs.api.app.domain.com would also work because it has five parts: ["logs","api","app","domain","com"].

I only use IP addresses.

burnthoney commented 1 year ago

I seem to have the same problem. They're also on the same Docker network (I also tried separating them; same problem). This is my configuration (I have kept the 20s interval for the sake of testing): [screenshot]


CommanderStorm commented 7 months ago

What we need is a CURRENT, publicly accessible (= reproducible) test case. The first part of this issue was resolved when we switched from cacheable-lookup to NSCD (Name Service Cache Daemon).

[screenshot]

The other comments are likely unrelated to the first one. I think continuing in smaller, less messy issues (= issues which allow reproduction) is more productive than piling onto a resolved issue => closing as resolved.