home-assistant / plugin-dns

CoreDNS implementation for Home Assistant
Apache License 2.0

CoreDNS is burning CPU and spamming my network with DNS requests to 1.1.1.1 #90

Closed: j9brown closed this issue 5 months ago

j9brown commented 2 years ago

Sorry I don't know what logs to capture to help pinpoint the issue more precisely, so please let me know if you'd like me to capture appropriate diagnostics.

I noticed that my Home Assistant OS running in a VM on an Intel NUC was using a lot more CPU than I remembered. After logging in via SSH, I saw that the coredns process was using anywhere from 40-60% of the virtual CPU and had burned over 900 hours of processing time since its last restart!

As it happens, my router has been having connectivity issues lately so on a hunch I ran a tcpdump. There I found a constant endless stream of DNS requests to Cloudflare's 1.1.1.1 DNS service originating from the server that hosts my Home Assistant instance.

15:53:55.887610 IP XXX.lan.50080 > one.one.one.one.853: Flags [S], seq 4108605629, win 64240, options [mss 1460,sackOK,TS val 1419242137 ecr 0,nop,wscale 7], length 0
15:53:55.887936 IP one.one.one.one.853 > XXX.lan.50080: Flags [R.], seq 0, ack 4108605630, win 0, length 0

My router is configured to hijack DNS requests from the local network for privacy reasons. When I disable that feature, the spam stops and the excess CPU usage vanishes. So Home Assistant seems not to like the responses that it is receiving and it just keeps trying again and again, flooding the network.

Why is Home Assistant attempting to bypass my locally configured DNS server? And why is it retrying so aggressively?

Home Assistant Version: core-2022.5.0, Supervisor Version: supervisor-2022.05.0

j9brown commented 2 years ago

Ok, I see in other recently closed issues that the requests to Cloudflare's DNS service are intended behavior to implement DNS over TLS. As it happens, my router is already configured to take care of that for all clients on the local network (hence the DNS hijacking).

It's not clear to me whether the bug that causes the network to be flooded with requests has been fixed though. That was getting close to DoSing my router's DNS server from the inside.

mdegat01 commented 2 years ago

So I actually don't know and it's bothering me. It shouldn't be doing this and I've made a note to look into it. It almost seems like blocking the fallback DNS server at the network level puts it in some kind of infinite loop, and I need to find out why.

That being said, there's a simple solution now. You can disable the fallback DNS feature by doing this:

ha dns options --fallback=false

This is better than leaving the feature enabled but blocked.

But yea at some point I will figure out why this happens and put a stop to it.

j9brown commented 2 years ago

I'm not familiar with the code, but the overall control flow looks like it could become unintentionally re-entrant if the chain of handlers was set up incorrectly somehow: https://github.com/home-assistant/plugin-dns/blob/master/plugins/fallback/fallback.go#L62

mdegat01 commented 2 years ago

Yea, I was thinking that too. But what's weird is it's set to fall back on all of these codes: https://github.com/home-assistant/plugin-dns/blob/0b27c4bae5cbbac9382f1d01964d8b471959ea6e/rootfs/usr/share/tempio/corefile#L24

So I would think that, if it were that, it would also have this issue if you requested some non-existent name and got an NXDOMAIN response. I can't figure out what would be special about SERVFAIL.

But yea, I do think it might be that even if I can't explain it right now.

j9brown commented 2 years ago

Hmm. I suppose the loop could be further upstream in whatever code is triggering resolution in the first place. Ahh well, good luck!

vistalba commented 2 years ago

Just ran into a weird state where my firewall's disk was full and it stopped working. My FW drops/blocks DNS/DoT to the Internet, as the local resolver takes care of this completely. I saw that the log file grew to more than 50GB over 2 days! While analysing the log file I can see that an absolutely HUGE amount (more than 20 requests/second) comes from HA, which is trying to reach 1.1.1.1 and 1.0.0.1 on port 853.

Can we please have an option to disable this behavior? In my case it is totally unneeded and unwanted.

Edit: Just to give an idea of what I mean: 1 second creates 4463 log entries, which use 1.1MB of disk space! So when we do some math:

1h = ~3900MB, 24h = ~92GB

Just for this kind of traffic! Even if I would prefer to disable this, I could live with a normal amount of requests... but this behavior is just spamming the network for absolutely nothing. So a rate limit of 1 request/s would be the minimum, at least (my point of view).

j9brown commented 2 years ago

There is already a way to disable the feature. See @mdegat01's comment above: https://github.com/home-assistant/plugin-dns/issues/90#issuecomment-1119139222

This issue is about fixing the looping bug.

d0nni3q84 commented 2 years ago

@mdegat01, as mentioned here a couple weeks ago, I've been able to root-cause the runaway DNS queries: they are the result of what I call an infinite-looping configuration in /etc/corefile. The configuration lines contributing to this issue are loop, fallback REFUSED,NXDOMAIN, and max_fails 0 (a sketch of the relevant Corefile lines follows the list below). Here's what happens:

  1. CoreDNS starts and executes the loop plugin, which sends a query for <random number>.<random number>.zone that gets forwarded externally and the Root Servers respond with NXDOMAIN.

  2. CoreDNS triggers the fallback plugin due to the NXDOMAIN response.

  3. CoreDNS now sends all queries, including health_check (NS IN .), to Cloudflare over TLS.

  4. User's firewall blocks DNS over TLS (TCP 853).

  5. CoreDNS triggers the fallback plugin due to the REFUSED response.

  6. Since max_fails 0 is set, CoreDNS assumes Cloudflare is always healthy.

  7. CoreDNS is now in an infinite loop continuously sending and retrying its health_check query. :nauseated_face:
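
For reference, here is a minimal sketch of the configuration shape being described. The real file is the linked corefile under rootfs/usr/share/tempio/corefile; the upstream address and the exact rcode list here are illustrative:

.:53 {
    loop                                      # sends the startup probe from step 1
    forward . dns://192.168.1.1 {             # local/DHCP-provided server (address illustrative)
        policy sequential
    }
    fallback REFUSED,NXDOMAIN . dns://127.0.0.1:5553
}

.:5553 {
    forward . tls://1.1.1.1 tls://1.0.0.1 {
        tls_servername cloudflare-dns.com
        max_fails 0                           # step 6: servers are never marked unhealthy
    }
}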


Initial thoughts for this PR to fix this are:

  1. HINFO query from the loop plugin should not trigger fallback.
  2. Allow users to specify whether or not to use DNS over TLS when configuring the plugin-dns container.
  3. Don't assume Cloudflare will always be available.
  4. [Unrelated]: Remove policy sequential so as not to overload a single user's DNS server.

Although, before reworking this configuration: is the rationale for plugin-dns existing in the first place to provide continuous access to well-functioning and available DNS servers?

CircuitGuy commented 2 years ago

@mdegat01, as mentioned here I've been able to root-cause the runaway DNS queries: they are the result of what I call an infinite-looping configuration

Thanks for that awesome troubleshooting! One strange thing: I NAT-redirect DNS over TLS to my firewall rather than just plain blocking it. I thought a misconfiguration had triggered this, but no - the redirect is working. I'm able to intercept 1.1.1.1:853 traffic from other applications. So it should be getting a valid response. I'm assuming HA is (technically correctly) rejecting the cert from my firewall.

There's no "problem" with HA using DNS over TLS as the default, IMO (per your suggestion 2), but it should accept the network's authoritative server.

mschabhuettl commented 2 years ago

I discovered the same issue. As I force every network client to use my personal DNS server (Pi-hole running unbound) and block every other DNS port on my router, I noticed that plugin-dns spams my network with Cloudflare requests (resulting in high CPU/network usage). For now, I'm using https://github.com/bentasker/HomeAssistantAddons/tree/master/core-dns-override to manually override DNS settings.

KevinCathcart commented 1 year ago

From @d0nni3q84:

Although, before reworking this configuration: is the rationale for plugin-dns existing in the first place to provide continuous access to well-functioning and available DNS servers?

My understanding is that the plugin handles two things. It makes sure container names can be resolved by all containers, even those that use host networking, and via the fallback it works around a nasty issue where locally hosted DNS software interacts very poorly with Alpine containers (because musl acts differently from glibc), which heavily impacts Home Assistant, where basically all the containers are Alpine-based.

Resolving containers by name in all containers

The idea here is to make it possible for the Home Assistant Core container to run in the host networking namespace while still being able to resolve containers by name. The same applies to add-ons that opt into using the host networking namespace.

You see, normally Docker containers can refer to each other by name thanks to Docker running a stub DNS server on 127.0.0.11, which handles the names that Docker controls and resolves others using the host's resolver. Unfortunately, this is not available to containers that run in the host's network namespace.

This is true even if the docker network is made available to the host, which makes it possible to access those containers by IP address.

By running a separate DNS server whose IP address is known to the supervisor, all the user facing containers can be configured to use it as their DNS server, and they can all resolve these names.
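
As a sketch of that idea in Corefile terms (an assumption on my part about the exact mechanism; the hosts file path is hypothetical): the Supervisor could maintain a hosts file with container names and IPs, served by the hosts plugin, with everything else forwarded upstream.

.:53 {
    hosts /config/hosts {         # hypothetical path; supervisor-maintained name-to-IP mappings
        fallthrough               # names not in the file continue to the forward below
    }
    forward . dns://192.168.1.1   # upstream from DHCP/user config (address illustrative)
}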

The nasty issue requiring fallback

Some software that users may be using locally to handle DNS handles missing AAAA records incorrectly by returning NXDOMAIN instead of NOERROR with an empty result set.

The problem is that musl assumes that the DNS servers are standards-conformant, so when looking for A and AAAA records at once (the "happy eyeballs" algorithm), if the AAAA record comes back as NXDOMAIN, it will not wait for the A record to come back, but immediately report that the name cannot be resolved. If the DNS servers used are strictly conformant this would never be an issue, but it is with the local software some users have.

The end result is that some sites that are IPv4-only could randomly become inaccessible to Home Assistant, for no obvious reason. Having this fallback system avoids that, by asking Cloudflare if an NXDOMAIN came back, and returning a proper NOERROR instead. And if that is being done, it might as well try to handle other errors too, as some of them may be due to such software not understanding the results it got back from a recursive server.
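
In Corefile terms, the workaround amounts to something like this (a sketch; see the corefile linked earlier for the actual rcode list):

.:53 {
    forward . dns://192.168.1.1                # possibly misbehaving local resolver (illustrative)
    fallback NXDOMAIN . dns://127.0.0.1:5553   # re-ask a known-conformant server over TLS
}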

The above is basically just a summary of what was described in this forum post.

Supervisor is now able to detect if this condition is happening, so it now allows you to disable the fallback. However, if you use that setting when you have a broken local DNS server, your system will be marked as unsupported.

Originally the fallback was also available for use if all other servers got marked unhealthy, but for unclear reasons, the health checks were not working to recover the other DNS servers for some users, with undesirable consequences.


As for the rest of your analysis, there are some problems with it, although you do point out a key issue. In a follow-up post, I'll go over what is happening.

KevinCathcart commented 1 year ago

What is causing all these requests?

@d0nni3q84's analysis is close, but a little bit off. (Like claiming that all requests were getting forwarded to the fallback.) Rather than reply point by point, I'll just give a correct explanation below.

Let's look at the tail end first. What happens to an incoming request to port 5553 while Cloudflare is blocked?

Because max_fails 0 is set, it will spend 5 seconds repeatedly trying to contact both Cloudflare servers in a loop. This would not happen if max_fails were not zero. After the 5 seconds are up, it will return SERVFAIL.

So if the main chain's forward plugin ever returns an NXDOMAIN response, we will get a 5-second burst of attempted traffic to Cloudflare if it is blocked? No. It is much worse than that!

It turns out that the forward plugin being used internally by the fallback plugin will only wait two seconds to read a response before declaring the request failed and moving on to the next server, which is actually the same one again if not yet marked as failed. This will result in more than one overlapping 5-second mini DoS attack on the router firewall. But this is not the worst part.

No, the worst part is that taking longer than two seconds means the forward plugin used by the fallback plugin thinks :5553 has failed, and begins sending health-check messages every half second (since that is the default)! Each of these health-check messages will result in 5 seconds of spamming the router firewall.

You were correct that the loop plugin ensures that an NXDOMAIN lookup happens early in the process. It is also not great: since the zone is ".", the lookup is for a name like <random number>.<random number>., which obviously your preferred recursive resolver won't have cached and will need to ask the root DNS servers for every time, which means this plugin is putting unnecessary load on the root DNS servers. (Footnote 1)

So what is up with the 5 second thing?

The only backend plugin in the fallback chain is the forward plugin. The forward plugin loops through a list of configured servers (the list may be permuted by the policy setting). It will try to contact each server in the list, skipping any that are marked as unhealthy. If it gets anything it can parse as a DNS response, it will return that, and the process ends. Otherwise it continues to try more servers until it has reached a hardcoded 5-second timeout.

If it reaches the end of the list and there is still time on the 5-second timeout, it starts again at the beginning of the list. Unfortunately, because both Cloudflare servers are unreachable in this scenario, it will alternate back and forth between the two, trying over and over again until the 5 seconds have elapsed. If it is being blocked by a firewall, and thus getting an RST back, this can happen in quite rapid-fire fashion.

There is, however, a special case. If all of the servers in the list are unhealthy, it will pick one at random and try it. If that one fails, it is done, and breaks out of the loop without needing to wait for the 5 seconds to be up. Unfortunately, if max_fails is set to zero, the servers never end up getting marked unhealthy.

This means that trying to avoid health-check probes to Cloudflare by setting max_fails 0 actually made this meaningfully worse: before that, the servers would get marked as down, and only one randomly selected server would be attempted for each request.

So what should be done?

The fallback chain should be configured with a template plugin to reply to the health check without forwarding. Even just returning REFUSED or SERVFAIL would help. This would prevent requests from piling up and compounding the problem if Cloudflare is blocked. The fact that this would prevent :5553 from returning the nameservers for the root zone is not an issue, since :5553 is only used when an error has already occurred, and the other resolvers should have been able to handle that request without error. This alone would prevent the endless CPU usage, but would not prevent the 5-second mini DoS attacks on users' firewalls. The next suggestion would prevent that. (In theory it should make this one unneeded too, but defense in depth is probably better here.)

The fallback chain really needs to re-enable health checks. If they work the way they are supposed to, they won't start getting sent until the first failed request to Cloudflare, and they stop instantly after getting back something that looks like a DNS response. (Footnote 2) One request every 5 minutes indefinitely is far better than what max_fails 0 is doing with sending rapid-fire requests.
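
Concretely, the two suggestions together could look something like this in the :5553 block (a sketch; the template rcode and health-check interval are illustrative, and the template also catches other NS queries, which per the above is acceptable here):

.:5553 {
    template IN NS . {        # answer the health-check query (NS IN .) locally
        rcode SERVFAIL        # never forwarded, so a blocked Cloudflare can't compound
    }
    forward . tls://1.1.1.1 tls://1.0.0.1 {
        tls_servername cloudflare-dns.com
        health_check 5m       # re-enabled; max_fails left at its default
    }
}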

More responses to @d0nni3q84

Initial thoughts for this PR to fix this are:

  1. HINFO query from the loop plugin should not trigger fallback.

This would conflict with your next point. It would arguably be okay to do this if DNS over TLS or DNS over HTTPS is enforced (as it must be anyway, see the next response).

Ignoring that, though: with normal TCP or UDP, if the router is potentially intercepting requests being made to Cloudflare, those requests could form a loop, and should be detected. (Footnote 3) Excluding that request only makes sense in that case.

  2. Allow users to specify whether or not to use DNS over TLS when configuring the plugin-dns container.

The whole point of the fallback is that the local router cannot intercept it and do the wrong thing, so it really needs to remain using TLS (or DNS over HTTPS).

  3. Don't assume Cloudflare will always be available.

Yeah, this was a real problem.

  4. [Unrelated]: Remove policy sequential so as not to overload a single user's DNS server.

That would cause issues with letting people specify preferred DNS servers via ha dns options --servers, since those get prepended to the servers found via DHCP. Sequential allows preferentially using the user-specified ones while still falling back to the DHCP-provided ones if the user-specified ones don't respond at all, which is preferable.
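
For illustration (addresses hypothetical): with a user-specified server prepended to the DHCP-provided one, sequential gives exactly that preference order.

forward . dns://10.0.0.2 dns://192.168.1.1 {
    # 10.0.0.2 from `ha dns options --servers`, 192.168.1.1 from DHCP
    policy sequential    # always try the user's server first, DHCP's second
}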


Footnotes

Footnote 1 (click to expand) This could be addressed by having the first zone listed for the main plugin pipeline be a zone that definitely does not exist, like `looptest.home-assistant.io.` or `invalid.`. These would yield lookups for `<random number>.<random number>.looptest.home-assistant.io` or `<random number>.<random number>.invalid`. The important thing is that there be an NXDOMAIN that the recursive resolver can cache, so it does not need to ask the root servers every time. The loop plugin uses the first zone defined for the plugin chain, so you would simply replace `.:53 {...}` with `invalid .:53 {...}` to do that (the syntax allows multiple whitespace-separated zones before the port number; see the sketch after these footnotes). Obviously, being nice to the root servers is not required.
Footnote 2 (click to expand) Of course perhaps the health checks are buggy. I'm still not sure that the cause is known for why health checks did not restore local DNS as preferred back when the fallback was in the main forward plugin.
Footnote 3 (click to expand) You could have a scenario like the following: the router intercepts these requests and is configured to forward them to a PiHole-like system running in a Home Assistant addon. That addon uses the CoreDNS server as its upstream. The loop plugin would detect the obvious loop, so the user sets DNS servers with `ha dns options --servers` to some upstream DNS that is whitelisted from interception by the router. Unfortunately, there is still a loop that would happen for any non-existent domain queried, which would hit the fallback case, try to forward to Cloudflare, get intercepted and forwarded to the PiHole-like software, which asks CoreDNS again... The loop plugin would detect this, but it can only do so if fallback does not block its query.
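
A sketch of footnote 1's suggestion, with the rest of the main server block unchanged (zone name as proposed there):

invalid .:53 {
    loop    # probe becomes <random>.<random>.invalid, an NXDOMAIN the resolver can cache
    # ... remainder of the main chain as before ...
}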
CafeLungo commented 1 year ago

This really needs to be fixed. I'm brand new to setting up Home Assistant. I set up the latest VM (VMware ESXi/vSphere); my network uses DHCP and assigns a valid and working DNS server. And I'm constantly getting blocked-firewall log entries for access to 1.1.1.1:853 and 1.0.0.1:853. As far as I am concerned, there is absolutely zero reason for this VM to ever reach out to these IPs.

[image: firewall log showing blocked connections to 1.1.1.1:853 and 1.0.0.1:853]

Running ha dns options --fallback=false did fix the problem for me, though. But I should not need to run that on a new setup with a valid DNS configuration.

BTW, I did not read any of the reasoning for why this happens or what it is trying to do. As a server admin, I would never expect such aggressive querying of blocked DNS servers, especially when there is a valid DNS server configured for the host. Having to run a Google search and parse through forum posts and several closed GitHub issues before any mention of a quick workaround is not a good experience.

I do appreciate any effort put into resolving this issue, though. :-)

CircuitGuy commented 1 year ago

I set up the latest VM (VMware ESXi/vSphere); my network uses DHCP and assigns a valid and working DNS server

For anybody reproducing this, I think this is the key. I doubt most users would really notice this running on a Raspberry Pi or similar low-power hardware. I put it in a VM with a virtual 10 Gbps network connection to my firewall/router on the same VM host. Both HA and the firewall had very low latency and were snappy to respond. Under that situation, Home Assistant is aggressive enough to nearly saturate the 10 Gb connection with DNS requests and commensurately burn some serious CPU power in both HA and the firewall 24/7.

j9brown commented 1 year ago

Ok, I just hit the same issue again. I set up a fresh instance of Home Assistant, and several days later I finally noticed that it's been spamming my network with gigabytes worth of pings to 1.1.1.1 -- out over Starlink!

Please can we do something to fix this? At the very least try not to flood the network.

jhbruhn commented 1 year ago

I can confirm that running ha dns options --fallback=false does indeed not fix the issue. This has been happening since I set the DNS server distributed via DHCP to my HA instance, which runs AdGuard Home. HA itself is configured with a static IP and another local DNS server (not itself via AdGuard Home).

kode54 commented 1 year ago

Okay, this appears to be why my Home Assistant OS instance is spamming the crap out of Cloudflare DNS and using 140-200% CPU all the time.

Edit: My network is configured to use my OpenWrt router as DNS, using https-dns-proxy, dnsmasq, and an Adblock DNS list script from the OpenWrt repository. The Home Assistant VM should not be doing any special DNS setup whatsoever.

Confirmed that using ha dns options --fallback=false stops the refused connections to Cloudflare DNS, and fixes the CPU usage problem.

Edit 2: Use the damn canary domains! I configured those properly at least for Mozilla software and Apple software.

churchofnoise commented 1 year ago

Agree with @kode54 that the proper way would be to use the canary domains. First of all because that is 'best practice', secondly because it avoids confusion for those not familiar with the command to disable the fallback (which could happen if you're a first-time user, let's be honest), and thirdly because there is no point in running coredns whatsoever except in case of issues (e.g. when the canary domains don't work).

This should be much more 'opt-in' or 'in case of issues' rather than the partial 'opt-out' today.

deanfourie1 commented 1 year ago

I'm confused now.

I set ha dns options --fallback=false and set a static DNS server in my IPv4 config, yet with a capture I see not one DNS request sent to my DNS server; instead they are sent to 8.8.8.8 and my gateway?

None of these addresses are even handed out as DNS from DHCP.

How are you guys defining your static DNS servers?

churchofnoise commented 1 year ago

I have my Pi-hole address handed out from DHCP as well, and the router also acts as a firewall, catching naughty DNS requests to other addresses and redirecting them. Very much stunned that HA is not respecting DNS addresses - and ANOTHER reason to make this coredns stuff opt-in.

defanator commented 10 months ago

Just did my first installation of Home Assistant OS on an rpi3 to give it a try and test some Zigbee-enabled hardware. My Pi is burning at 70+ degrees with coredns at the top for CPU time, and I'm seeing this in the logs:

➜  log docker logs hassio_dns 2>/dev/null | tail -30
[INFO] 127.0.0.1:38539 - 28343 "NS IN . udp 17 false 512" NOERROR - 0 5.007165535s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.1.1.1:853: connect: connection refused
[INFO] 127.0.0.1:58223 - 23638 "NS IN . udp 17 false 512" NOERROR - 0 5.001772123s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.0.0.1:853: connect: connection refused
[INFO] 127.0.0.1:43302 - 63044 "NS IN . udp 17 false 512" NOERROR - 0 5.001283552s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.1.1.1:853: connect: connection refused
[INFO] 127.0.0.1:51113 - 35362 "NS IN . udp 17 false 512" NOERROR - 0 5.003702062s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.0.0.1:853: connect: connection refused
[INFO] 127.0.0.1:38262 - 56755 "NS IN . udp 17 false 512" NOERROR - 0 5.002333335s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.0.0.1:853: connect: connection refused
[INFO] 127.0.0.1:33174 - 9323 "NS IN . udp 17 false 512" NOERROR - 0 5.003397732s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.0.0.1:853: connect: connection refused
[INFO] 127.0.0.1:33094 - 32776 "NS IN . udp 17 false 512" NOERROR - 0 5.010711032s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.0.0.1:853: connect: connection refused
[INFO] 127.0.0.1:44416 - 48101 "NS IN . udp 17 false 512" NOERROR - 0 5.002663142s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.1.1.1:853: connect: connection refused
[INFO] 127.0.0.1:35474 - 16537 "NS IN . udp 17 false 512" NOERROR - 0 5.005863059s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.1.1.1:853: connect: connection refused
[INFO] 127.0.0.1:50759 - 43058 "NS IN . udp 17 false 512" NOERROR - 0 5.000412148s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.0.0.1:853: connect: connection refused
[INFO] 127.0.0.1:33519 - 45040 "NS IN . udp 17 false 512" NOERROR - 0 5.004900032s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.0.0.1:853: connect: connection refused
[INFO] 127.0.0.1:39739 - 59567 "NS IN . udp 17 false 512" NOERROR - 0 5.005401149s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.0.0.1:853: connect: connection refused
[INFO] 127.0.0.1:39157 - 50430 "NS IN . udp 17 false 512" NOERROR - 0 5.00624565s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.0.0.1:853: connect: connection refused
[INFO] 127.0.0.1:48527 - 47596 "NS IN . udp 17 false 512" NOERROR - 0 5.003011352s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.0.0.1:853: connect: connection refused
[INFO] 127.0.0.1:60637 - 59663 "NS IN . udp 17 false 512" NOERROR - 0 5.001902262s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.1.1.1:853: connect: connection refused

So my guess is that I'm hitting one of the issues highlighted in this report. I do have an OpenWrt-powered router with DoH enabled. Any pointers on how to normalize this are greatly appreciated.

Version info: Home Assistant 2023.10.3, Supervisor 2023.10.0, Operating System 11.0, Frontend 20231005.0 (latest)

UPDATE: the workaround from https://github.com/home-assistant/plugin-dns/issues/90#issuecomment-1119139222 did the trick for me; load average dropped from 3+ to ~0.4, and CPU temperature from 70+ to ~54. I'm open to testing any other approaches that do not require disabling the fallback logic.

agners commented 5 months ago

While looking into CoreDNS in general, I've reproduced this problem on my end today. Just adding a reject firewall rule for TCP port 853 makes it easy to reproduce.

@KevinCathcart your analysis seems spot on. Some points to add:

This means that trying to avoid health-check probes to Cloudflare by setting max_fails 0 actually made this meaningfully worse: before that, the servers would get marked as down, and only one randomly selected server would be attempted for each request.

While not explicitly stated, I guess one of the aims of https://github.com/home-assistant/plugin-dns/pull/82 was also to avoid health checks when the primary DNS server is working: essentially, make sure we don't contact Cloudflare if not necessary. But currently, the loop plug-in will still cause at least one resolve attempt. And for the reasons you've outlined, this is problematic when access to Cloudflare using DNS over TLS is blocked.

I've tested your suggestions of adding a template plug-in to answer the health checks, and of re-enabling health checks on Cloudflare (using max_fails). This works exactly as you've suggested.

It does still lead to a 30s-long storm whenever the loop plug-in probes on first startup (the 30s comes from the loop plug-in's default behavior: it "will try to send the query for up to 30 seconds"). I think it is safe to assume that Cloudflare won't lead to a DNS loop, so I went with another template plug-in to handle the loop request. This also avoids the (additional) request to the root servers through Cloudflare, and any request to Cloudflare when the primary DNS is working (so also in the case when Cloudflare using DNS over TLS is reachable).
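
For the record, a sketch of what handling the loop probe inside the fallback chain could look like (the probe's qtype is HINFO per the discussion above; the rcode is illustrative, and this is not necessarily the exact change that landed):

.:5553 {
    template IN HINFO . {   # answer the loop plugin's random-name HINFO probe locally
        rcode NOERROR       # startup probing never reaches Cloudflare or the root servers
    }
    # ... forward to Cloudflare over TLS as before, with health checks re-enabled ...
}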