home-assistant / plugin-dns

CoreDNS implementation for Home Assistant
Apache License 2.0

performance impact after dns update to 2021.06.0 #50

Closed milutt closed 2 years ago

milutt commented 3 years ago

Since the dns upgrade to 2021.06.0, my complete hassio setup has been having performance issues. I am running HAOS on a Raspberry Pi 1B. It's an old Pi, but before dns 2021.06.0 everything ran without issues and I had no reason to upgrade the hardware. Since the dns upgrade, coredns eventually gets stuck at more than 60% CPU usage, and everything else slows down until the system is unusable. Even 'ha dns restart' fails with a timeout. This also happens on a clean image install without any integrations configured. When I downgrade to dns 2021.04.0 using 'ha dns update --version 2021.04.0', CPU usage returns to normal and the whole system is responsive. Downgrading dns is not a permanent fix, as it gets automatically updated back to the latest version and the CPU load increases again.

Is there an option to permanently downgrade to dns 2021.04.0, or to disable DoT completely (in case TLS is causing too much load on the RPi 1)?

dns logs using 2021.06.01:

[INFO] 127.0.0.1:45539 - 6781 "NS IN . udp 17 false 512" NOERROR - 0 30.016215226s
[ERROR] plugin/errors: 2 . NS: tls: DialWithDialer timed out
[INFO] 127.0.0.1:47240 - 31621 "NS IN . udp 17 false 512" NOERROR - 0 30.014927277s
[ERROR] plugin/errors: 2 . NS: tls: DialWithDialer timed out
[INFO] 127.0.0.1:39746 - 20863 "NS IN . udp 17 false 512" NOERROR - 0 30.018269213s
[ERROR] plugin/errors: 2 . NS: tls: DialWithDialer timed out
[INFO] 127.0.0.1:40733 - 54343 "NS IN . udp 17 false 512" NOERROR - 0 30.006049544s
[ERROR] plugin/errors: 2 . NS: tls: DialWithDialer timed out
[INFO] 127.0.0.1:47464 - 56063 "NS IN . udp 17 false 512" NOERROR - 0 35.378188925s
[ERROR] plugin/errors: 2 . NS: tls: DialWithDialer timed out
[INFO] 127.0.0.1:33661 - 11713 "NS IN . udp 17 false 512" NOERROR - 0 30.888139468s
[ERROR] plugin/errors: 2 . NS: tls: DialWithDialer timed out
[INFO] 127.0.0.1:54301 - 4260 "NS IN . udp 17 false 512" NOERROR - 0 30.002718645s
[ERROR] plugin/errors: 2 . NS: tls: DialWithDialer timed out
[INFO] 127.0.0.1:33394 - 6453 "NS IN . udp 17 false 512" NOERROR - 0 30.032896855s
[ERROR] plugin/errors: 2 . NS: tls: DialWithDialer timed out
[INFO] 127.0.0.1:35429 - 10631 "NS IN . udp 17 false 512" NOERROR - 0 30.012585403s
[ERROR] plugin/errors: 2 . NS: tls: DialWithDialer timed out

dns logs using 2021.04.0 (everything else the same, just downgraded dns):

[INFO] 172.30.32.1:39653 - 7139 "PTR IN 43.178.168.192.in-addr.arpa. udp 45 false 512" NXDOMAIN qr,aa,rd,ra 45 0.142744402s
[INFO] 172.30.32.1:47120 - 56986 "PTR IN 1.178.168.192.in-addr.arpa. udp 44 false 512" NXDOMAIN qr,aa,rd,ra 44 0.04007399s
[INFO] 172.30.32.1:42934 - 43274 "PTR IN 2.0.17.172.in-addr.arpa. udp 41 false 512" NXDOMAIN qr,aa,rd,ra 41 0.074434124s
[INFO] 172.30.32.1:60416 - 61087 "PTR IN 80.178.168.192.in-addr.arpa. udp 45 false 512" NXDOMAIN qr,aa,rd,ra 45 0.035992093s
[INFO] 172.30.32.1:48272 - 61848 "PTR IN 84.178.168.192.in-addr.arpa. udp 45 false 512" NXDOMAIN qr,aa,rd,ra 45 0.024411385s
[INFO] 172.30.32.1:52543 - 46777 "PTR IN 43.178.168.192.in-addr.arpa. udp 45 false 512" NXDOMAIN qr,aa,rd,ra 45 0.107825254s
[INFO] 172.30.32.1:44524 - 31012 "PTR IN 1.178.168.192.in-addr.arpa. udp 44 false 512" NXDOMAIN qr,aa,rd,ra 44 0.047429792s
[INFO] 172.30.32.1:49698 - 36720 "PTR IN 2.0.17.172.in-addr.arpa. udp 41 false 512" NXDOMAIN qr,aa,rd,ra 41 0.032395175s
[INFO] 172.30.32.1:35650 - 10995 "PTR IN 80.178.168.192.in-addr.arpa. udp 45 false 512" NXDOMAIN qr,aa,rd,ra 45 0.043622889s
[INFO] 172.30.32.1:39872 - 23006 "PTR IN 84.178.168.192.in-addr.arpa. udp 45 false 512" NXDOMAIN qr,aa,rd,ra 45 0.046617813s
[INFO] 172.30.32.1:60757 - 62788 "PTR IN 43.178.168.192.in-addr.arpa. udp 45 false 512" NXDOMAIN qr,aa,rd,ra 45 0.10125965s
[INFO] 172.30.32.1:43480 - 13081 "PTR IN 1.178.168.192.in-addr.arpa. udp 44 false 512" NXDOMAIN qr,aa,rd,ra 44 0.050008839s
[INFO] 172.30.32.1:39897 - 45303 "PTR IN 2.0.17.172.in-addr.arpa. udp 41 false 512" NXDOMAIN qr,aa,rd,ra 41 0.091132884s
[INFO] 172.30.32.1:54320 - 64334 "PTR IN 80.178.168.192.in-addr.arpa. udp 45 false 512" NXDOMAIN qr,aa,rd,ra 45 0.116535295s
[INFO] 172.30.32.1:59641 - 63561 "PTR IN 84.178.168.192.in-addr.arpa. udp 45 false 512" NXDOMAIN qr,aa,rd,ra 45 0.049265856s
[INFO] 172.30.32.1:54196 - 61682 "PTR IN 43.178.168.192.in-addr.arpa. udp 45 false 512" NXDOMAIN qr,aa,rd,ra 45 0.029521276s
[INFO] 172.30.32.1:33093 - 37580 "PTR IN 1.178.168.192.in-addr.arpa. udp 44 false 512" NXDOMAIN qr,aa,rd,ra 44 0.038741049s
[INFO] 172.30.32.1:44545 - 42001 "PTR IN 2.0.17.172.in-addr.arpa. udp 41 false 512" NXDOMAIN qr,aa,rd,ra 41 0.032553201s
[INFO] 172.30.32.1:37139 - 44637 "PTR IN 80.178.168.192.in-addr.arpa. udp 45 false 512" NXDOMAIN qr,aa,rd,ra 45 2.032197136s
[INFO] 172.30.32.1:43724 - 10416 "PTR IN 84.178.168.192.in-addr.arpa. udp 45 false 512" NXDOMAIN qr,aa,rd,ra 45 0.104368439s
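To compare the two log excerpts numerically, a small parser can pull out the per-query duration. This is only a sketch: it assumes the duration is the last field of each [INFO] line, as in the excerpts above, and the names `query_durations` and `sample` are made up for illustration.

```python
import re

# Assumption: the query duration is the final field of an [INFO] line,
# written as a decimal number followed by "s" (e.g. "30.016215226s").
DURATION_RE = re.compile(r"(\d+\.\d+)s\s*$")


def query_durations(log_text):
    """Return the duration in seconds of every [INFO] query line."""
    durations = []
    for line in log_text.splitlines():
        if not line.startswith("[INFO]"):
            continue  # skip [ERROR] lines and anything else
        match = DURATION_RE.search(line)
        if match:
            durations.append(float(match.group(1)))
    return durations


# Two [INFO] lines and one [ERROR] line copied from the logs above.
sample = """\
[INFO] 127.0.0.1:45539 - 6781 "NS IN . udp 17 false 512" NOERROR - 0 30.016215226s
[ERROR] plugin/errors: 2 . NS: tls: DialWithDialer timed out
[INFO] 172.30.32.1:39653 - 7139 "PTR IN 43.178.168.192.in-addr.arpa. udp 45 false 512" NXDOMAIN qr,aa,rd,ra 45 0.142744402s
"""

durations = query_durations(sample)
slow = [d for d in durations if d >= 30.0]
print(f"{len(durations)} queries, {len(slow)} took 30 s or longer")
```

Run against the full excerpts, this would make the contrast explicit: every query in the 2021.06 log takes 30 s or more and ends in a TLS dial timeout, while all but one in the 2021.04.0 log finish in well under a second.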

core-2021.6.6
supervisor-2021.06.6
Home Assistant OS 6.1
CPU armv6l

Tallyrald commented 3 years ago

I'm experiencing the same and can't really add much else. The Pi gets bogged down by the coredns process. Maybe this has something to do with the fallback plugin fix (#47)?

I suspect something is stuck in a retry loop without any backoff/stop instruction.
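For readers unfamiliar with the term, a retry loop with backoff and a stop condition looks roughly like the sketch below. It is a generic illustration only, not how CoreDNS or the fallback plugin is actually implemented; `retry_with_backoff` and all of its parameters are hypothetical names.

```python
import time


def retry_with_backoff(query, max_attempts=5, base_delay=1.0,
                       max_delay=60.0, sleep=time.sleep):
    """Call query() until it succeeds, doubling the wait after each timeout.

    Gives up (re-raises) after max_attempts instead of retrying forever.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return query()
        except TimeoutError:
            if attempt == max_attempts:
                raise  # stop condition: no endless retry loop
            sleep(delay)
            delay = min(delay * 2, max_delay)  # exponential backoff, capped


# Demo with a query that always times out. The waits are recorded in a
# list instead of actually sleeping, so the example finishes immediately.
waits = []


def always_timeout():
    raise TimeoutError("simulated upstream timeout")


try:
    retry_with_backoff(always_timeout, sleep=waits.append)
except TimeoutError:
    print("gave up after waits of", waits)
```

Without the growing delay and the attempt limit, a permanently unreachable upstream would be retried as fast as the timeouts allow, which is the kind of behaviour suspected here.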

Tallyrald commented 3 years ago

After observing the behaviour (sadly I found no logs to back me up here), I found that most of the DNS resolution / connection errors actually happen because of the CPU overload caused by the DNS plugin. Errors start within a couple of minutes after boot; CPU usage of the coreDNS process spikes to 45% and never goes below that. Several hours later this percentage starts to climb as more and more connection errors happen. Finally HA locks up and the RPi (1B) needs a reboot.

All in all, this makes HA completely unusable. Reverting to 2021.04.0 solves the problem until the supervisor decides to auto-update plugin-dns again, so all my observations are in line with what OP described. (We seriously need manual control over updates, to be honest, especially since HA supports so many different platforms.)

I tried the same setup on a Windows machine using Hyper-V, but the problem never came up. I suspect something (fallback?) does not respect a query taking 'too long' and initiates a new query again and again. That is undesirable, since it locks up the whole service. It also makes HA partially dead whenever the network is offline, which is not uncommon given that HA is supposed to be privacy-first and cloud-free (if the user wants that).

Is there a way for me to help debug the problem? If someone could write instructions on how to develop and test the dns plugin (locally using docker, I guess), I would gladly try to help. Unfortunately I'm not a golang expert, although I am a software developer (mostly familiar with js/ts).

dMopp commented 2 years ago

It's a problem with the fallback DNS, which the developers "don't want a discussion" about... for whatever reason.

If you have a WORKING DNS setup, you could do the following:

1.) Make sure HA dns is using your own DNS server: ha dns info

host: 172.30.32.3
locals: []
servers:
- dns://<YOUDNSSERVERIP>
update_available: false
version: 2021.06.0
version_latest: 2021.06.0

If that's not the case, run ha dns options --servers dns://<YOUDNSSERVERIP>

2.) Comment out the unused fallback DNS

docker exec -it hassio_dns bash
vi /usr/share/tempio/corefile

Comment out the line

fallback REFUSED,SERVFAIL,NXDOMAIN . dns://127.0.0.1:5553

Then save, exit the container, and run:

ha dns restart

This should solve the CPU issues until the next release (as long as you have a working DNS setup; I don't know how often the template file gets overwritten, though).
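The manual edit in step 2 amounts to prefixing the fallback directive with `#`, which a Corefile treats as a comment. A minimal sketch of that transformation follows; the helper name `comment_out_fallback` is made up, and `sample_corefile` is an illustrative fragment, not the real template shipped in the container (only the fallback line is taken from this thread).

```python
def comment_out_fallback(corefile_text):
    """Return the Corefile text with every active 'fallback' line commented out."""
    out = []
    for line in corefile_text.splitlines():
        stripped = line.lstrip()
        if stripped.startswith("fallback "):
            indent = line[: len(line) - len(stripped)]
            line = f"{indent}# {stripped}"  # keep indentation, prefix with '#'
        out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical fragment for illustration; the actual template differs.
sample_corefile = """\
.:53 {
    log
    forward . dns://192.0.2.53
    fallback REFUSED,SERVFAIL,NXDOMAIN . dns://127.0.0.1:5553
}
"""

print(comment_out_fallback(sample_corefile))
```

Lines that are already commented out are left alone, so applying it twice gives the same result. Whether the change survives a plugin update is a separate question, as noted above.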

The discussion about those hardcoded Cloudflare DNS servers is completely useless, because the devs do not want to discuss it.

And no, fallback is NOT required to have HA working. I am blocking DoH and DoT for the known public servers in my firewall and have no issues at all.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Strohhutpat commented 2 years ago

Still present in 2021.12.7.

FileGo commented 2 years ago

I can confirm this; my syslog and user.log files just grew to over 20GB each because of it.

redgryphon commented 2 years ago

Could it be fixed by #82?

mdegat01 commented 2 years ago

@redgryphon I don't believe so; I noticed this on a dev system recently, even after that PR, because I forgot to unblock Cloudflare DoT for it. However, I'm closing this because the new option to disable the fallback DNS does fix it: https://github.com/home-assistant/supervisor/pull/3586