home-assistant / plugin-dns

CoreDNS implementation for Home Assistant
Apache License 2.0
17 stars 14 forks source link

High cpu usage / server reboot with 2024.03.0 #130

Closed plord12 closed 3 months ago

plord12 commented 4 months ago

I've just upgraded to 2024.03.0 and found although home assistant starts, after a short time cpu usage goes to 100% and then reboots.

I'm running -

# ha info
arch: aarch64
channel: stable
docker: 24.0.7
features:
- reboot
- shutdown
- services
- network
- hostname
- timedate
- os_agent
- haos
- resolved
- journal
- disk
- mount
hassos: "12.0"
homeassistant: 2024.3.0
hostname: arm1
logging: info
machine: odroid-c2
operating_system: Home Assistant OS 12.0
state: startup
supervisor: 2024.03.0
supported: true
supported_arch:
- aarch64
- armv7
- armhf
timezone: Europe/London

Now one thing I noticed (just before the server rebooted itself) is very high cpu usage for dns and adguard containers -

# docker stats
CONTAINER ID   NAME                                        CPU %     MEM USAGE / LIMIT     MEM %     NET I/O           BLOCK I/O         PIDS
c46026c31e53   hassio_multicast                            0.13%     1.492MiB / 1.822GiB   0.08%     0B / 0B           6.57MB / 295kB    8
161b406ce295   hassio_audio                                0.00%     12.62MiB / 1.822GiB   0.68%     238kB / 0B        25.3MB / 389kB    12
05df21d2f21c   hassio_dns                                  103.10%   470.2MiB / 1.822GiB   25.20%    7.65MB / 7.11MB   31.4MB / 311kB    24
890f15f4f911   hassio_cli                                  0.00%     1.75MiB / 1.822GiB    0.09%     247kB / 8.54kB    12.9MB / 299kB    9
fb5b9bd2cacc   hassio_supervisor                           10.11%    79.68MiB / 1.822GiB   4.27%     5.38MB / 5.85MB   113MB / 1.12MB    26
e69a4ee73a04   homeassistant                               9.48%     389.1MiB / 1.822GiB   20.85%    0B / 0B           355MB / 19.6MB    36
7f3cba1db195   hassio_observer                             0.00%     1.188MiB / 1.822GiB   0.06%     239kB / 0B        6.92MB / 4.1kB    5
ec75cfe9fcf2   addon_a0d7b954_chrony                       0.02%     3.027MiB / 1.822GiB   0.16%     244kB / 8.06kB    13.2MB / 446kB    8
62f9f4b614b8   addon_core_mosquitto                        0.06%     20.73MiB / 1.822GiB   1.11%     282kB / 53.9kB    58.1MB / 348kB    21
85a2d8f52022   addon_core_ssh                              0.00%     1.945MiB / 1.822GiB   0.10%     229kB / 1.55kB    5.39MB / 434kB    11
f3242bb21b93   addon_1315902c_signal_messenger             0.00%     13.71MiB / 1.822GiB   0.74%     223kB / 0B        60.6MB / 16.4kB   9
7dc60da19fb8   addon_a0d7b954_adguard                      75.55%    162.3MiB / 1.822GiB   8.70%     0B / 0B           74.3MB / 532kB    25
f82abd6b84f5   addon_5c53de3b_esphome                      0.33%     18.24MiB / 1.822GiB   0.98%     0B / 0B           65.6MB / 729kB    12
f661e37c9782   addon_cebe7a76_hassio_google_drive_backup   1.49%     23.48MiB / 1.822GiB   1.26%     332kB / 63.4kB    33.2MB / 1.06MB   13
26998772fdd2   addon_core_configurator                     0.02%     17.68MiB / 1.822GiB   0.95%     634kB / 6.62kB    35.7MB / 340kB    8
037138e53d18   addon_a0d7b954_wireguard                    0.00%     6.941MiB / 1.822GiB   0.37%     145kB / 3.03kB    6.41MB / 520kB    13
5aa9fcff583d   addon_f4f71350_ewelink_smart_home_slug      7.05%     47.5MiB / 1.822GiB    2.55%     551kB / 131kB     48.4MB / 106kB    25
a3ea669cad96   addon_a0d7b954_sqlite-web                   0.01%     27.32MiB / 1.822GiB   1.46%     138kB / 3.03kB    37.6MB / 483kB    11

I use adguard for both a DHCP and filtering DNS server, it also resolves hosts on the local network. So I configure homeassistant to use the adguard DNS server (so that scripts etc can use hostnames) -

# ha dns info
fallback: true
host: 172.30.32.3
llmnr: true
locals:
- dns://192.168.175.10
mdns: true
servers:
- dns://192.168.175.10
update_available: false
version: 2024.03.0
version_latest: 2024.03.0

If I change the home assistant servers to be a pulic DNS server -

# ha dns option --servers dns://194.168.4.100

Command completed successfully.

Then the high cpu usage and reboot goes away ... however this means that many scripts fail since local network hostnames are not resolved.

DNS resolution on the local network (to adguard) works as expected BTW.

Looks like a new issue with 2024.03.0 to me.

Any thoughts ? I guess I should also post to plugin-dns github issues.

(also posted to https://community.home-assistant.io/t/startup-failure-after-2024-03-0-upgrade-dns-issues/703118)

agners commented 4 months ago

That indeed sounds like an issue in the latest CoreDNS based DNS plug-in. The update was entirely version bumps (Alpine 3.19 and CoreDNS 1.8.7).

Anything specific in the DNS plug-in logs during the high CPU usage periode?

plord12 commented 4 months ago

Looks like adguard DNS is slow to start on boot ... during that time I see lots of (with varying addresses) :

[INFO] 172.30.32.1:42679 - 54849 "A IN api.forecast.solar. udp 36 false 512" - - 0 6.002127419s
[ERROR] plugin/errors: 2 api.forecast.solar. A: read udp 172.30.32.3:57283->192.168.175.10:53: i/o timeout
plord12 commented 4 months ago

Oh, early in the dns logs I also see -

services-up: info: copying legacy longrun coredns (no readiness notification)
s6-rc: info: service legacy-services successfully started
[ERROR] plugin/mdns: could not locate primary interface due to: Lookup failed due to system error: Connection refused
plord12 commented 4 months ago

It it easy to rollback to the ealier dns version to test ?

agners commented 4 months ago

Yes you can use ha dns update --version 2023.06.2.

Meanwhile I tried to recreate your setup, but I have not observed elevated CPU load yet. However, I only set servers to the Home Assistant installation, I haven't reconfigured my DHCP (and hence locals is still my router).

plord12 commented 4 months ago

Yes you can use ha dns update --version 2023.06.2.

Thanks - I've not had to do that before.

However, even after this I still saw a case of high cpu and server eventually rebooting. So I'm wondering if there is a combination here.

Note that this is tough to debug ... I need the server running plus there is only a small window to run commands before the server reboots.

To keep me running, I've updated home assistant to reference my secondary DNS server (external to home assistant) so during boot hassio_dns isn't trying to query adguard.

Aside from possible hassio_dns bugs, I'm wondering if this configuration is a bad idea since there will always be a startup ordering issue. Maybe :

  1. hassio_dns should wait for adguard to be up (or, more generally, the upstream DNS server is working)
  2. move adguard to a different server to avoid boot up dependencies (would be a shame)
  3. if we have adguard, do we need hassio_dns running ? can we run with only adguard DNS ?
  4. do I need a faster server (more cores)
  5. is there better coredns config (that doesn't keep re-trying failed lookups, or at least a little slower)
agners commented 4 months ago

However, even after this I still saw a case of high cpu and server eventually rebooting. So I'm wondering if there is a combination here.

Thanks for confirming! So it seems this just coincided somehow? Did you do other changes to your system?

Note that there are also other issues reporting high CPU usage, e.g. #124.

Aside from possible hassio_dns bugs, I'm wondering if this configuration is a bad idea since there will always be a startup ordering issue. Maybe :

Yeah a system being it's own primary DNS server is generally not ideal indeed. But it is a popular setup, and afaik it works for other folks.

I just checked my local test installation with AdGuard, and now noticing elevated CPU usage as well. Looking at logs it seems it entered a loop of PTR requests:

Mar 13 09:05:05 ha-shelf3-odroid-c2 addon_a0d7b954_adguard[401]: 2024/03/13 10:05:05.023703 [error] dnsproxy: 172.30.32.3:53: response received over udp: "exchanging with 172.30.32.3:53 over udp: read udp 172.30.32.1:57046->172.30.32.3:53: i/o timeout"
Mar 13 09:05:05 ha-shelf3-odroid-c2 addon_a0d7b954_adguard[401]: 2024/03/13 10:05:05.023733 [error] dnsproxy: upstream 172.30.32.3:53 failed to exchange ;4.d.e.0.0.0.0.0.0.0.0.0.0.0.0.0.7.0.0.0.c.e.5.9.c.a.a.8.a.7.d.f.ip6.arpa.        IN         PTR in 2.002714299s: exchanging with 172.30.32.3:53 over udp: read udp 172.30.32.1:57046->172.30.32.3:53: i/o timeout
Mar 13 09:05:05 ha-shelf3-odroid-c2 hassio_dns[401]: [INFO] 127.0.0.1:48246 - 26210 "PTR IN 4.d.e.0.0.0.0.0.0.0.0.0.0.0.0.0.7.0.0.0.c.e.5.9.c.a.a.8.a.7.d.f.ip6.arpa. udp 101 true 2048" NXDOMAIN qr,rd,ra 90 0.00516157s
Mar 13 09:05:05 ha-shelf3-odroid-c2 hassio_dns[401]: [INFO] 172.30.32.1:51696 - 16043 "PTR IN 4.d.e.0.0.0.0.0.0.0.0.0.0.0.0.0.7.0.0.0.c.e.5.9.c.a.a.8.a.7.d.f.ip6.arpa. udp 101 true 2048" NXDOMAIN qr,rd,ra 90 6.012442759s

It seems to me the two DNS servers are referencing each other? :thinking: Indeed, I see 172.30.32.3:53 as upstream DNS server, which is CoreDNS. And since CoreDNS is pointing to the local system too, it seems my system entered a DNS loop :cry:

plord12 commented 4 months ago

It seems to me the two DNS servers are referencing each other? 🤔 Indeed, I see 172.30.32.3:53 as upstream DNS server, which is CoreDNS. And since CoreDNS is pointing to the local system too, it seems my system entered a DNS loop 😢

Ah, thanks for replicating. Glad I'm not alone !

agners commented 4 months ago

It seems that CoreDNS got picked up from the local /etc/resolv.conf. We set the IP address of the DNS plug-in (CoreDNS) as default name server for each add-on. That is also true for AdGuard, hence that potential of a loop.

Under Settings -> DNS Settings I have the default Quad 9 http server as upstream DNS server. But "Use private reverse DNS resolvers" is checked, and under "Private reverse DNS servers" the following is listed:

By default, AdGuard Home uses the following reverse DNS resolvers: "172.30.32.3:53".

It seem AdGuard automatically picked up the DNS server from /etc/resolv.conf, which caused the DNS loop.

Disabling "Use private reverse DNS resolvers" seems to fix the loop in my setup.

I've now set my router as private reverse DNS resolver, that should work too (as long as you don't set the Home Assistant machine adding the AdGuard add-on as upstream on the router as well :sweat_smile: .

Reading the documentation of AdGuard, it says:

Ensure your Home Assistant device has a static IP and static external DNS servers! This is important! You WILL end up having issues if you skip this step.

:man_shrugging:

plord12 commented 4 months ago

Disabling "Use private reverse DNS resolvers" seems to fix the loop in my setup.

Brilliant ! Works for me. I'm not have on the latest versions and no high cpu / reboots.

I'll raise an issue against the adguard plugin about Disabling "Use private reverse DNS resolvers".

plord12 commented 4 months ago

Also raised to https://github.com/hassio-addons/addon-adguard-home/issues/501

sarmbruster commented 4 months ago

I've opened a thread in HA community https://community.home-assistant.io/t/dns-not-running-weird-messages-in-protocols-error-relocation-bin-bash/703837

My assumption is that aarch64-hassio-dns:2023.03.0 image is not using the correct architecture for its binaries to run on a Odroid device.

The same seems to apply for ghcr.io/home-assistant/odroid-n2-homeassistant:2024.3.0 image as well.

agners commented 4 months ago

@sarmbruster your case seems to have a different root cause, and it seems to me that is limited to your instance. Therefor i don't think it belongs to the issue tracker here. I've responded on the community thread.

Tahutipai commented 4 months ago

Also had crazy high CPU usage etc.

Resolved issue by ssh'ing into HA and running: ha dns options --fallback=false

agners commented 3 months ago

As the OP's problem is caused by the AdGuard Home add-on, I am closing this issue.

@Tahutipai your issue is tracked by #90.