home-assistant / plugin-dns

CoreDNS implementation for Home Assistant
Apache License 2.0

HAOS resolv.conf has wrong search path in homeassistant container #118

Open ToxicFrog opened 10 months ago

ToxicFrog commented 10 months ago

Describe the issue you are experiencing

I've been having DNS problems with a newly set up HAOS install: it can't resolve any local hostnames. The logs show it trying to resolve names like timelapse.local.hass.io rather than just timelapse or using the DNS search path provided by DHCP.

ha dns info is fine and nslookup works as expected on both the host and in the hassio_dns container. However, looking inside the main homeassistant container we see:

$ docker exec -it homeassistant cat /etc/resolv.conf
search local.hass.io
nameserver 172.32.30.3

That is definitely not the correct DNS search path, and it doesn't match the one in the host system or the DNS container! This looks similar to home-assistant/operating-system#454, but that was fixed years ago.
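
For anyone who wants to run the same comparison, here is a minimal sketch that pulls the search domains out of resolv.conf-style input. The function name search_domains is made up for illustration, and the container names in the comments are simply the ones from this report.

```shell
# Print the domains from the last "search" line of resolv.conf-style input
# read on stdin, one per line. Prints nothing if there is no "search" line.
search_domains() {
  awk '$1 == "search" { n = split($0, f, " ") }
       END { for (i = 2; i <= n; i++) print f[i] }'
}

# Hypothetical usage, assuming the container names from this report:
#   docker exec homeassistant cat /etc/resolv.conf | search_domains
#   docker exec hassio_dns cat /etc/resolv.conf | search_domains
```

If the two commands print different domains, the containers are not sharing a search path.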

Furthermore, I can't even fix this by editing /etc/resolv.conf in the container, because it gets overwritten every time HA restarts. As a result, HA is basically nonfunctional for me right now.

What operating system image do you use?

ova (for Virtual Machines)

What version of Home Assistant Operating System is installed?

10.5

Did you upgrade the Operating System?

No

Steps to reproduce the issue

  1. Install HAOS in an environment where DHCP provides a local DNS search path.
  2. Configure HA with a hostname that matches that search path. (I suspect this part isn't necessary and you can name it whatever you like.)
  3. Add an integration like MPD using an unqualified hostname.
  4. Observe as it fails to talk to the device. Check the DNS logs and see lots of NXDOMAIN for hostname.local.hass.io rather than whatever your local DNS search path is.

Anything in the Supervisor logs that might be useful for us?

Nope. I was hoping for a nice "overwriting DNS configuration in container" smoking gun or something.

Anything in the Host logs that might be useful for us?

No.

System information

System Information

version core-2023.8.3
installation_type Home Assistant OS
dev false
hassio true
docker true
user root
virtualenv false
python_version 3.11.4
os_name Linux
os_version 6.1.45
arch x86_64
timezone America/Toronto
config_dir /config
Home Assistant Cloud
logged_in false
can_reach_cert_server ok
can_reach_cloud_auth ok
can_reach_cloud ok

Home Assistant Supervisor
host_os Home Assistant OS 10.5
update_channel stable
supervisor_version supervisor-2023.08.1
agent_version 1.5.1
docker_version 23.0.6
disk_total 30.8 GB
disk_used 4.9 GB
healthy true
supported true
board ova
supervisor_api ok
version_api ok
installed_addons File editor (5.6.0), Whisper (1.0.0), Piper (1.3.2)

Dashboards
dashboards 1
resources 0
mode auto-gen

Recorder
oldest_recorder_run August 14, 2023 at 10:03 PM
current_recorder_run August 23, 2023 at 11:51 PM
estimated_db_size 15.50 MiB
database_engine sqlite
database_version 3.41.2

Additional information

No response

ToxicFrog commented 10 months ago

As a workaround, you can automate the fixing of resolv.conf on startup. Replace example.net with your real DNS search path.

There's probably a more elegant way to do this that dynamically fetches the correct search path when the container starts up, but I don't know what it is.

/config/tools/fix-dns

#!/usr/bin/env bash

# For some reason sed -i doesn't work inside the container, so we need
# this little dance
sed -E 's,^search .*,search example.net,' /etc/resolv.conf > /tmp/$$
cat /tmp/$$ > /etc/resolv.conf
rm /tmp/$$

/config/configuration.yaml

shell_command:
  fix_dns: 'bash /config/tools/fix-dns'

And then in Automations, create one with the trigger "HomeAssistant starts" and the action "call service shell_command.fix_dns".
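
A slightly more general variant of the script, as a sketch only: it takes the file and the search domains as arguments and also handles a resolv.conf with no search line. The function name set_search is invented here. It keeps the "write through the existing file" approach, on the assumption that sed -i fails because /etc/resolv.conf is bind-mounted into the container and so cannot be replaced by a rename.

```shell
# set_search FILE DOMAIN...
# Replace any "search" line in FILE with "search DOMAIN...", appending the
# line if none existed.
set_search() {
  file=$1
  shift
  tmp=$(mktemp) || return 1
  # Keep every line that is not a "search" line, then add the new one.
  grep -v '^search[[:space:]]' "$file" > "$tmp"
  printf 'search %s\n' "$*" >> "$tmp"
  # Overwrite the contents of the existing file rather than renaming a new
  # file over it (assumption: a rename is what breaks sed -i here).
  cat "$tmp" > "$file"
  rm -f "$tmp"
}

# Hypothetical usage inside the container:
#   set_search /etc/resolv.conf example.net
```

The new search line ends up after the nameserver lines, which makes no difference to the resolver.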

ToxicFrog commented 10 months ago

Update: the above doesn't work as well as I might hope, because some integrations fire off before the fix script runs -- so e.g. if you have a cmus media sink, it'll try to connect to it before resolv.conf is repaired, and fail. Some of these, like MQTT, will retry and succeed, but cmus doesn't seem to.

agners commented 10 months ago

The containers in Home Assistant use the DNS plug-in, which in turn uses CoreDNS to resolve hostnames. I am transferring the issue to that plug-in.

pvizeli commented 10 months ago

Hass.io is a closed system and container orchestrator. If you want to access an external system, use the fully qualified name. That is by design and not a bug.

KevinCathcart commented 8 months ago

If you want to access an external system, use the fully qualified name. That is by design and not a bug.

It may not be a bug, but allowing this to work could be a desirable feature: it would remove a difference between core/docker installs and HAOS, it would resolve what looks like inconsistent behavior in HAOS, and it looks really easy to do and quite low risk.

Currently, using bare hostnames for external devices via DHCP-provided DNS search paths works just fine with core and docker installs, but doesn't work with HAOS or Supervised installations. This adds undesirable friction for people who want to migrate to HAOS from core or container.

Furthermore, using raw hostnames for devices with HAOS sometimes seems to work and sometimes doesn't. This is because it works for devices that support LLMNR, but not for others.

So how could this be enabled in a simple way with minimal risk? To find out, let's look at what happens if you try to resolve a single-label name (myname) from within core or an addon running under supervisor.

  1. musl or glibc will notice that it is a single-label name and will see the search path specified in /etc/resolv.conf.
  2. It will try to resolve myname.local.hass.io. via the DNS protocol, talking to CoreDNS.
    1. CoreDNS will notice the .local.hass.io suffix and will try to look this up as the name of a container. This will fail, returning NXDOMAIN.
  3. The libc will now try to resolve myname. via the DNS protocol, talking to CoreDNS.
    1. This time the mdns plugin will kick in, since this is a single-label name.
    2. It will ask the host's systemd-resolved to resolve myname.
    3. systemd-resolved on the host will determine the candidate protocols:
      1. LLMNR is a candidate because the name is single-label.
      2. mDNS is not a candidate.
      3. DNS is considered a candidate because the name is single-label and a search list exists on the host.
    4. systemd-resolved will try the candidate protocols.
      1. LLMNR will only find the device if it supports LLMNR.
      2. DNS won't find it because systemd-resolved was passed myname. with the trailing dot, which disables use of the search path. Further, systemd-resolved will refuse to send A or AAAA queries for myname. via the DNS protocol because the (strongly discouraged) ResolveUnicastSingleLabel setting is not enabled. If systemd-resolved had been passed myname (with no dot) instead, it would have used the host's DHCP-derived suffix list.
    5. The mdns plugin will return whatever systemd-resolved returned, without further fallbacks.
  4. musl/glibc will accept the result from CoreDNS, since there is nothing left to try.
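
The order of queries in steps 2 and 3 can be sketched as a small shell function. This is a simplified model rather than libc's actual code: the name lookup_order is invented, ndots is assumed to be 1, and only the single-label case is modelled in full.

```shell
# lookup_order NAME DOMAIN...
# Print, in order, the fully qualified names a libc resolver would query for
# NAME given the search list DOMAIN... (simplified model, ndots assumed 1).
lookup_order() {
  name=$1
  shift
  case $name in
    *.*)
      # Trailing dot or multi-label: queried as given first. Any later
      # fallback differs between libcs and is left out of this sketch.
      echo "${name%.}."
      return
      ;;
  esac
  # Single label: each search domain first, then the bare name.
  for domain in "$@"; do
    echo "$name.$domain."
  done
  echo "$name."
}

lookup_order myname local.hass.io
# prints:
#   myname.local.hass.io.
#   myname.
```

The second line of output is the query that reaches the mdns plugin with its trailing dot.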

So if the mdns plugin had instead removed the "." suffix (with code like hostname = strings.TrimSuffix(hostname, ".")) before passing the name to systemd-resolved on the host, then the host's search suffixes would be used.
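
For illustration, the same transformation in shell (the plugin itself is Go, so this is only a stand-in, and strip_root_dot is a made-up name). The resolvectl commands in the comments are one way the difference might be checked on a host running systemd-resolved. I have not verified that resolvectl is available on the HAOS host.

```shell
# strip_root_dot NAME
# Drop a single trailing dot, the shell counterpart of
# strings.TrimSuffix(hostname, ".") in Go.
strip_root_dot() {
  printf '%s\n' "${1%.}"
}

strip_root_dot myname.    # prints: myname

# Possible check on a host with systemd-resolved (not run here):
#   resolvectl query myname.   # trailing dot: search list not applied
#   resolvectl query myname    # no dot: search list applied
```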

Why do I claim this is low risk? First of all, it cannot affect any single-label names for containers, as those will get tried first. This change also cannot affect any queries that the mdns plugin declines to process, so it is limited to affecting mdns and single-label names. systemd-resolved treats names ending with a dot and those without identically, except for DNS search list processing, which only gets applied for single-label names.