Closed aleks-mariusz closed 1 month ago
Below, I captured a tcpdump during the incident over the weekend, just of all icmp6 packets (the dhcpv6 traffic itself is not captured) - I am posting two versions, one annotated by myself (with the commentary of what i believe is happening), as well as an unannotated/unabridged version (if you don't want my comments/interpretation of the traffic to bias your investigation).
Note i am using a 'diff-like output' to colorize the different "speakers" of each ICMPv6 packet to make it easier to follow:
The hosts involved and appearing in the below captures are color-coded as follows (along with their IPv6 addresses, both link-local and global-unicast):
+ router-main:bd:10 - provides IPv6 connectivity normally, via SLAAC-enabling router-advertisements proposing the client uses DHCPv6
+ 2a02:aaaa:bbbb:2220:: - my statically assigned address for this VLAN's interface - what the HA vm starts impersonating
+ fe80::c0bf:3704:2f17:8093 - link-local stable-privacy address prior-to--reboot
+ fe80::6c0d:8563:da46:ca9c - link-local stable-privacy address after-reboot
+ fe80::213f:816a:a47b:30e4 - link-local stable-privacy address pre-issue-fixed
+ fe80::33b0:f83f:9c09:f500 - link-local stable-privacy address pre-issue-fixed
+ fe80::9bec:db29:40d6:3f66 - link-local stable-privacy address after-issue-fixed
- server-kvm_:e1:b8 - this is my NAS which is the hypervisor for the home-assistant VM
- fe80::d250:99ff:fe6f:e1b8 - link-local eui64-based address
- 2a02:aaaa:bbbb:2220:d250:99ff:fe6f:e1b8 - slaac-assigned eui64-based address
! vm-homeastn:a5:c6 - this is the actual VM running home-assistant OS
! fe80::a8b4:7a51:21a4:a9ba - link-local eui64-based address
! 2a02:aaaa:bbbb:2220::11 - dhcpv6-assigned global address
! 2a02:aaaa:bbbb:2220:6bf3:b959:29ea:d161 - slaac-assigned address
@@ macbook-pro:0c:20 - this is my macbook which is also impacted by this issue ---@@
@@ fe80::1430:f31f:50e3:e9d0 - link local ---@@
@@ 2a02:aaaa:bbbb:2220::1 - DHCPv6 assigned address ---@@
@@ 2a02:aaaa:bbbb:2220:7c31:b848:97e:7840 - SLAAC-assigned privacy address ---@@
# and finally this contains comments w/ my understanding of what's happening on the following line
pre-14:00 - everything working normally
14:00:00.000 - the router has been rebooted
14:01:08.391 - the HA vm begins impersonating the router's assigned IPv6 gua address (answering to another host on network, a macbook)
14:01:13.500 - the router has sent its first ICMPv6 message since booting back up
14:01:52.257 - final HA-vm's impersonation of router, at this point it is rebooted.
14:25:04.935 - the router performs its DAD again, successfully - things from then on continue to work normally.
14:25:06.154 - the HA vm starts coming back up
14:25:26.377 - the HA vm appears to try to impersonate again, but this may have been after the DAD timeout expired
Ok i feel a bit silly but, i've now tested this with a plain Ubuntu 22.04 VM also on the same KVM bridge acting the same way.. so it may not be HA-related after all :-(
> Ok i feel a bit silly but, i've now tested this with a plain Ubuntu 22.04 VM also on the same KVM bridge acting the same way.. so it may not be HA-related after all :-(
I was going to say, when it comes to networking, we pretty much run vanilla Linux. Our kernel might have a slightly different config, but I would not expect that this changes core IPv6 address assignment behavior. On top of that we use NetworkManager 1.44.2.
I've recently noticed that Docker actually blocks multicast forwarding on all bridges on the system. Depending on when the Docker daemon gets started, this can lead to IPv6 working at first, but then failing after a while (see https://github.com/moby/moby/issues/48365). But if you don't use Docker, this is unlikely to be related.
Maybe you accidentally caused a Layer 2 level loop or something :thinking: ? Make sure to enable STP etc.
Maybe you also want to try disabling multicast snooping. There are some reports of such issues in the Linux bug tracker (see https://bugzilla.kernel.org/show_bug.cgi?id=99081#c14).
> But if you don't use Docker, this is unlikely related
wait, I was under the impression that Docker is always used on HA-os as the supervisor ends up starting all the components (including homeassistant-core) in separate containers.
> Make sure to enable STP etc
Had a look, but it's enabled by default on the bridge automatically.
> some reports of such issues in the Linux bug tracker
Thanks, I've looked at these already but their specific symptoms are not exactly like mine, where a random VM starts impersonating another IPv6 identity on the network. I can only get this to happen with the router (e.g. rebooting one or the other of the VMs doesn't make it impersonate the other). Also thanks for telling me what entity is controlling the network. The Ubuntu VM i also saw doing this strangeness is using netplan and not NetworkManager. I still suspect there's something funky going on at the bridge-level unless it's a linux-kernel level issue (at this point, i'm just grasping at straws trying to find commonality).
Anyway thanks for entertaining my strange issue, even though it's very unlikely to be directly related to this project. I'll post on reddit or something under the KVM section, see if anyone has an idea there as well.
> > But if you don't use Docker, this is unlikely related
>
> wait, I was under the impression that Docker is always used on HA-os as the supervisor ends up starting all the components (including homeassistant-core) in separate containers.
Yeah that is inside HAOS. But what I am talking about is on your Virtual Machine host: It can influence the bridge on the VM host system...
Ok so I have come to realize the underlying cause of this weirdness. I appreciate this is not an HA related issue at all, but it was quite an interesting lesson in IPv6 networking, so i figure i'll share my findings, and express my sincere thanks to you folks (esp @agners) for letting me use this issues-tracker as a forum for venting/sharing the tcpdump at least. It's been an excellent learning experience:
Cause: It's because i chose for the router's static assignment an IPv6 address that ends in just '::' in the first place, which is a reserved anycast address, and it seems bad things can happen when you use it for unicast purposes :man_facepalming:
Solution: I changed my ip6ifaceid (which is the "ipv6-address interface-id" - the part that generates the suffix of the IPv6 address) from the neat-looking '::' to anything else (i chose '::ffff') and the issue stopped. I had only chosen '::' originally because it looked neat, without realizing the "ticking-timebomb" that i had introduced once i had VMs up on the network (and probably it could happen w/ any linux host really, VM or not).
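For anyone wanting to check their own config: the all-zero interface ID inside a prefix is what RFC 4291 calls the Subnet-Router anycast address. Below is a minimal Python sketch of that check. It assumes a /64 prefix (the prefix length is not stated in this thread), and the helper name `is_subnet_router_anycast` is made up for illustration.

```python
import ipaddress


def is_subnet_router_anycast(addr_with_prefix: str) -> bool:
    """Return True if the address has an all-zero interface ID for its prefix.

    Hypothetical helper: it only compares the address with the network
    address of its own prefix, which is how RFC 4291 defines the
    Subnet-Router anycast address.
    """
    iface = ipaddress.IPv6Interface(addr_with_prefix)
    # /127 and /128 prefixes are skipped here, since every /128 address
    # would otherwise trivially match its own "network address".
    if iface.network.prefixlen >= 127:
        return False
    return iface.ip == iface.network.network_address


# Assumption: the VLAN uses a /64, as is usual for SLAAC-enabled networks.
print(is_subnet_router_anycast("2a02:aaaa:bbbb:2220::/64"))      # True
print(is_subnet_router_anycast("2a02:aaaa:bbbb:2220::ffff/64"))  # False
```

Something like this could serve as a guard-rail before saving a static interface ID, though whether a given router UI should reject or merely warn is a separate design question.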
Seems this [2xxx:aaaa:bbbb::] address i chose to use is possibly a "reserved" anycast address, and for whatever reason, both my VMs have it in their routing table regardless (even after the fix):
user@vmname:~$ ip -6 route show table local | grep anycast | grep 2a02
anycast 2a02:aaaa:bbbb:2220:: dev ens2 proto kernel metric 0 pref medium
...at least until my router goes down, at which point it stays only on the one VM that started impersonating the router's address.
I have no idea what mechanism triggers this (seemingly default) automated behaviour only once the router went down, so if anyone has any idea, i'd be game to understand the underlying behavioural-cause, but at least I have a fix for now.
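To run the same check across several hosts, the output of `ip -6 route show table local` can be filtered in a script rather than with `grep`. The sketch below is only an illustration: the function name `anycast_routes` is hypothetical, and it assumes the line layout shown above (`anycast <address> dev <interface> ...`).

```python
def anycast_routes(route_output: str) -> list[tuple[str, str]]:
    """Extract (address, interface) pairs from anycast entries.

    Assumes each relevant line looks like the one quoted in this thread:
    'anycast <address> dev <interface> proto kernel ...'
    """
    routes = []
    for line in route_output.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[0] == "anycast" and fields[2] == "dev":
            routes.append((fields[1], fields[3]))
    return routes


# Sample line copied from the output above.
sample = "anycast 2a02:aaaa:bbbb:2220:: dev ens2 proto kernel metric 0 pref medium"
print(anycast_routes(sample))  # [('2a02:aaaa:bbbb:2220::', 'ens2')]
```

On a live system the input would come from running the `ip` command (for example via `subprocess.run`), which is left out here to keep the sketch self-contained.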
Upon further research, this behaviour might be triggered by HAOS seemingly being configured "out-of-the-box" as an IPv6 router (sysctl shows `net.ipv6.conf.all.forwarding=1`).
Since I don't remember ever changing this, is this a default setting? I think this is what causes this VM to behave in this unexpected way. Interestingly, my other culprit VM (a vanilla Ubuntu 22.04 install) is also configured this way, albeit there the change was intentional.
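A quick way to see whether a Linux host is acting as an IPv6 router is to read the same sysctl from `/proc`. As far as I understand it, a Linux host with forwarding enabled also takes on the Subnet-Router anycast address of each prefix it sits on, which would fit the symptoms here, but treat that as an assumption rather than a confirmed explanation. The helper name `ipv6_forwarding_enabled` and the `proc_root` parameter are made up for this sketch.

```python
from pathlib import Path


def ipv6_forwarding_enabled(iface: str = "all", proc_root: str = "/proc/sys") -> bool:
    """Return True if net.ipv6.conf.<iface>.forwarding is set to 1.

    proc_root is a hypothetical parameter so the lookup can be pointed at
    a different directory tree (for example when testing).
    """
    path = Path(proc_root) / "net" / "ipv6" / "conf" / iface / "forwarding"
    return path.read_text().strip() == "1"


if __name__ == "__main__":
    try:
        print("IPv6 forwarding enabled:", ipv6_forwarding_enabled())
    except OSError:
        # Not Linux, or IPv6 is disabled, so the sysctl tree is missing.
        print("could not read the IPv6 forwarding sysctl")
```

Passing an interface name such as `ens2` instead of `all` reads the per-interface value.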
@agners any idea if/why is HAOS set up this way?
> @agners any idea if/why is HAOS set up this way?
Yes, HAOS enables IPv6 forwarding by default. The reason is that it makes it possible to turn Home Assistant OS into a Thread Border router. See also the relevant PR #1832.
FWIW, I think recent Docker versions enable IPv6 forwarding as well.
Do you know why/how that influences the address setup exactly? :thinking: Sorry, just now saw your previous comment. Interesting indeed :sweat_smile:
A while ago I also realized that enabling IPv6 routing affected the IPv6 router reachability probe (see #2434). We currently carry a patch which reverts that behavior.
> Cause: It's because i chose for the router's static assignment an IPv6 address that ends in just '::' in the first place, which is a reserved anycast address and seems bad things can happen when you use it for unicast purposes :man_facepalming:
So that is OpenWrt right? So you used `::` as IPv6 suffix? Given that is a reserved anycast address, maybe OpenWrt should actually warn/disallow that setting? :thinking: :sweat_smile:
> HAOS enables IPv6 forwarding by default. The reason is that it makes it possible to turn Home Assistant OS into a Thread Border router
interesting - i would have personally preferred this to be enabled only if you actually need to use thread though? I get that thread relies on IPv6 - but i wager that the HA users who actually have thread in their environment are not the majority, so having it on-by-default out of the box seems like a lot, just my two cents tho. I can at least see where this came from though.
> FWIW, I think recent Docker version enable IPv6 forwarding as well.
Truth be told, I've stopped using docker entirely, in favor of podman - i don't see a reason to rely on a third-party package with its own daemon when all it's doing is relying on built-in linux kernel features (namespaces/cgroups).. I guess podman itself is an extra package to be installed, but at least it doesn't need an extra daemon running as root. Whether the same IPv6 issue exists with podman, i'm really not sure as i haven't played with IPv6 inside containers - i do know that podman requires a somewhat recent kernel (it mostly worked on el7's 3.10 but only for IPv4 - something newer would be needed for IPv6 i reckon), FWIW.
> ...maybe OpenWrt should actually warn/disallow that setting
Yes, the victim was an OpenWRT-based router - the UI actually has a default of ::1 - i overrode it thinking i was clever, but i guess it should probably warn the user, that's a great idea for a PR ! I have already updated the openwrt-docs at least about these findings... I'll ask if there is an appetite for a PR to put better guard-rails on the UI at least.
Describe the issue you are experiencing
I've just started playing with HomeAssistant, and so to dip my toes, I created a VM back in July on my NAS (which runs Linux, using libvirtd/kvm). I haven't done much with HASS (other than set up a TLS cert, install HACS and the ssh-addon), so I'm pretty new to HA, but quite comfortable w/ linux/containers/networking (it's my day job).
I have dual-stack (IPv4 + IPv6) on my home-network, and I noticed that my IPv6 connectivity started to act "funny" around the time i set up HA, which seems not to have been a coincidence.. I have been having issues for months on my home network related to IPv6 connectivity breaking seemingly "randomly" (i only now just figured out what triggers it).
So of course I logged into my home-router (running OpenWRT) and I noticed the following log messages on my home-router:
What seems to be happening is that the router fails its attempt at duplicate-address detection because my HomeAssistant VM (going by its mac-address) is advertising itself as the owner of the address the router is trying to use. I've of course double-and-triple-checked that the home-assistant VM is not configured to use that address.
This seems to be triggered only when i reboot my home-router (something i do a few times a month to keep up to date with patches).. When the router boots back up, it is suddenly unable to assign itself the configured IPv6 address because, for some reason, when it does DAD and asks if anyone has the address, the home-assistant VM starts responding to the router's neighbor-solicitation requests (this use of neighbor solicitation is standard duplicate-address-detection: before any IPv6 host assigns itself an address, it asks if anyone else is using it first).
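For context on how the DAD probe reaches other hosts: the neighbor solicitation is sent to the solicited-node multicast group derived from the last 24 bits of the address being tested (RFC 4291), so any host that has joined that group will see it. The sketch below computes that group for the addresses mentioned in this issue; the function name `solicited_node_multicast` is hypothetical.

```python
import ipaddress


def solicited_node_multicast(addr: str) -> ipaddress.IPv6Address:
    """Return the solicited-node multicast address for an IPv6 address.

    The group is ff02::1:ff00:0/104 with the low 24 bits of the target
    address appended.
    """
    target = ipaddress.IPv6Address(addr)
    low24 = int(target) & 0xFFFFFF
    base = int(ipaddress.IPv6Address("ff02::1:ff00:0"))
    return ipaddress.IPv6Address(base | low24)


# The router's configured address and the HA VM's DHCPv6 address.
print(solicited_node_multicast("2a02:aaaa:bbbb:2220::"))    # ff02::1:ff00:0
print(solicited_node_multicast("2a02:aaaa:bbbb:2220::11"))  # ff02::1:ff00:11
```

Filtering a capture on the first of these groups should isolate the router's DAD probes from the rest of the ICMPv6 traffic.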
Now i've made sure the home-assistant VM has been allocated an entirely different static dhcpv6 assignment, so it's unclear WHY it starts behaving this way once the router goes down. But because my home-router (when it boots back up) can no longer assign itself the IPv6 address on the VLAN interface for the network HA is on, none of my hosts on the vlan have working IPv6, since DHCPv6 stops working.
The only fix is to reboot the HA vm guest.
What i'm aiming to find out is, why is HAOS doing this?
(tcpdump captures in the next comment, otherwise i get a max-content error > 65536 characters long)
What operating system image do you use?
generic-x86-64 (Generic UEFI capable x86-64 systems)
What version of Home Assistant Operating System is installed?
Home Assistant OS 13.1
Did the problem occur after upgrading the Operating System?
No
Hardware details
output of `lscpu` on VM
```
# lscpu
Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          36 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   8
  On-line CPU(s) list:    0-7
Vendor ID:                GenuineIntel
  BIOS Vendor ID:         Red Hat
  Model name:             Westmere E56xx/L56xx/X56xx (IBRS update)
    BIOS Model name:      RHEL-8.6.0 PC (Q35 + ICH9, 2009) CPU @ 2.0GHz
    BIOS CPU family:      1
    CPU family:           6
    Model:                44
    Thread(s) per core:   1
    Core(s) per socket:   1
    Socket(s):            8
    Stepping:             1
    BogoMIPS:             4799.99
    Flags:                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx rdtscp lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq ssse3 cx16 sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes rdrand hypervisor lahf_lm 3dnowprefetch cpuid_fault pti ibrs ibpb stibp tsc_adjust smep erms arat umip md_clear arch_capabilities
Virtualization features:
  Hypervisor vendor:      KVM
  Virtualization type:    full
Caches (sum of all):
  L1d:                    256 KiB (8 instances)
  L1i:                    256 KiB (8 instances)
  L2:                     32 MiB (8 instances)
  L3:                     128 MiB (8 instances)
NUMA:
  NUMA node(s):           1
  NUMA node0 CPU(s):      0-7
Vulnerabilities:
  Gather data sampling:   Not affected
  Itlb multihit:          Not affected
  L1tf:                   Mitigation; PTE Inversion
  Mds:                    Mitigation; Clear CPU buffers; SMT Host state unknown
  Meltdown:               Mitigation; PTI
  Mmio stale data:        Unknown: No mitigations
  Reg file data sampling: Not affected
  Retbleed:               Not affected
  Spec rstack overflow:   Not affected
  Spec store bypass:      Not affected
  Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:             Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Retpoline
  Srbds:                  Not affected
  Tsx async abort:        Not affected
```
output of `free` (memory-info) on VM
```
# free -m
               total        used        free      shared  buff/cache   available
Mem:            3900         662        1864           1        1373        3175
Swap:           1287           0        1287
```
output of `virsh dumpxml homeassistant` on hypervisor
Steps to reproduce the issue
Anything in the Supervisor logs that might be useful for us?