IPv6 internet connectivity issues

darkxst commented 1 year ago

Describe the issue you are experiencing

I am having issues with IPv6 internet connectivity from within HA. DNS and IPv6 across the local LAN are working as expected, but there is no internet connectivity via IPv6.

I am running this instance in a VM using bridged networking. HA ipv6 is configured to auto. Also tested on a clean install of HA 9.5 which also has the same issue.

I have tested IPv6 on my network using Ubuntu, Windows 11 and a variety of Debian/Ubuntu-based VMs, and all of those work perfectly. So I suspect this is an issue with the config in HA, or perhaps some default settings in Alpine Linux are different.

I have noticed it affecting the following

Import of blueprints from github
Startup of Matter add-on failing to download certificates from github

However, updates etc. are working fine.

What operating system image do you use?

generic-x86-64 (Generic UEFI capable x86-64 systems)

What version of Home Assistant Operating System is installed?

10.1

Did you upgrade the Operating System?

Yes

Steps to reproduce the issue

On boot there is no ipv6 connectivity

# ping -6 google.com
No response from google.com

If I do an outbound ping from HA to the router, ipv6 starts working for some period of time (if I come back sometime later it will have stopped working again.)

# ping -6 fritz.box
fritz.box is alive!
# ping -6 google.com
google.com is alive!

Anything in the Supervisor logs that might be useful for us?

None, downloads fail with Timeout

Anything in the Host logs that might be useful for us?

None

System information

version	core-2023.5.3
installation_type	Home Assistant OS
dev	false
hassio	true
docker	true
user	root
virtualenv	false
python_version	3.10.11
os_name	Linux
os_version	6.1.25
arch	x86_64
timezone	Australia/Sydney
config_dir	/config

Additional information

# ha network info
docker:
  address: 172.30.32.0/23
  dns: 172.30.32.3
  gateway: 172.30.32.1
  interface: hassio
host_internet: true
interfaces:
- connected: true
  enabled: true
  interface: enp2s1
  ipv4:
    address:
    - 192.168.178.52/24
    gateway: 192.168.178.1
    method: auto
    nameservers:
    - 192.168.178.1
    ready: true
  ipv6:
    address:
    - 2001:aab8:4166:ad01:6f7d:1fdb:1638:350f/64
    - fd86:92e7:d139:961b:115f:8bf2:dd4:20c1/64
    - fe80::9707:fb1f:65a3:84ab/64
    gateway: fe80::e228:6dff:fe88:bdf2
    method: auto
    nameservers:
    - fd00::e228:6dff:fe88:bdf2
    ready: true
  primary: true
  type: ethernet
  vlan: null
  wifi: null
supervisor_internet: true
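For readers less familiar with IPv6 addressing, here is a minimal Python sketch that sorts the three interface addresses listed above by scope. It only uses the standard `ipaddress` module; the `classify` helper and its labels are illustrative names, not part of any Home Assistant tooling.

```python
import ipaddress

# Scope is decided by plain prefix membership, so the result does not depend
# on which special-purpose ranges a given Python version knows about.
LINK_LOCAL = ipaddress.ip_network("fe80::/10")
UNIQUE_LOCAL = ipaddress.ip_network("fc00::/7")
GLOBAL_UNICAST = ipaddress.ip_network("2000::/3")


def classify(cidr: str) -> str:
    """Return a coarse scope label for an address, with or without a /prefix."""
    addr = ipaddress.ip_interface(cidr).ip
    if addr in LINK_LOCAL:
        return "link-local"
    if addr in UNIQUE_LOCAL:
        return "unique-local"
    if addr in GLOBAL_UNICAST:
        return "global"
    return "other"


# Addresses copied from the `ha network info` output above.
addresses = [
    "2001:aab8:4166:ad01:6f7d:1fdb:1638:350f/64",
    "fd86:92e7:d139:961b:115f:8bf2:dd4:20c1/64",
    "fe80::9707:fb1f:65a3:84ab/64",
]
scopes = {cidr: classify(cidr) for cidr in addresses}
for cidr, scope in scopes.items():
    print(f"{scope:13} {cidr}")
```

Only the address classified as global can be used as a source for internet traffic; the gateway itself is a link-local address, which is normal for routes learned from router advertisements.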

darkxst commented 1 year ago

Also see https://github.com/home-assistant-libs/python-matter-server/issues/284 for example

agners commented 1 year ago

Can you check what google.com resolves to? Does it resolve to different addresses when querying multiple times?

resolvectl query google.com

Such intermittent reachability issues could also be caused by lower-level issues (switching). Maybe use a different port or connect to the router directly to rule out such issues.

Also, check if the routes change from time to time. That might indicate some type of issue as well (e.g. if another router announces itself and is deemed the default router from time to time).

ip -6 route

darkxst commented 1 year ago

It always resolves to the addresses below; I don't believe this changes.

# resolvectl query google.com
google.com: 142.250.70.206                     -- link: enp2s1
            2404:6800:4015:800::200e           -- link: enp2s1

The default route points at the correct link-local address of my router:

default via fe80::e228:6dff:fe88:bdf2 dev enp2s1 proto ra metric 100 pref medium

I've not seen such intermittent issues on any other OSes, however I will try connecting directly to the router.

Poking at the Wireshark capture on the router, I did notice Router Solicitation messages that didn't appear to be getting answered.

agners commented 1 year ago

How often are the pings failing? Is it regularly reproducible? Next time pinging google.com fails, can you try pinging the IPv6 address directly (2404:6800:4015:800::200e) to see if that makes a difference?

darkxst commented 1 year ago

100% reproducible on boot. Pinging the router fixes it for an hour or two, then it breaks again.

I've tried pinging IPs directly; same issue.

agners commented 1 year ago

Did that used to be a problem with HAOS 9.5?

We changed the IPv6 Neighbor Discovery Protocol behavior to act more like a desktop, in the sense that it should detect unreachable routers quickly (see https://github.com/home-assistant/operating-system/pull/2434).

Can you compare ip -6 neigh before and after pinging the router?

agners commented 1 year ago

I assume resolvectl query fritz.box resolves to the link-local address?

darkxst commented 1 year ago

(before)
# ip -6 neigh
fe80::39bb:5c1c:9ea7:dcfe dev enp2s1 lladdr 64:49:7d:8d:ac:9d router STALE 
fe80::3374:88b2:689:1379 dev enp2s1 lladdr 00:0c:29:6f:dc:56 STALE 
fe80::e228:6dff:fe88:bdf2 dev enp2s1 lladdr e0:28:6d:88:bd:f2 router REACHABLE 

(after, there are additional routers)
# ip -6 neigh
fe80::39bb:5c1c:9ea7:dcfe dev enp2s1 lladdr 64:49:7d:8d:ac:9d router REACHABLE 
fe80::3374:88b2:689:1379 dev enp2s1 lladdr 00:0c:29:6f:dc:56 STALE 
2001:xxxx:xxxx:4166:ad01:e228:6dff:fe88:bdf2 dev enp2s1 lladdr e0:28:6d:88:bd:f2 router REACHABLE 
fe80::e228:6dff:fe88:bdf2 dev enp2s1 lladdr e0:28:6d:88:bd:f2 router REACHABLE
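To make the before/after comparison easier to repeat, the following Python sketch parses `ip -6 neigh` output of the shape shown above and reports what changed. The `Neighbour`, `parse_neigh` and `diff_neigh` names are hypothetical helpers written for this illustration, and the parser only handles the fields that appear in this thread.

```python
from typing import NamedTuple


class Neighbour(NamedTuple):
    address: str
    dev: str
    lladdr: str
    is_router: bool
    state: str


def parse_neigh(output: str) -> dict:
    """Parse `ip -6 neigh` lines into a dict keyed by neighbour address."""
    table = {}
    for line in output.splitlines():
        tokens = line.split()
        if len(tokens) < 4 or tokens[1] != "dev":
            continue  # skip blank lines and anything unexpected
        lladdr = tokens[tokens.index("lladdr") + 1] if "lladdr" in tokens else ""
        table[tokens[0]] = Neighbour(
            tokens[0], tokens[2], lladdr, "router" in tokens, tokens[-1]
        )
    return table


def diff_neigh(before: dict, after: dict) -> list:
    """List new entries, state changes and removed entries."""
    changes = []
    for address, entry in after.items():
        old = before.get(address)
        if old is None:
            changes.append(f"new: {address} ({entry.state})")
        elif old.state != entry.state:
            changes.append(f"changed: {address} {old.state} -> {entry.state}")
    for address in sorted(before.keys() - after.keys()):
        changes.append(f"gone: {address}")
    return changes


# The two snapshots from this comment, copied as posted (redaction included).
BEFORE = """\
fe80::39bb:5c1c:9ea7:dcfe dev enp2s1 lladdr 64:49:7d:8d:ac:9d router STALE
fe80::3374:88b2:689:1379 dev enp2s1 lladdr 00:0c:29:6f:dc:56 STALE
fe80::e228:6dff:fe88:bdf2 dev enp2s1 lladdr e0:28:6d:88:bd:f2 router REACHABLE
"""
AFTER = """\
fe80::39bb:5c1c:9ea7:dcfe dev enp2s1 lladdr 64:49:7d:8d:ac:9d router REACHABLE
fe80::3374:88b2:689:1379 dev enp2s1 lladdr 00:0c:29:6f:dc:56 STALE
2001:xxxx:xxxx:4166:ad01:e228:6dff:fe88:bdf2 dev enp2s1 lladdr e0:28:6d:88:bd:f2 router REACHABLE
fe80::e228:6dff:fe88:bdf2 dev enp2s1 lladdr e0:28:6d:88:bd:f2 router REACHABLE
"""

changes = diff_neigh(parse_neigh(BEFORE), parse_neigh(AFTER))
for change in changes:
    print(change)
```

Addresses are compared as plain strings, so redacted entries such as the `2001:xxxx` one are handled without needing to be valid IPv6 literals.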

It actually resolves to the public IP on the /64 subnet as well

# resolvectl query fritz.box
fritz.box: 192.168.178.1                       -- link: enp2s1
           fd00::e228:6dff:fe88:bdf2           -- link: enp2s1
           2001:xxxx:xxxx:ad01:e228:6dff:fe88:bdf2 -- link: enp2s1

darkxst commented 1 year ago

Did that used to be a problem with HAOS 9.5?

To be honest I never noticed any issue until recently, when the Matter add-on started failing and, around the same time, blueprint imports started timing out. However, I did quickly test 9.5 in a VM over the weekend and it had the same issue.

agners commented 1 year ago

Hm, I wonder if the global address is used in the after case, and that makes the router route the packets properly.

That said, link-local as router address should work.

What other Linux-based systems did you test (and what version)? Can you double check that this indeed only happens with HAOS?

agners commented 1 year ago

@Jc2k maybe you have some ideas what this could be?

Jc2k commented 1 year ago

If possible I'd like to see complete snapshots of ip -6 a s, ip -6 route and ip -6 neigh taken close together when things are working and not working. And confirm which router IP you are pinging (assuming the GUA, but I want to confirm everything in one post).

darkxst commented 1 year ago

@jc2k please see attached outputs HA on clean boot. ha_ipv6logs_before.txt

HA after pinging router on the global 2001:xxxx address (pings to the router link local fail on HA VM) ha_ipv6logs_after.txt

For reference a Debian Unstable VM on clean boot, ipv6 seemingly works fine here, however I am seeing duplicate responses to pings debian_unstable_ipv6.txt

Jc2k commented 1 year ago

Can you post the same before and after logs for the "working" system?

Was it a debian server install, and were you using /etc/network/interfaces or systemd-networkd to configure its networking? Could you test with a desktop Linux VM that uses NetworkManager? I assume GNOME on Debian does. Ubuntu GNOME desktop definitely does. (HAOS uses NetworkManager, so if we can cause NetworkManager to fail in another distro it would be useful for isolating the problem.)

You have no route table changes (I wanted to rule out some sort of ICMP redirect, which can inject temporary routes).

vethee34b97 did disappear between runs; that's presumably a container exiting. I can't see how that's related.

Looking at the route table:

The "metrics" for the default ipv6 route seems very high. Common values are 100 or 256. I don't think I've seen 20100 before. While I don't think that's the root cause, it makes me want to dig into where that is coming from - do you know if your router is doing DHCP6 or is it doing route advertisements? ("proto ra" indicates route advertisements, but that can be wrong).

2001:xxxx:xxxx:ad00::/64 dev enp2s1 proto ra metric 100 pref medium
2001:xxxx:xxxx:ad00::/56 via fe80::e228:6dff:fe88:bdf2 dev enp2s1 proto ra metric 100 pref medium
...
default via fe80::e228:6dff:fe88:bdf2 dev enp2s1 proto ra metric 20100 pref medium

The 2nd route threw me, but I think it's fine. It's superfluous: the /64 route has a higher priority than it, and it's (apart from the metric) identical to the default route.
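The priority argument above rests on longest-prefix matching: among routes that contain the destination, the most specific prefix wins before metrics are compared. A minimal Python sketch of that selection follows. It is an illustration only: `2001:db8::/32` is the documentation prefix standing in for the redacted `2001:xxxx:xxxx` prefix, `fe80::1` stands in for the router, and `select_route` is a hypothetical helper, not how the kernel is implemented.

```python
import ipaddress

# Placeholder route table mirroring the three routes quoted above.
ROUTES = [
    ("2001:db8:0:ad00::/64", "on-link"),
    ("2001:db8:0:ad00::/56", "via fe80::1"),
    ("::/0", "via fe80::1"),
]


def select_route(destination: str) -> tuple:
    """Return (prefix, next hop) of the most specific matching route."""
    dest = ipaddress.ip_address(destination)
    candidates = []
    for prefix, next_hop in ROUTES:
        network = ipaddress.ip_network(prefix)
        if dest in network:
            candidates.append((network, next_hop))
    # The default route ::/0 always matches, so candidates is never empty.
    network, next_hop = max(candidates, key=lambda item: item[0].prefixlen)
    return str(network), next_hop


print(select_route("2001:db8:0:ad00::1234"))     # a host on the LAN /64
print(select_route("2001:db8:0:ad42::1"))        # elsewhere in the /56
print(select_route("2404:6800:4015:800::200e"))  # an internet destination
```

Under these assumptions the /56 route only ever catches traffic for the other subnets of the delegated prefix, and it hands that traffic to the same next hop as the default route, which is why it is harmless.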

The only thing that seems to change is that neighbour cache entry, which is already discussed:

2001:xxxx:xxxx:ad00:e228:6dff:fe88:bdf2 dev enp2s1 lladdr e0:28:6d:88:bd:f2 router STALE

Which obviously exists because you pinged it.

To be absolutely crystal clear, you find that after pinging the router, the link local address works, and routing normal ipv6 traffic works. After the neighbour cache record expires, this stops and no traffic is routed again?

Jc2k commented 1 year ago

The duplicate pings do sound like you have wider network problems though.

Given you have the same result on 2 different ipv6 addresses and it's the fritzbox, it seems unlikely that it's a configuration error (a common cause of dupes is just mis-addressing a device or VM).

It's also interesting that the ttl's don't match. The ttl is decremented on every hop - this implies the icmp packet was both "bridged" AND "routed"
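As a rough illustration of that reasoning, the Python sketch below estimates how many routers a reply crossed from its hop limit (which ping prints as ttl) and groups duplicate replies by sequence number. The initial values 64, 128 and 255 are common defaults and an assumption here, the sample replies are invented for the example rather than taken from the attached logs, and both function names are hypothetical.

```python
from collections import defaultdict

# Common initial hop limits; which one a sender really used is an assumption.
COMMON_INITIAL_HOP_LIMITS = (64, 128, 255)


def estimate_hops(observed: int) -> int:
    """Guess the number of routers crossed from an observed hop limit."""
    initial = min(v for v in COMMON_INITIAL_HOP_LIMITS if v >= observed)
    return initial - observed


def duplicate_paths(replies: list) -> dict:
    """Map icmp_seq -> sorted hop estimates, for sequences answered twice or more."""
    by_seq = defaultdict(list)
    for icmp_seq, ttl in replies:
        by_seq[icmp_seq].append(estimate_hops(ttl))
    return {seq: sorted(hops) for seq, hops in by_seq.items() if len(hops) > 1}


# Hypothetical (icmp_seq, ttl) pairs: sequence 1 was answered twice.
sample = [(1, 64), (1, 63), (2, 64)]
print(duplicate_paths(sample))
```

Two replies to the same request with different hop estimates suggest the packet took two paths, one of which crossed an extra router.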

Can you ping the ipv6 address of your HAOS box from that VM and get dupes? What about e.g. google? Do you get dupes for your router/haos/google on ipv4 addresses?

darkxst commented 1 year ago

Was it a debian server install

It is Debian with GNOME Desktop, so using systemd and NetworkManager, with all default auto settings for networking. I also have an Ubuntu GNOME-based VM on which ipv6 traffic works fine.

do you know if your router is doing DHCP6 or is it doing route advertisements?

The router (fritz box 7490) is set up with default settings for native ipv6. I believe it is only using DHCP6 for DNS addresses and otherwise using router advertisements. I have a /56 static prefix on my connection, but the router carves the 2001:xxxx:xxxx:ad00::/64 prefix out of this for the LAN.

To be absolutely crystal clear, you find that after pinging the router, the link local address works, and routing normal ipv6 traffic works. After the neighbour cache record expires, this stops and no traffic is routed again?

After pinging the router, normal routing and ipv6 traffic works. I can never ping the link local address of the router from HA (but can from Debian), and yes, once the cache expires no internet traffic is routed again. At this point I can still ping LAN addresses, however I don't seem to be able to ping the router anymore after the cache expires (only just noticed this).

It's also interesting that the ttl's don't match. The ttl is decremented on every hop - this implies the icmp packet was both "bridged" AND "routed"

The VMs are using bridged network adapters; I don't think these are routed by the host.

Can you ping the ipv6 address of your HAOS box from that VM and get dupes? What about e.g. google? Do you get dupes for your router/haos/google on ipv4 addresses?

There are no dupes when pinging other hosts on my LAN, including the HAOS VM. Google ipv6 has the same dupes though. No dupes ever when pinging ipv4 addresses, and I also don't get dupes pinging ipv6 google from the host Ubuntu machine.

Here are before and afters for the Debian VM debian_unstable_ipv6.txt

debian_unstable_ipv6_after.txt

Jc2k commented 1 year ago

The neighbour cache entries match, and aren't failed.

2001:xxxx:xxxx:ad00::/64 dev ens33 proto ra metric 100 pref medium
2001:xxxx:xxxx:ad00::/56 via fe80::e228:6dff:fe88:bdf2 dev ens33 proto ra metric 100 pref medium
...
default via fe80::e228:6dff:fe88:bdf2 dev ens33 proto ra metric 100 pref medium

vs

2001:xxxx:xxxx:ad00::/64 dev enp2s1 proto ra metric 100 pref medium
2001:xxxx:xxxx:ad00::/56 via fe80::e228:6dff:fe88:bdf2 dev enp2s1 proto ra metric 100 pref medium
...
default via fe80::e228:6dff:fe88:bdf2 dev enp2s1 proto ra metric 20100 pref medium

The only thing that stands out is the metric being weird. Is it actually 20100, or was it the victim of a find and replace when removing personal identifiers?

Nothing is standing out, i would expect this to work.

Depending on the virtualisation stack you are using, I would be looking into tcpdump on the host now. By watching the host bridge you should be able to answer: Is the traffic leaving HAOS at all? It could even be that a reply is getting to the VM host, but not the guest.

Likewise, for the "healthy" but not really healthy VM, you should be able to verify that only one ping is leaving the VM and that 2 replies are arriving at your VM host.

If you can get pcaps of failing pings as seen on the VM host, that would be ideal.

agners commented 1 year ago

This very much sounds like a L2 issue to me. Do you happen to have multiple network interfaces on your VM host?

Also check that all VMs have their own MAC address (sometimes these get cloned accidentally).

djandrew2005 commented 1 year ago

Same issue here

darkxst commented 1 year ago

Is it actually 20100

yes this is real not a search and replace error

I will try to mess around with tcpdump and pcaps a bit later

darkxst commented 1 year ago

Is it actually 20100

I think this is caused by NetworkManager de-prioritising the route as it failed some connectivity check, so it adds 20000 to the metric.
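A small Python sketch of how one might spot such a de-prioritised default route in `ip -6 route` output. The 20000 penalty is taken from the explanation in this comment and treated as an assumption rather than verified against the NetworkManager source, and the function names are made up for the example.

```python
import re

# Assumption from the comment above: a failed connectivity check adds 20000
# to the metric NetworkManager would otherwise assign.
CONNECTIVITY_PENALTY = 20000


def default_route_metrics(route_output: str) -> list:
    """Collect the metric of every default route line."""
    metrics = []
    for line in route_output.splitlines():
        match = re.search(r"^default\b.*\bmetric (\d+)", line.strip())
        if match:
            metrics.append(int(match.group(1)))
    return metrics


def looks_penalised(metric: int) -> bool:
    """True if the metric is at or above the assumed penalty offset."""
    return metric >= CONNECTIVITY_PENALTY


# Default routes as posted earlier in this thread.
haos = "default via fe80::e228:6dff:fe88:bdf2 dev enp2s1 proto ra metric 20100 pref medium"
debian = "default via fe80::e228:6dff:fe88:bdf2 dev ens33 proto ra metric 100 pref medium"

for name, output in (("HAOS", haos), ("Debian", debian)):
    for metric in default_route_metrics(output):
        print(name, metric, "penalised" if looks_penalised(metric) else "normal")
```

If the assumption holds, subtracting the penalty from the HAOS metric gives 100, the same value the Debian VM shows.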

Likewise, for the "healthy" but not really healthy VM, you should be able to verify that only one ping is leaving the VM and that 2 replies are arriving at your VM host.

I have not been able to reproduce the duplicate ping responses again..

Do you happen to have multiple network interfaces on your VM host?

Only one physical network interface, and a bunch of virtual ones such as docker.

djandrew2005 commented 1 year ago

I selected "Allow All" on Promiscuous Mode (VM Settings -> Network -> Advanced) and now ipv6 github certs are reachable

agners commented 1 year ago

@darkxst

What operating system image do you use?

generic-x86-64 (Generic UEFI capable x86-64 systems)

Maybe you've chosen the wrong option here, but just FYI: for all virtualization environments the OVA image is recommended.

What Hypervisor and version are you using?

@djandrew2005

Same issue here

Which part of this issue exactly was the same for you (did GitHub not work at first and then started working after pinging your router)? What Hypervisor and version are you using?

darkxst commented 1 year ago

I am using VMware 17.0.2 on a Linux host.

Yes, I selected the wrong option. I would have installed HA using the OVA vmdk image; that would have been v9, later upgraded to v10.

agners commented 1 year ago

@darkxst can you try @djandrew2005's suggestion? Since this option influences L2 behavior, I can imagine that this could also change things in your case.

darkxst commented 1 year ago

I don't see the setting that djandrew mentions; I think that might be a Windows host thing.

Jc2k commented 1 year ago

It's not a Windows thing; ESXi at least has it. It is a little hidden:

https://kb.vmware.com/s/article/1004099

agners commented 1 year ago

@darkxst have you been able to find that setting or otherwise solve this problem?

darkxst commented 1 year ago

Not solved yet, I was not able to find that setting on the Linux version. From what little I could find, I believe promiscuous mode should be enabled provided the user has access to /dev/vmnet0.

I will dig into this again soon...

darkxst commented 1 year ago

After a recent firmware update to my router, I can now reproduce this issue in Debian/Ubuntu. No idea what is happening but I ended up solving this by adding an IPv6 NDP proxy on the VM host.
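The comment does not say how the NDP proxy was set up. One common approach on a Linux host is the kernel's built-in proxy NDP, and the Python sketch below only assembles the commands for that approach without running them. The interface name `eth0` and the guest address are placeholders, `ndp_proxy_commands` is a hypothetical helper, and whether this matches the setup used here is an assumption.

```python
import shlex


def ndp_proxy_commands(uplink: str, guest_addresses: list) -> list:
    """Build (but do not run) commands enabling kernel proxy NDP on `uplink`."""
    dev = shlex.quote(uplink)
    # Let the host answer Neighbor Solicitations on behalf of proxied addresses.
    commands = [f"sysctl -w net.ipv6.conf.{dev}.proxy_ndp=1"]
    for address in guest_addresses:
        # One proxy entry per guest address the host should answer for.
        commands.append(f"ip -6 neigh add proxy {shlex.quote(address)} dev {dev}")
    return commands


# Placeholder values: a documentation-prefix address standing in for a guest.
commands = ndp_proxy_commands("eth0", ["2001:db8:0:ad00::52"])
for command in commands:
    print(command)
```

The printed commands would need root to run and do not persist across reboots; a daemon such as ndppd is another option that this sketch does not cover.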

agners commented 1 year ago

On non-virtualized Debian/Ubuntu or virtualized on the same VMware 17.0.2 virtualization host?

Maybe also worth escalating with VMware :thinking:

github-actions[bot] commented 1 year ago

There hasn't been any activity on this issue recently. To keep our backlog manageable we have to clean old issues, as many of them have already been resolved with the latest updates. Please make sure to update to the latest Home Assistant OS version and check if that solves the issue. Let us know if that works for you by adding a comment 👍 This issue has now been marked as stale and will be closed if no further activity occurs. Thank you for your contributions.

home-assistant / operating-system