darkxst closed this issue 1 year ago.
Also see https://github.com/home-assistant-libs/python-matter-server/issues/284 for example
Can you check what google.com resolves to? Does it resolve to different addresses when querying multiple times?
resolvectl query google.com
Such intermittent reachability issues could also be caused by lower-level problems (e.g. switching). Maybe use a different port or connect to the router directly to rule such issues out.
Also, check whether the routes change from time to time. That might indicate some type of issue as well (e.g. if another router announces itself and is deemed the default router from time to time).
ip -6 route
It always resolves to the address below; I don't believe this changes.
# resolvectl query google.com
google.com: 142.250.70.206 -- link: enp2s1
2404:6800:4015:800::200e -- link: enp2s1
The default route, which is the correct link-local address of my router:
default via fe80::e228:6dff:fe88:bdf2 dev enp2s1 proto ra metric 100 pref medium
I've not seen such intermittent issues on any other OSes, however I will try connecting directly to the router.
Poking at the Wireshark capture on the router, I did notice Router Solicitation messages that didn't appear to be getting answered.
How often are the pings failing? Is it regularly reproducible? Next time pinging google.com fails, can you try pinging the IPv6 address directly (2404:6800:4015:800::200e) to see if that makes a difference?
100% reproducible on boot. Pinging the router fixes it for an hour or two, then it breaks again.
I've tried pinging IPs directly; same issue.
Did that used to be a problem with HAOS 9.5?
We changed the IPv6 Neighbor Discovery Protocol behavior to act more like a desktop, in the sense that it should detect unreachable routers quickly (see https://github.com/home-assistant/operating-system/pull/2434).
Can you compare ip -6 neigh before and after pinging the router?
I assume resolvectl query fritz.box is resolving the link-local address?
(before)
# ip -6 neigh
fe80::39bb:5c1c:9ea7:dcfe dev enp2s1 lladdr 64:49:7d:8d:ac:9d router STALE
fe80::3374:88b2:689:1379 dev enp2s1 lladdr 00:0c:29:6f:dc:56 STALE
fe80::e228:6dff:fe88:bdf2 dev enp2s1 lladdr e0:28:6d:88:bd:f2 router REACHABLE
(after, there are additional routers)
# ip -6 neigh
fe80::39bb:5c1c:9ea7:dcfe dev enp2s1 lladdr 64:49:7d:8d:ac:9d router REACHABLE
fe80::3374:88b2:689:1379 dev enp2s1 lladdr 00:0c:29:6f:dc:56 STALE
2001:xxxx:xxxx:4166:ad01:e228:6dff:fe88:bdf2 dev enp2s1 lladdr e0:28:6d:88:bd:f2 router REACHABLE
fe80::e228:6dff:fe88:bdf2 dev enp2s1 lladdr e0:28:6d:88:bd:f2 router REACHABLE
It actually resolves the public IP on the /64 subnet as well.
# resolvectl query fritz.box
fritz.box: 192.168.178.1 -- link: enp2s1
fd00::e228:6dff:fe88:bdf2 -- link: enp2s1
2001:xxxx:xxxx:ad01:e228:6dff:fe88:bdf2 -- link: enp2s1
Did that used to be a problem with HAOS 9.5?
To be honest I never noticed any issue until recently with failing matter addon and around the same time timeouts importing blueprints. However I did quickly test 9.5 in a VM over the weekend and it was also the same issues.
Hm, I wonder if the global address is used in the after case, and that makes the router properly route the packet.
That said, link-local as router address should work.
What other Linux-based OSes did you test (and what versions)? Can you double-check that this indeed only happens with HAOS?
@Jc2k maybe you have some ideas what this could be?
If possible I'd like to see complete snapshots of ip -6 a s, ip -6 route and ip -6 neigh taken close together when things are working and not working. And confirm which router IP you are pinging (assuming the GUA, but I want to confirm everything in one post).
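A minimal sketch of how those three snapshots could be captured together so the "working" and "not working" states line up (the output filename and layout are illustrative, not something asked for in the thread):

```shell
# Hypothetical helper: dump all three outputs into one timestamped file.
out="ipv6_snapshot_$(date +%Y%m%d_%H%M%S).txt"
{
  echo "== ip -6 a s =="
  ip -6 a s 2>&1
  echo "== ip -6 route =="
  ip -6 route 2>&1
  echo "== ip -6 neigh =="
  ip -6 neigh 2>&1
} > "$out"
echo "wrote $out"
```

Run it once while things work and once after they break, then attach both files.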
@jc2k please see the attached outputs for HA on a clean boot: ha_ipv6logs_before.txt
HA after pinging the router on the global 2001:xxxx address (pings to the router link-local fail on the HA VM): ha_ipv6logs_after.txt
For reference, a Debian Unstable VM on a clean boot; IPv6 seemingly works fine here, however I am seeing duplicate responses to pings: debian_unstable_ipv6.txt
Can you post the same before and after logs for the "working" system?
Was it a Debian server install, and were you using /etc/network/interfaces or systemd-networkd to configure its networking? Could you test with a desktop Linux VM that uses NetworkManager? I assume GNOME on Debian does; Ubuntu GNOME desktop definitely does. (HAOS uses NetworkManager, so if we can cause NetworkManager to fail in another distro it would be useful for isolating the problem.)
You have no route table changes (I wanted to rule out some sort of ICMP redirect, which can inject temporary routes).
vethee34b97 did disappear between runs; that's presumably a container exiting. I can't see how that's related.
Looking at the route table:
The metric for the default IPv6 route seems very high. Common values are 100 or 256; I don't think I've seen 20100 before. While I don't think that's the root cause, it makes me want to dig into where it is coming from. Do you know if your router is doing DHCPv6 or route advertisements? ("proto ra" indicates route advertisements, but that can be wrong.)
2001:xxxx:xxxx:ad00::/64 dev enp2s1 proto ra metric 100 pref medium
2001:xxxx:xxxx:ad00::/56 via fe80::e228:6dff:fe88:bdf2 dev enp2s1 proto ra metric 100 pref medium
...
default via fe80::e228:6dff:fe88:bdf2 dev enp2s1 proto ra metric 20100 pref medium
The second route threw me, but I think it's fine. It's superfluous: the /64 route has a higher priority than it, and it's (apart from the metric) identical to the default route.
The only thing that seems to change is that neighbour cache entry, which is already discussed:
2001:xxxx:xxxx:ad00:e228:6dff:fe88:bdf2 dev enp2s1 lladdr e0:28:6d:88:bd:f2 router STALE
Which obviously exists because you pinged it.
To be absolutely crystal clear: you find that after pinging the router, the link-local address works and routing normal IPv6 traffic works. After the neighbour cache record expires, this stops and no traffic is routed again?
The duplicate pings do sound like you have wider network problems though.
Given you get the same result on two different IPv6 addresses and it's the FRITZ!Box, it seems unlikely that it's a configuration error (a common cause of the dupes is simply mis-addressing a device or VM).
It's also interesting that the TTLs don't match. The TTL is decremented on every hop, so this implies the ICMP packet was both "bridged" AND "routed".
Can you ping the IPv6 address of your HAOS box from that VM and get dupes? What about e.g. Google? Do you get dupes for your router/HAOS/Google on IPv4 addresses?
Was it a Debian server install
It is Debian with GNOME Desktop, so using systemd and NetworkManager, with all default auto settings for networking. I also have an Ubuntu GNOME-based VM on which IPv6 traffic works fine.
Do you know if your router is doing DHCPv6 or route advertisements?
The router (FRITZ!Box 7490) is set up with default settings for native IPv6. I believe it is only using DHCPv6 for DNS addresses and otherwise using router advertisements. I have a /56 static prefix on my connection, but the router carves the 2001::ad00:: /64 prefix out of this for the LAN.
To be absolutely crystal clear: you find that after pinging the router, the link-local address works and routing normal IPv6 traffic works. After the neighbour cache record expires, this stops and no traffic is routed again?
After pinging the router, normal routing and IPv6 traffic works. I can never ping the link-local address of the router from HA (but can from Debian). And yes, once the cache expires no internet traffic is routed again. At this point I can still ping LAN addresses, however I no longer seem to be able to ping the router after the cache expires (I only just noticed this).
It's also interesting that the TTLs don't match. The TTL is decremented on every hop, so this implies the ICMP packet was both "bridged" AND "routed".
The VMs are using bridged network adapters; I don't think these are routed by the host.
Can you ping the IPv6 address of your HAOS box from that VM and get dupes? What about e.g. Google? Do you get dupes for your router/HAOS/Google on IPv4 addresses?
There are no dupes when pinging other hosts on my LAN, including the HAOS VM. Google IPv6 has the same dupes though. There are never dupes pinging IPv4 addresses, and I also don't get dupes pinging IPv6 Google from the host Ubuntu machine.
Here are the before and afters for the Debian VM: debian_unstable_ipv6.txt
The neighbour cache entries match, and aren't failed.
2001:xxxx:xxxx:ad00::/64 dev ens33 proto ra metric 100 pref medium
2001:xxxx:xxxx:ad00::/56 via fe80::e228:6dff:fe88:bdf2 dev ens33 proto ra metric 100 pref medium
...
default via fe80::e228:6dff:fe88:bdf2 dev ens33 proto ra metric 100 pref medium
vs
2001:xxxx:xxxx:ad00::/64 dev enp2s1 proto ra metric 100 pref medium
2001:xxxx:xxxx:ad00::/56 via fe80::e228:6dff:fe88:bdf2 dev enp2s1 proto ra metric 100 pref medium
...
default via fe80::e228:6dff:fe88:bdf2 dev enp2s1 proto ra metric 20100 pref medium
The only thing that stands out is the metric being weird. Is it actually 20100, or was it the victim of a find-and-replace when removing personal identifiers?
Nothing else is standing out; I would expect this to work.
Depending on the virtualisation stack you are using, I would be looking into tcpdump on the host now. By watching the host bridge you should be able to answer: is the traffic leaving HAOS at all? It could even be that a reply is getting to the VM host, but not the guest.
Likewise, for the "healthy" but not really healthy VM, you should be able to verify that only one ping is leaving the VM and that two replies are arriving at your VM host.
If you can get pcaps of failing pings as seen on the VM host, that would be ideal.
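A sketch of that host-side capture, under assumptions not in the thread: the bridge name (br0 here; e.g. vmnet0 for VMware) and the pcap filename are placeholders, and the commands need root.

```shell
# Capture all ICMPv6 traffic crossing the VM host's bridge:
tcpdump -i br0 -n icmp6 -w pings.pcap
# While that runs, ping from the HAOS guest. Then inspect the capture:
tcpdump -n -r pings.pcap
```

Comparing this against a capture taken inside the guest shows whether echo requests leave the guest at all, and whether replies reach the host but never the guest.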
This very much sounds like an L2 issue to me. Do you happen to have multiple network interfaces on your VM host?
Also check that all VMs have their own MAC address (sometimes these get cloned accidentally).
Same issue here
Is it actually 20100
Yes, this is real, not a search-and-replace error.
I will try messing around with tcpdump and pcaps a bit later.
Is it actually 20100
I think this is caused by NetworkManager de-prioritising the route after it failed a connectivity check, so it adds 20000 to the metric.
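For context, a sketch of how that connectivity-check state can be inspected and, on a regular distro, disabled for testing. The config path and keys are stock NetworkManager ones; HAOS manages its own NetworkManager configuration, so this is illustrative rather than a HAOS fix.

```shell
# Ask NetworkManager to re-run its connectivity check and report the result:
nmcli networking connectivity check

# Disable the periodic connectivity check entirely (needs root):
cat <<'EOF' > /etc/NetworkManager/conf.d/90-disable-connectivity.conf
[connectivity]
enabled=false
EOF
systemctl restart NetworkManager
```

If the 20100 metric drops back to 100 with the check disabled, that would confirm the penalty theory without fixing the underlying reachability problem.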
Likewise, for the "healthy" but not really healthy VM, you should be able to verify that only one ping is leaving the VM and that two replies are arriving at your VM host.
I have not been able to reproduce the duplicate ping responses again.
Do you happen to have multiple network interfaces on your VM host?
Only one physical network interface, and a bunch of virtual ones such as docker.
I selected "Allow All" for Promiscuous Mode (VM Settings -> Network -> Advanced) and now GitHub is reachable over IPv6.
@darkxst
What operating system image do you use?
generic-x86-64 (Generic UEFI capable x86-64 systems)
Maybe you've chosen the wrong option here, but just FYI: for all virtualization environments the OVA image is recommended.
What Hypervisor and version are you using?
@djandrew2005
Same issue here
Which part of this issue exactly was the same for you (did GitHub not work at first and then start working after pinging your router)? What hypervisor and version are you using?
I am using VMware 17.0.2 on a Linux host.
Yes, I selected the wrong option. I would have installed HA using the OVA VMDK image; that would have been v9, upgraded to v10, though.
@darkxst can you try @djandrew2005's suggestion? Since this option influences L2 behavior, I can imagine that this could also change things in your case.
I don't see the setting that djandrew mentions; I think that might be a Windows host thing.
It's not a Windows thing; ESXi at least has it. It is a little hidden:
@darkxst have you been able to find that setting or otherwise solve this problem?
Not solved yet, I was not able to find that setting on the Linux version. From what little I could find, I believe promiscuous mode should be enabled provided the user has access to /dev/vmnet0.
I will dig into this again soon...
After a recent firmware update to my router, I can now reproduce this issue in Debian/Ubuntu. No idea what is happening, but I ended up solving this by adding an IPv6 NDP proxy on the VM host.
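darkxst didn't post the exact commands, but an NDP proxy on a Linux VM host typically looks like the sketch below. The uplink interface name and the address (taken from the 2001:db8::/32 documentation prefix) are placeholders for the guest's real global address.

```shell
# Allow the kernel to answer neighbour solicitations on behalf of
# other hosts (needs root):
sysctl -w net.ipv6.conf.all.proxy_ndp=1

# Proxy NDP for the guest's global address on the host's uplink
# interface (both values are placeholders):
ip -6 neigh add proxy 2001:db8:ad00::1234 dev eth0
```

With this in place the host answers the router's neighbour solicitations for the guest, which sidesteps whatever L2 filtering was dropping them on the way to the VM.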
On non-virtualized Debian/Ubuntu, or virtualized on the same VMware 17.0.2 virtualization host?
Maybe also worth escalating with VMware :thinking:
There hasn't been any activity on this issue recently. To keep our backlog manageable we have to clean old issues, as many of them have already been resolved with the latest updates. Please make sure to update to the latest Home Assistant OS version and check if that solves the issue. Let us know if that works for you by adding a comment 👍 This issue has now been marked as stale and will be closed if no further activity occurs. Thank you for your contributions.
Describe the issue you are experiencing
I am having issues with IPv6 internet connectivity from within HA. DNS and IPv6 across the local LAN are working as expected; there is no internet connectivity via IPv6.
I am running this instance in a VM using bridged networking. HA IPv6 is configured to auto. Also tested on a clean install of HA 9.5, which has the same issue.
I have tested IPv6 on my network using Ubuntu, Windows 11 and a variety of Debian/Ubuntu-based VMs, and all of those work perfectly. So I suspect this is an issue with the config in HA, or perhaps just some default settings in Alpine Linux are different.
I have noticed it affecting the following:
What operating system image do you use?
generic-x86-64 (Generic UEFI capable x86-64 systems)
What version of Home Assistant Operating System is installed?
10.1
Did you upgrade the Operating System.
Yes
Steps to reproduce the issue
On boot there is no ipv6 connectivity
If I do an outbound ping from HA to the router, ipv6 starts working for some period of time (if I come back sometime later it will have stopped working again.)
Anything in the Supervisor logs that might be useful for us?
Anything in the Host logs that might be useful for us?
System information
System Information
Home Assistant Cloud
logged_in | false
can_reach_cert_server | ok
can_reach_cloud_auth | ok
can_reach_cloud | ok

Home Assistant Supervisor
host_os | Home Assistant OS 10.1
update_channel | beta
supervisor_version | supervisor-2023.04.1
agent_version | 1.5.1
docker_version | 23.0.3
disk_total | 30.8 GB
disk_used | 20.0 GB
healthy | true
supported | true
board | ova
supervisor_api | ok
version_api | ok
installed_addons | File editor (5.6.0), Terminal & SSH (9.7.0), Studio Code Server (5.5.7), ESPHome (2023.4.4), Mosquitto broker (6.2.1), Silicon Labs Multiprotocol (1.1.2), Matter Server (4.3.1), Cloudflared (4.1.5), Zigbee2MQTT (1.30.4-1), ESPHome (beta) (2023.5.0b3)

Dashboards
dashboards | 2
resources | 0
views | 1
mode | storage

Recorder
oldest_recorder_run | May 5, 2023 at 11:05 PM
current_recorder_run | May 16, 2023 at 4:19 PM
estimated_db_size | 220.26 MiB
database_engine | sqlite
database_version | 3.40.1

Additional information