project-chip / rs-matter

Rust implementation of the Matter protocol. Status: Experimental
Apache License 2.0

IPv6-specific "Host is unreachable" error that exits the matter runtime #100

Closed jasta closed 11 months ago

jasta commented 11 months ago

Environment
- Chip: ESP32-C3-MINI-1
- Hardware: ESP32-C3-DevKitM-1
- Platform: esp-idf (Rust std)

Problem
I likely have something misconfigured on my network that causes IPv6 broadcasts to fail with a surprising "Host is unreachable" error. The more important issue, however, is that the way the master future is structured in my example (and in onoff_light) causes the entire Matter runtime to effectively shut down and never restart.

An abridged version of the log shows the issue:

I (7607) rs_matter::transport::core: Comissioning started
I (7617) rs_matter::transport::core: Creating queue for 1 exchanges
I (7617) rs_matter::transport::core: Creating 8 handlers
I (7627) rs_matter::transport::core: Handlers size: 9992
I (7637) rs_matter::transport::core: Transport: waiting for incoming packets
I (7647) rs_matter::transport::udp::async_io: Listening on [::]:5353
I (7647) rs_matter::transport::udp::async_io: Joined IPV6 multicast ff02::fb/2
I (7657) rs_matter::transport::udp::async_io: Joined IP multicast 224.0.0.251/192.168.86.32
I (7667) rs_matter::mdns::builtin: Broadcasting mDNS entry to 224.0.0.251:5353
I (7687) rs_matter::mdns::builtin: Broadcasting mDNS entry to ff02::fb:5353
W (7697) rs_matter::transport::udp::async_io: Error on the network: Os { code: 118, kind: HostUnreachable, message: "Host is unreachable" }
Error: Error::Network

The last line in particular appears to be coming from the master future in the onoff light example: https://github.com/project-chip/rs-matter/blob/main/examples/onoff_light/src/main.rs#L165

This is "fixed" for me by simply disabling IPv6, but I do think it highlights some bigger issues with error-handling robustness inside the runtime. In particular, I'd expect the IPv4 and IPv6 behaviour to live in separate futures that can error out independently, so that one reaching a terminal state doesn't take down the other. Further, some measure of error-handling policy seems appropriate ("Host is unreachable", for example, should probably be retryable). I could take a crack at a patch, but based on the current state of the code I worry it might be a bit intrusive. Any guidance from the maintainers before I get started would be greatly appreciated!
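A sketch of the kind of retry policy I have in mind (names here are illustrative, not rs-matter's actual API; the raw error codes are the POSIX ENETUNREACH/EHOSTUNREACH values on Linux, plus the code 118 that the esp-idf log above reported):

```rust
use std::io::Error;

// Illustrative policy: decide which raw OS errors are transient enough to
// retry instead of tearing down the whole transport. 101/113 are
// ENETUNREACH/EHOSTUNREACH on Linux; 118 is the "Host is unreachable"
// code seen in the esp-idf log above.
fn is_retryable(e: &Error) -> bool {
    matches!(e.raw_os_error(), Some(101) | Some(113) | Some(118))
}

// Retry a fallible network operation up to `max_retries` times. A real
// implementation would run one such loop per address family (IPv4 and
// IPv6 as separate futures) and back off between attempts.
fn run_with_retry<F>(mut op: F, max_retries: u32) -> Result<(), Error>
where
    F: FnMut() -> Result<(), Error>,
{
    let mut retries = 0;
    loop {
        match op() {
            Ok(()) => return Ok(()),
            Err(e) if is_retryable(&e) && retries < max_retries => retries += 1,
            Err(e) => return Err(e),
        }
    }
}

fn main() {
    // Simulate a broadcast that fails twice with "Host is unreachable"
    // (code 118, as in the log), then succeeds.
    let mut failures = 2;
    let result = run_with_retry(
        || {
            if failures > 0 {
                failures -= 1;
                Err(Error::from_raw_os_error(118))
            } else {
                Ok(())
            }
        },
        5,
    );
    assert!(result.is_ok());
}
```

The point is that a single unreachable broadcast destination would then degrade one address family instead of exiting the whole runtime.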

Thanks again for this awesome project, it's renewed my interest big time in IoT :)

ivmarkov commented 11 months ago

The mDNS responder is doing broadcasting. In other words, it is not sending the UDP packet to a specific host, but rather to the well-known multicast address ff02::fb, which should always be reachable.

Not using IPv6 is less than ideal, to put it mildly. The way Matter is implemented in the field (Google Home, and I suspect others), it requires IPv6 connectivity: link-local IPv6 addresses suffice, but those are necessary. Moreover, the mDNS responder also needs IPv6 support; without it, I was not able to get Google Home provisioning to complete.

So where I'm going with this is: if IPv6 (including broadcasting) does not work for you, a hard failure is probably OK for now. Why it fails is another story, as per the above. Can you pinpoint the exact line of code where it fails?

jasta commented 11 months ago

> So where I'm going with this is: if IPv6 (including broadcasting) does not work for you, a hard failure is probably OK for now. Why it fails is another story, as per the above. Can you pinpoint the exact line of code where it fails?

Ack'd, I'll dig a little deeper into why this isn't working. My network should definitely support IPv6 (it's a Google WiFi mesh with no custom configuration), which makes me think the fault lies somewhere in the rs-matter code, but we'll see...

ivmarkov commented 11 months ago

Might be... note that with link-local IPv6 there is no need for any explicit "IPv6 support" per se. That is, there is no DHCP, and you don't need a gateway either.

What ESP IDF version are you using with the example? It should be 4.4.x, which I know works, unless you've explicitly changed it...

jasta commented 11 months ago

> Might be... note that with link-local IPv6 there is no need for any explicit "IPv6 support" per se. That is, there is no DHCP, and you don't need a gateway either.

Ack'd that's good context for the debugging.

> What ESP IDF version are you using with the example? It should be 4.4.x, which I know works, unless you've explicitly changed it...

I'm using 5.0.x, but I can try dropping back to 4.4.x to confirm that's the issue. Another good clue, thanks!
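For reference, esp-idf-sys-based builds typically pin the ESP-IDF version through the `ESP_IDF_VERSION` environment variable; a sketch of what dropping back might look like in the example's `.cargo/config.toml` (double-check the exact key against the esp-idf-sys version in use):

```toml
[env]
# Pin the ESP-IDF version that esp-idf-sys builds against.
# "release/v4.4" tracks the 4.4 branch; a tag like "v4.4.4" also works.
ESP_IDF_VERSION = "release/v4.4"
```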

jasta commented 11 months ago

Confirmed that 4.4.x fixes this specific issue. I'll try to dig deeper into why 5.x would be broken in this way.

jasta commented 11 months ago

Digging deeper into why this doesn't work in release/v5.0 (and, I presume, v5.1, though that fails to compile with esp-idf-svc), the story seems really hairy. I think Espressif might've broken something when attempting to backport fixes to v4.4. After many hours of debugging, I am fairly confident the offending code is:

- espressif/esp-idf/components/lwip @ release/v4.4: https://github.com/espressif/esp-idf/tree/release/v4.4/components/lwip (relevant lines: https://github.com/espressif/esp-lwip/blob/4f24c9baf9101634b7c690802f424b197b3bb685/src/core/ipv6/ip6.c#L175-L185)

- espressif/esp-idf/components/lwip @ release/v5.0: https://github.com/espressif/esp-idf/tree/release/v5.0/components/lwip (relevant lines: https://github.com/espressif/esp-lwip/blob/8dad8d3ee66840deee4acfc1601de4e396c594be/src/core/ipv6/ip6.c#L175-L177)

No idea why these differ or which actual diff introduced the inconsistency. The v4.4 branch of esp-lwip has only one commit, and it seems unrelated; maybe somebody squashed a big merge into one commit (possibly by accident?). Even weirder, I can't find any evidence of upstream lwip or esp-lwip having code like this. There's also support in the v4.4 branch for IPV6_MULTICAST_IF (which would probably also fix the issue rs-matter is seeing), but that support isn't in upstream, in v5.0/v5.1, or really anywhere else I can see...

ivmarkov commented 11 months ago

My hypothesis: Rust std's join_multicast_v6 is (partially) broken on the ESP IDF (join_multicast_v4 was totally broken, and I had to fix it a while back) in that it likely does not set the proper IPv6 network interface. It therefore hits the "fallback paths" in ESP IDF 4.4 and 5.0, which try to derive a network interface (and then use the default one on 4.4, and fail on 5.0).

One test we can try is to "manually" re-implement join_multicast_v6 here, as I did for join_multicast_v4. If it works, the next step is to upstream into libc the correct signatures for setsockopt, the associated constants, and maybe the ipv6_mreq structure.
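A Unix-flavored sketch of what such a manual re-implementation could look like. The constants and struct layout below are the *Linux* values; lwip / ESP IDF defines its own numbers, and getting those right per target is exactly the part that would need correct libc definitions:

```rust
use std::net::UdpSocket;
use std::os::fd::AsRawFd;

// Hand-declared FFI surface for the experiment (Linux values; lwip and
// other platforms differ).
#[repr(C)]
struct Ipv6Mreq {
    ipv6mr_multiaddr: [u8; 16], // the group address, e.g. ff02::fb
    ipv6mr_interface: u32,      // interface index; 0 = let the stack pick
}

const IPPROTO_IPV6: i32 = 41;
const IPV6_JOIN_GROUP: i32 = 20; // a.k.a. IPV6_ADD_MEMBERSHIP on Linux

extern "C" {
    fn setsockopt(
        fd: i32,
        level: i32,
        optname: i32,
        optval: *const core::ffi::c_void,
        optlen: u32,
    ) -> i32;
}

fn join_multicast_v6_manual(sock: &UdpSocket, group: [u8; 16], ifindex: u32) -> std::io::Result<()> {
    let mreq = Ipv6Mreq { ipv6mr_multiaddr: group, ipv6mr_interface: ifindex };
    let rc = unsafe {
        setsockopt(
            sock.as_raw_fd(),
            IPPROTO_IPV6,
            IPV6_JOIN_GROUP,
            &mreq as *const Ipv6Mreq as *const core::ffi::c_void,
            core::mem::size_of::<Ipv6Mreq>() as u32,
        )
    };
    if rc == 0 { Ok(()) } else { Err(std::io::Error::last_os_error()) }
}

fn main() {
    // ff02::fb, the mDNS group. Passing ifindex 0 is the fragile part the
    // lwip fallback paths trip over; a correct caller would pass the real
    // interface index.
    let group = [0xff, 0x02, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0xfb];

    // The join itself needs a network, so don't treat failure as fatal in
    // this sketch (e.g. sandboxes without IPv6 will fail to bind or join).
    if let Ok(sock) = UdpSocket::bind("[::]:0") {
        match join_multicast_v6_manual(&sock, group, 0) {
            Ok(()) => println!("joined ff02::fb"),
            Err(e) => println!("join failed: {e}"),
        }
    }
}
```

If this variant works where std's join_multicast_v6 fails, that would confirm the interface-index hypothesis.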

jasta commented 11 months ago

> My hypothesis: Rust std's join_multicast_v6 is (partially) broken on the ESP IDF (join_multicast_v4 was totally broken, and I had to fix it a while back) in that it likely does not set the proper IPv6 network interface. It therefore hits the "fallback paths" in ESP IDF 4.4 and 5.0, which try to derive a network interface (and then use the default one on 4.4, and fail on 5.0).

I think you're right. I found a commit indicating that the behavior I identified in 4.4 is actually wrong according to the standard; they tried to fix it but seemingly regressed this other behavior we care about. I'll do some more digging and see whether any workarounds exist.

> One test we can try is to "manually" re-implement join_multicast_v6 here, as I did for join_multicast_v4. If it works, the next step is to upstream into libc the correct signatures for setsockopt, the associated constants, and maybe the ipv6_mreq structure.

I don't think IPV6_JOIN_GROUP even has the correct support in lwip to set the multicast interface as it probably should. So there are two unknowns we need to work out:

  1. How, if at all, can we work around this issue in newer lwip? I have been pretty deep in the code, and I don't see any obvious hack that'll work, given that IPV6_MULTICAST_IF support was mysteriously removed.

  2. What is the proper upstream fix, so that we can remove whatever hack we find in (1)? The hard thing to discern from the lwip code is what the intended behaviour even is. That is, on the Linux and macOS stacks, is join_multicast_v6 supposed to enable routing of multicast destination IPs? Or are we expected to call setsockopt with IPV6_MULTICAST_IF? Or something else I haven't considered? In other words, which exact behaviour does lwip get wrong?

I'll think on this a little more and see if I can find something...

jasta commented 11 months ago

Nope, you were right: I think the zone flag isn't being set in the ip6_addr struct, which is causing the route to fail. I'll prep a patch soon to fix it.

jasta commented 11 months ago

After some digging, I have good news. I believe the issue is that in lwip you have to call ip6_addr_set_zone on an ip6_addr_t (which has an extra u8 zone field at the end) that is then used to route packets. From Rust, this is achieved via the scope_id field in SocketAddrV6. I believe this should be required on all platforms; it's just very likely that Linux has a less fragile heuristic to figure it out for you.

See the discussion on the scope_id field here: https://datatracker.ietf.org/doc/html/rfc2553#section-3.3.

So, good news addressing my unknowns above:

  1. We can just pass scope_id into the SocketAddr we use for send_to, confirmed this works.
  2. Nothing is needed upstream; IPv6 scopes are implemented properly in newer versions of lwip (found in esp-idf 5.x).
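Concretely, (1) is tiny on the Rust side. A minimal sketch (the interface index 2 below is a placeholder for illustration; a real device would look up the index of its active netif rather than hardcoding it):

```rust
use std::net::{Ipv6Addr, SocketAddrV6};

fn main() {
    // ff02::fb is the well-known IPv6 mDNS multicast group.
    let mdns_v6: Ipv6Addr = "ff02::fb".parse().unwrap();

    // scope_id carries the interface index ("zone") the packet should
    // leave on. The value 2 is a placeholder; look up the real netif
    // index on an actual device.
    let dest = SocketAddrV6::new(mdns_v6, 5353, 0, 2);
    assert_eq!(dest.scope_id(), 2);

    // A UdpSocket::send_to(&buf, dest) with this address lets lwip fill
    // in the zone field of its ip6_addr_t, so routing no longer guesses.
    println!("would send to {dest}");
}
```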

I'll prep a PR to fix this.