project-chip / connectedhomeip

Matter (formerly Project CHIP) creates more connections between more objects, simplifying development for manufacturers and increasing compatibility for consumers, guided by the Connectivity Standards Alliance.
https://buildwithmatter.com
Apache License 2.0

IPv6 multicast and broken APs #22380

Open jonsmirl opened 2 years ago

jonsmirl commented 2 years ago

Is there a plan for dealing with APs that don't implement IPv6 correctly and don't work? I may have hit this on my home network. I am working with Espressif to debug it. They have also hit this on their internal networks.

First idea -- commissioning should include a test of IPv6 multicast. Testing this during commissioning allows the device to be marked as having broken IPv6 multicast.

Several ideas come to mind:
1) Fall back to IPv4.
2) Follow up with unicast on group commands. I don't believe Matter knows who the members of the group are, so there is currently no way to follow up with unicast.
3) Use broadcast instead of multicast. Broadcast appears to be working on these broken networks.

My Insteons know their group membership. They send the group command first, and then they use unicast to each node to verify that the group command was received. Matter has another option: it could set up notifications on the grouped devices and then use unicast to fix up a missing notification.

My suspect problem APs are over ten years old. It would be really useful to have some way to identify broken APs. For example: inject multicast packets over Ethernet, then walk around with a phone and make sure you can receive them from all of the APs. I have a mix of APs and the newest ones don't have issues. All of these APs are on a single subnet.

bzbarsky-apple commented 2 years ago

@Abhayakara

jonsmirl commented 2 years ago

At the very least we should test IPv6 multicast during commissioning and then refuse to commission if it is broken. This failure then needs to be communicated to the homeowner somehow, informing them that they need to update their router firmware or possibly buy a new router.

My Cisco/Linksys APs are too old for firmware updates. They are first-gen dual-band models which until today worked fine. I guess I am buying some new routers.

Abhayakara commented 2 years ago

It would help to know more details about what exactly isn't working. I haven't ever actually received a bug report about this, which doesn't mean nobody's having problems, but it is curious that we aren't seeing more reports of problems from the field. We do see occasionally buggy behavior WRT multicast on WiFi mesh routers, but that doesn't seem to be what you're running into here. For the WiFi mesh case, I think the right approach is to get the vendors to fix their broken software, rather than adding complexity to work around the brokenness.

Abhayakara commented 2 years ago

BTW, perhaps a better way to frame this is simply that if you have an AP that has firmware that can't be upgraded, then despite the fact that it's wasteful to chuck things, you're actually putting yourself at risk by continuing to use it. So enabling people to use routers with non-upgradeable firmware seems like something we shouldn't spend effort on.

If it were going to cause a serious problem, then maybe we ought to reconsider, but at present we have no evidence that this is the case. There are a lot of Thread BRs in the field at this point, so if this is commonplace, we ought to be seeing more bug reports. I'm not seeing them, and to the best of my knowledge vendors we talk to aren't seeing them.

jonsmirl commented 2 years ago

Here is the bug in the Espressif system. https://github.com/espressif/esp-matter/issues/47

It's no big deal to buy a new AP/router. The problem here is communicating to the homeowner that they need to do so. So we need to identify that it is broken and then recommend what to do.

In my personal case, I know for sure IPv6 multicast is not working. But I have not been able to track down exactly why it is broken.

Abhayakara commented 2 years ago

Thanks!

Yes, we should definitely be able to detect the broken network. Reporting it to the user could be problematic—in the case of the Espressif bug, if it's that the APs are not forwarding multicast to the backbone, or vice versa, what advice do you give them? I mean, you can say "your network is broken," but that's not actionable. So that's a whole project.

That said, multi-AP same-SSID has never worked well. Partly it's issues on hosts that do dumb things like bond to the distant AP and then have issues receiving packets. Well, actually that's the only problem I've been able to identify, although I don't doubt that there could be others.

In any case, the multi-AP use case is not our typical use case, and people who have this kind of setup are more likely than average to be able to get help, so maybe "multicast on your network is broken" is enough feedback for this particular case.

Ultimately if we succeed with the Matter AP initiative, we ought to be able to get rid of a lot of multicast, and this would also give us a venue for getting vendors to fix things with their Matter-certified APs if we run into problems.

jonsmirl commented 2 years ago

Some type of testing app is needed. Inject multicast packets into the network, then walk around with your phone and make sure it can receive them everywhere. If the phone can attach to an AP but does not receive the multicast, then that AP is broken.
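A minimal sketch of such a probe, assuming a wired host runs `send_probes` and a Wi-Fi client runs `receive_probes`. The group address, port, magic string, and all function names are hypothetical choices for illustration; none of them come from Matter.

```python
import socket
import struct

# Hypothetical values picked for this sketch; not defined by Matter.
PROBE_GROUP = "ff02::1234"   # arbitrary link-local-scope multicast group
PROBE_PORT = 50000           # arbitrary UDP port
MAGIC = b"MCPROBE1"

def build_probe(seq: int) -> bytes:
    """Pack a probe: 8-byte magic followed by a 32-bit sequence number."""
    return MAGIC + struct.pack("!I", seq)

def parse_probe(data: bytes):
    """Return the sequence number, or None if this is not a probe."""
    if len(data) != len(MAGIC) + 4 or not data.startswith(MAGIC):
        return None
    return struct.unpack("!I", data[len(MAGIC):])[0]

def missing_sequences(received, first: int, last: int):
    """List sequence numbers in [first, last] that never arrived."""
    seen = set(received)
    return [s for s in range(first, last + 1) if s not in seen]

def send_probes(ifindex: int, count: int) -> None:
    """Inject `count` probes through the interface with index `ifindex`."""
    with socket.socket(socket.AF_INET6, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_MULTICAST_IF, ifindex)
        for seq in range(count):
            sock.sendto(build_probe(seq), (PROBE_GROUP, PROBE_PORT, 0, ifindex))

def receive_probes(ifindex: int, timeout_s: float):
    """Join the group and collect sequence numbers until `timeout_s` of silence."""
    received = []
    with socket.socket(socket.AF_INET6, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind(("::", PROBE_PORT))
        # ipv6_mreq: 16-byte group address followed by the interface index.
        mreq = socket.inet_pton(socket.AF_INET6, PROBE_GROUP) + struct.pack("@I", ifindex)
        sock.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_JOIN_GROUP, mreq)
        sock.settimeout(timeout_s)
        try:
            while True:
                data, _ = sock.recvfrom(64)
                seq = parse_probe(data)
                if seq is not None:
                    received.append(seq)
        except socket.timeout:
            pass
    return received
```

Comparing `missing_sequences(...)` per AP would separate an AP that never forwards multicast from one that drops intermittently, though a sketch like this says nothing about which network layer is at fault.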

Abhayakara commented 2 years ago

I'm not sure how to make one that works well enough to be useful, but that's a good idea. Possibly we could identify people at Apple and Google who could do it.

jonsmirl commented 2 years ago

We just got around to implementing group support in our app this week. Lack of bug reports on this may be because very few people are testing groups outside of controlled test networks.

Abhayakara commented 2 years ago

So multicast to accessories, as opposed to for service discovery or neighbor discovery?

jonsmirl commented 2 years ago

We are working on this:

[diagram: grouped-dimmers]

Abhayakara commented 2 years ago

Hm. Well. Depending on the speed of the network, it's probably more robust to do unicast for that anyway. You're going to need all the devices to confirm anyway, and multicast is both less reliable and more bandwidth-intensive (on WiFi at least) so you'd have to have a lot of lights and very reliable multicast to get any kind of benefit from it.

Abhayakara commented 2 years ago

I guess it might be a win to do multicast once and then retransmit unicast per-device for the ones that don't ack within a very narrow window. If multicast winds up costing you more than 100ms, though, on a flat network you've probably already lost out. Sorry if this is obvious—I haven't had time to follow the discussion on this, although I was aware it was going on.
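The "multicast once, then unicast to whoever did not ack" idea can be sketched as follows. This assumes the sender already knows the group members, which Matter itself does not track; the transport callbacks, the function name, and the default 100 ms window are placeholders for illustration.

```python
import time
from typing import Callable, Iterable, Set

def group_command_with_fallback(
    members: Iterable[str],
    send_multicast: Callable[[], None],
    send_unicast: Callable[[str], None],
    acked: Callable[[], Set[str]],
    window_s: float = 0.1,    # assumed ack window
    poll_s: float = 0.005,
) -> Set[str]:
    """Send once by multicast, then unicast to members that did not ack in time.

    Returns the set of members that needed the unicast follow-up.
    """
    pending = set(members)
    send_multicast()
    deadline = time.monotonic() + window_s
    # Poll for acknowledgments until everyone has answered or the window closes.
    while pending and time.monotonic() < deadline:
        pending -= acked()
        if pending:
            time.sleep(poll_s)
    pending -= acked()
    # Follow up individually with every member that stayed silent.
    for node in sorted(pending):
        send_unicast(node)
    return pending
```

The window is the knob that matters: too short and every command degenerates into multicast plus a full round of unicasts, too long and the stragglers become visible as popcorning.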

jonsmirl commented 2 years ago

The point of wanting the group is to stop popcorning. Popcorning is very visible and EVERYONE complains about it. Popcorning is where you can see the lights being turned on in sequence.

jonsmirl commented 2 years ago

Also note, Matter does not track who is a member of a group. So I need to build a separate mechanism to track group membership and do the follow up unicast messages.

Has Boris talked to you about virtual composite devices and using them to achieve inter-fabric communication? That diagram above will be hidden inside a virtual composite device.

Abhayakara commented 2 years ago

Sure, but you can send half a million "lights on" packets in a second with 450 Mbps WiFi. So if you have to turn on 50 lights, that's well under a millisecond to send all the "lights on" packets. Of course there are delays going through the router, and that's peak capacity, which you probably don't want to touch, but you still have three orders of magnitude of room there before the user starts to be able to perceive a delay.
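The arithmetic behind those figures, worked through as a quick check. The link rate, packet rate, and light count come from the comment; the 100 ms perceptibility threshold is an assumption used only to estimate the headroom.

```python
LINK_RATE_BPS = 450e6          # "450 Mbps" from the comment
PACKETS_PER_SECOND = 500_000   # "half a million ... packets in a second"
LIGHTS = 50
PERCEPTIBLE_DELAY_S = 0.1      # assumption: a delay of about 100 ms is noticeable

# Packet size implied by the two rates above.
bytes_per_packet = LINK_RATE_BPS / PACKETS_PER_SECOND / 8

# Time on the wire for one unicast packet per light at peak rate.
send_time_s = LIGHTS / PACKETS_PER_SECOND

# How far below the assumed perceptibility threshold that is.
headroom = PERCEPTIBLE_DELAY_S / send_time_s

print(f"{bytes_per_packet:.1f} bytes per packet")
print(f"{send_time_s * 1000:.1f} ms to send {LIGHTS} packets")
print(f"about {headroom:.0f}x headroom")
```

This works out to roughly 112 bytes per packet, 0.1 ms for 50 lights, and a factor of about 1000, which matches the "three orders of magnitude" in the comment.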

Whereas if you use multicast, you can be sure that it will not reach every device on the first transmission. It will use very large timeslices, because multicast on WiFi is very slow. Multicast isn't acknowledged at layer 2, so it's not delivered reliably. And so it's almost certainly going to take no less time to send the multicast than the 50 unicasts.

And then as soon as you have to retransmit, you've lost, because multicast already retransmits several times to improve reliability.

Anyway, this feels like a strange design choice. Why not track which devices are members of the group? What are you saving by not doing that?

jonsmirl commented 2 years ago

Note that there are three things that can control the group: the voice assistant and the two switches. So the group has to be tracked in all three of them, and it also has to be kept in sync. I can certainly build that, but it is just something we have not gotten around to yet. Maybe I should switch over to that model. I will need to make custom clusters to implement the group tracking, since there are no standard Matter commands for that.

I will admit that the automation systems with observable popcorning have much slower network speeds.

Abhayakara commented 2 years ago

When we've seen popcorn, it's been with bluetooth. Even with Thread, I don't see popcorning when I tell Siri to turn all the lights on. Personally I'm more worried about multicast failing to reach some devices due to lack of acknowledgment messages. So yeah, if you're concerned about this I think implementing group tracking is a must, even if you're doing multicast.

That said, it'd be pretty easy to do—just add the group to the DNSSD advertisements. You can do this either in the text record, or with DNSSD subtypes. You're still doing multicast, but now you're only doing it for service discovery, and you can cache old data, because there's zero cost to trying to actuate a light that's gone. If DNSSD is working properly, which it generally is, then you now have a list of all the lights you need to talk to, and the lights themselves are maintaining it, just as with your multicast group approach.
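A rough sketch of the text-record variant with a cache that keeps old data. The TXT key `grp`, the comma-separated format, and the function names are hypothetical; Matter defines no such key.

```python
from typing import Dict, Iterable, Set, Tuple

GROUP_TXT_KEY = "grp"  # hypothetical TXT key; not defined by Matter

def groups_from_txt(txt: Dict[str, str]) -> Set[str]:
    """Read a comma-separated group list from a TXT record dict."""
    raw = txt.get(GROUP_TXT_KEY, "")
    return {g.strip() for g in raw.split(",") if g.strip()}

def merge_discovery(
    cache: Dict[str, Set[str]],
    discovered: Iterable[Tuple[str, Dict[str, str]]],
) -> None:
    """Add newly seen instances to each group's member set, never evicting."""
    for instance, txt in discovered:
        for group in groups_from_txt(txt):
            cache.setdefault(group, set()).add(instance)
```

Never evicting follows the "zero cost to trying to actuate a light that's gone" reasoning. A light that is still present but has left the group would need separate handling, which this sketch does not attempt.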

The best part of this is that when we have the Matter AP stuff in the field, you can use unicast for DNSSD, because that's one of the goals for Matter AP. So now your only reliance on multicast is for neighbor discovery. This is a lot more fault tolerant than using multicast for control messages.

jonsmirl commented 2 years ago

Doing group tracking with DNSSD would have to be supported in Matter core. If I'm grouping three light bulbs into group G7, then the three light bulbs need to publish subtype=_G7._sub._matter._tcp. That means the code has to live in the light bulbs, which we don't make. Then my switch can query for _G7._sub._matter._tcp to find the group members. DNSSD does work for this, but the support has to be in the core and everyone has to use it.
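For illustration, the subtype name follows the `<subtype>._sub.<service>` pattern, so building and parsing it is plain string handling. The helper names are hypothetical, and the base service type is taken as written in the comment above.

```python
BASE_SERVICE = "_matter._tcp"  # service type as written in the comment above

def group_subtype(group_id: str, base: str = BASE_SERVICE) -> str:
    """Build the DNS-SD subtype name a group member would advertise."""
    label = "_" + group_id.lstrip("_")
    return f"{label}._sub.{base}"

def group_from_subtype(name: str, base: str = BASE_SERVICE):
    """Recover the group id from a subtype name, or None if it does not match."""
    suffix = "._sub." + base
    if not name.endswith(suffix):
        return None
    label = name[: -len(suffix)]
    # The subtype must be a single label that starts with an underscore.
    if len(label) < 2 or not label.startswith("_") or "." in label:
        return None
    return label[1:]
```

A switch would browse for `group_subtype("G7")` and treat every instance that answers as a member. The lights maintain the list themselves by advertising or withdrawing the subtype.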

The existing Matter group implementation could be transparently switched onto this new scheme. Existing code can't tell if the group command is being sent via multicast or multiple unicast messages. This DNS scheme is probably better than the existing multicast implementation since it avoids the possibility of broken multicast in the routers. Of course this would need a vote.

Abhayakara commented 2 years ago

Subtypes are part of RFC 6763, so in theory everyone should implement them, but I don't know how well this is supported in Avahi. But advertising subtypes is really easy, so if it's not there already, it should be straightforward to add. Personally, this would be my preferred approach. I don't normally show up for these meetings, but if you think it would help, I can. I think Stuart shows up, but he's on vacation next week (and was this week).

jonsmirl commented 2 years ago

I used avahi to play with it, it works fine.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

yf-fan-org commented 9 months ago

I used avahi to play with it, it works fine.

Hello, may I ask how you effectively solved this problem? I have also encountered it: when controlling multiple devices as a group, the device responses are not synchronized, and individual devices sometimes do not respond to the group action at all. Thank you very much!

jonsmirl commented 9 months ago

I still don't have it working reliably. It is also extremely hard to debug, given all of the rotating MACs, encryption, packets hopping between Ethernet and WiFi/Thread, etc. You can't just turn on a sniffer and see what failed. I can see the packets get transmitted, but then they don't arrive at their destination. Why? That is a difficult question. It's not random drops, because sometimes I can poke a button hooked up to multicast and it will fail ten times in a row, then start working. Another failure I see is that multicast will fail between a pair, but then if I send a unicast command over the same path, multicast will start working.

One thing I am suspicious of is spanning tree algorithms used in routers to route IPv6 multicast. I have identified and thrown out two old routers (10 years old with no updates available) where IPv6 multicast was not functioning correctly. But those were easy to identify, IPv6 multicast simply would not pass through them.

AFAIK there are no tools available to robustly test whether IPv6 multicast is working correctly on a network. I also don't know which networking layer is causing the failures.

One thing I am sure of is that the failures I see are not the result of random drops due to collisions. The failures are semi-reproducible, in that they keep coming back in the same failure mode, but I can't trigger the failure on demand. If it were random drops, I would not see patterns in the failures.

This problem needs large company testing resources thrown at it, much more than I have available.

yf-fan-org commented 9 months ago

Thank you very much for your answer! My tests showed the same phenomenon you encountered. Using tcpdump, I could see that the multicast traffic was sent out. I then connected to a WiFi hotspot hosted on my computer, and with Wireshark on that computer I was able to see the traffic to the multicast IP, but individual devices in the group would intermittently fail to receive the multicast messages.