project-chip / connectedhomeip

Matter (formerly Project CHIP) creates more connections between more objects, simplifying development for manufacturers and increasing compatibility for consumers, guided by the Connectivity Standards Alliance.
https://buildwithmatter.com
Apache License 2.0
7.55k stars 2.04k forks source link

[BUG] Current commisionable DNS record matter._udp gets registered to SRP server with TTL = 7200 (2hr), but commisioning window is ~15m. #24909

Open AlanLCollins opened 1 year ago

AlanLCollins commented 1 year ago

Reproduction steps

  1. Perform ble-thread to a Matter light-bulb example accessory.
  2. Confirm normal operation (send / read commands), including that BR is advertising both _udp and _tcp records over wifi.
  3. FactoryReset the accessory device before the existing commissioning window completes (wihtin 15 min).

The _udp record will continue stale in the SRP server for 2 hrs Any Discovery over wifi will provide an IPv6 address to a device that no longer exists. So provisioning attempts will fail.

proposal:

  1. close the commisioning window as soon as the CommissioningComplete command is received using General Commissioning Cluster.
  2. and/or modify the TTL for _udp record to match the time for the commisioning window.

Bug prevalence

always

GitHub hash of the SDK that was being used

561d23d0db215a99705ff0696e73853c8edf11b2

Platform

android

Platform Version(s)

No response

Anything else?

No response

jwhui commented 1 year ago

I would suggest specifying a shorter SRP lease time. I believe the current Matter implementation does not specify a value and relies on the default SRP lease time.

bzbarsky-apple commented 1 year ago

including that BR is advertising both _udp and _tcp records over wifi.

Why are there _udp records? Does the device support extended commissionable discovery? What do those records look like, and in particular, do they have CM=1 or CM=0? @AlanLCollins

close the commisioning window as soon as the CommissioningComplete command is received using General Commissioning Cluster.

Yes, we do that. I thought we were supposed to withdraw the now-stale DNS-SD records when that happens. @Damian-Nordic ?

and/or modify the TTL for _udp record to match the time for the commisioning window.

See also https://github.com/project-chip/connectedhomeip/issues/16858.

AlanLCollins commented 1 year ago

Hello @bzbarsky-apple , the record has CM=0. See sniffer log attached. Nwk key = f294a6ccd5e88a0b0df947dec263a385 02-09-Matter_provissioning-Golden.pcapng.zip What is the expected behavior of the CM flag? I was expecting to see it CM=1, and after CommissioningComplete command is received, light-bulb to send DNS update with CM=0. Does the CM flag best behavior requires specific commit from the CHIP SDK ?

The _udp record is withdrawn after 15m, the light-bulb sends DNS update with TTL = 0s.

mdamle commented 1 year ago

Here's another idea for a solution. When a Thread devices dis-associates from the network, it sends a "I am leaving the network" broadcast. Any Thread devices on the network that have cached data associated with this device, listen to this broadcast and proactively purge their data. BRs can purge SRP services registered for this device and any entries in neighbor/routing tables. The Thread devices can send this broadcast when:

  1. commissioning failed (arm fail safe timer expires) and its dis-associates from the configured Thread network.
  2. When it leaves the last fabric and the device is not functional without Matter (that is not a Smart TV/toaster than can work outside of Matter).
  3. factory-reset if platform impls can issue it before resetting.
Damian-Nordic commented 1 year ago

I am a bit reluctant to the solution that involves sending SRP updates on the factory reset as I think the factory reset should include as few steps as possible to increase chances of successful recovery of a device that has entered a faulty state. On the other hand, we already generate Leave events in such a scenario, so maybe that's not a big deal after all... I can't make up my mind :)

Anyway, reducing the lease time sounds like a good idea to me as it should be easy to implement, and it would also help for the case when the device has lost the connectivity with the border router. Note that OpenThread uses a single lease time for the entire SRP update, so we can't set a shorter lease time just for the _matterc._udp service. It shouldn't be a big problem though as we might do sth like this:

mdamle commented 1 year ago

I am a bit reluctant to the solution that involves sending SRP updates on the factory reset as I think the factory reset should include as few steps as possible to increase chances of successful recovery of a device that has entered a faulty state. On the other hand, we already generate Leave events in such a scenario,

Ah so a leave event already exists and is sent by Thread device logically exiting a Thread network. That is good to know.

For shorter lease based solution, referencing: https://github.com/project-chip/connectedhomeip/issues/16858#issuecomment-1422109811 , it seems like Thread BR decides what lease time to set. Can the Thread device request/suggest a suitable lease time for a given SRP update?

bzbarsky-apple commented 1 year ago

Hello @bzbarsky-apple , the record has CM=0. See sniffer log attached.

OK, so that's an extended commissioning advertisement. But the original issue report said:

Any Discovery over wifi will provide an IPv6 address to a device that no longer exists. So provisioning attempts will fail.

which should not really be the case. Commissioning discovery should (and does, afaik) ignore CM=0 DNS-SD results, precisely because those do not represent commissionable devices.

What is the expected behavior of the CM flag?

CM=1 means "commissioning window is open". CM=0 means "commissioning window is not open, but I am advertising so you will have the pairing hints to try to open a commissioning window if you decide you want to do that".

Commissioners that are not trying to do discovery of things that are not commissionable but could be should ignore CM=0 results.

Commissionees that are not trying to implement the extended discovery behavior (e.g. not providing useful pairing hint TXT records) should turn off extended discovery altogether.

Commissionees that do want to do extended discovery right now switch from advertising with CM=1 to advertising with CM=0 when the commissioning window closes.

The _udp record is withdrawn after 15m

Yes, when the extended discovery times out...

Here's another idea for a solution.

A solution to what exact problem?

AlanLCollins commented 1 year ago

Thanks @bzbarsky-apple , very clear description:

CM=1 means "commissioning window is open". CM=0 means "commissioning window is not open, but I am advertising so you will have the pairing hints to try to open a commissioning window if you decide you want to do that".

is this behavior included in CHIP SDK1.0, so all existing Matter-certified devices should be complaint ?

bzbarsky-apple commented 1 year ago

is this behavior included in CHIP SDK1.0

Yes.

so all existing Matter-certified devices should be complaint ?

I don't know whether we have specific certification tests for this... But we did do some manual developer testing with both minimal mdns and some of the platform mdns backends (definitely the darwin one).

AlanLCollins commented 1 year ago

Awesome. Well, I can confirm that Light-bulb example (min-mdns) running on Nordic's-dk (nrf tag: v2.2.0) with default settings is giving me matterc._udp CM=0. My new understanding of the behavior solves my original concern. However there is still an area of improvement in the scenario which light-bulb gets factory reset during 2nd admin flow, and we end up with matterc._udp record CM=1 stale in the border router for 2 hrs. In that case, we need CHIP SDK to implement @Damian-Nordic suggestions.

bzbarsky-apple commented 1 year ago

Yes, "factory reset while commissioning window is open" will absolutely cause problems. If we can do what @Damian-Nordic suggests, that would be very nice.

jwhui commented 1 year ago

I agree with @Damian-Nordic 's proposal.

SRP and OpenThread already support the client specifying lease intervals on a given service instance.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.