Closed ivmarkov closed 1 month ago
Hmmm, or shall we just clear the FailSafe when we switch from BLE to Wifi? That should also be app-induced though? And we'll lose the NOC state...
Hmmm, or shall we just clear the FailSafe when we switch from BLE to Wifi? That should also be app-induced though? And we'll lose the NOC state...
I think FailSafe is supposed to protect against any configuration errors ... even if you succeed to connect to WiFi, it may be that WiFi is not actually working (e.g. blocking mdns so devices cannot see each other). I believe you have to keep the failsafe on until you confirm that the other side can talk to you, which is what the case part does.
So without actually knowing much detail, I would be inclined to think we cannot clear failsafes at any point except when we are explicitly told "done" when the entire commissioning succeeds.
Hmmm, or shall we just clear the FailSafe when we switch from BLE to Wifi? That should also be app-induced though? And we'll lose the NOC state...
I think FailSafe is supposed to protect against any configuration errors ... even if you succeed to connect to WiFi, it may be that WiFi is not actually working (e.g. blocking mdns so devices cannot see each other). I believe you have to keep the failsafe on until you confirm that the other side can talk to you, which is what the case part does.
So without actually knowing much detail, I would be inclined to think we cannot clear failsafes at any point except when we are explicitly told "done" when the entire commissioning succeeds.
I agree it makes sense! But then by implication we should really - somehow - support the case where the fail-safe is initially armed over PASE (on the BTP channel, not that it matters) - yet - the "re-arm" comes via CASE (on the wifi/udp channel).
In fact, I don't even see the "session type" to be preserved in the C++ FailSafe state at all? Am I looking at the right spot? Or are we overly paranoid in rs-matter?
IIRC, the session_mode
was added to the context, for capturing the Accessing Fabric Index (which is maintained in the SessionMode::Case
.
I think the problem is caused because we don't implement this part from the spec which says what should be part of the FaileSafe Context:
"A Fabric Index for the fabric-scoping of the context, starting at the accessing fabric index for the ArmFailSafe command, and updated with the Fabric Index associated with an AddNOC command or an UpdateNOC command being invoked successfully during the ongoing Fail-Safe timer period"
Once an AddNOC
is received, this should update the mode to SessionMode::Case(new_fabric_index)
thus fixing the problem in the current scenario?
Once an
AddNOC
is received, this should update the mode toSessionMode::Case(new_fabric_index)
thus fixing the problem in the current scenario?
Yes, but only if after AddNOC
the comissioner does not send us Failsafe "ReArm" still over BLE.
Because if it does so, it will do it over the PASE session, while we have just changed our expectation to be, that any failsafe commands will come over a CASE session.
In any case, let me try what you suggest.
Once an
AddNOC
is received, this should update the mode toSessionMode::Case(new_fabric_index)
thus fixing the problem in the current scenario?Yes, but only if after
AddNOC
the comissioner does not send us Failsafe "ReArm" still over BLE.Because if it does so, it will do it over the PASE session, while we have just changed our expectation to be, that any failsafe commands will come over a CASE session.
The way I interpret the specification here is that the ownership of the context is now passed to the new Fabric Id. So doing a rearm from the previous Fabric (or a PASE session) should be against the spec.
@kedars Finally back to this issue today, after a flurry of other bugfixes.
The bad news: no, your suggestion won't work :(
... and yes, we DO receive the second ArmFailSafe
request STILL over the original PASE session, as I was afraid initially.
(I'll provide the logs shortly, for your inspection - currently shortening them a bit so that you don't have to grep too much.)
The good news: I'm now sure that you are slightly misinterpreting the spec and missing an important paragraph in there. :) You said:
Once an
AddNOC
is received, this should update the mode toSessionMode::Case(new_fabric_index)
thus fixing the problem in the current scenario?Yes, but only if after
AddNOC
the comissioner does not send us Failsafe "ReArm" still over BLE. Because if it does so, it will do it over the PASE session, while we have just changed our expectation to be, that any failsafe commands will come over a CASE session.The way I interpret the specification here is that the ownership of the context is now passed to the new Fabric Id. So doing a rearm from the previous Fabric (or a PASE session) should be against the spec.
Yes, the re-arm comes from the same old PASE session. :) BUT: it is not from the previous fabric! we should have upgraded our PASE session to the new fabric (and by the way, the fact that our current code does not even support associating a fabric index with a PASE session is wrong, will fix that). The spec says (the paragraph I mentioned you are missing):
" If the current secure session was established with PASE, the receiver SHALL: a. Augment the secure session context with the FabricIndex generated above, such that subsequent interactions have the proper accessing fabric. "
The C++ SDK faithfully does just that. (And supports fabric index for PASE sessions too and supports upgrading the PASE session fabric index - but does not support upgrading the fabric index of a CASE session - which the spec prohibits explicitly further down in the paragraph I started quoting.)
The above by the way means that - within the Failsafe
struct - we should not even care what was the session type that opened the failsafe originally - as we do now (which I'll remove), while the C++ SDK doesn't. We only need to know what is the current fabric index in the failsafe (if any), and whether add noc and/or update noc were called. Just like in the C++ SDK.
Here, a shortened log for your entertainment (all over BLE btw). Grep the !!!!!!!!!!
stuff I've embedded:
>>>>> BTP 4E:19:47:80:FA:3A [S,SID:0,CTR:a8ed7fe,SRC:6da43a0e9c970244][I,EID:646a,PROTO:0,OP:20]
SC::PBKDFParamRequest
!!!!!!!!!! ^^^ Plain text session starts (SID:0), with PASE negotiation
(skipping the nego that follows)
....................
I (65699) rs_matter::transport::session: New exchange: 1::0 [SID:1,RSID:2182,EID:646b] :: Responder(AcceptPending)
I (65699) rs_matter::transport::core:
>>>>> BTP 4E:19:47:80:FA:3A [SID:1,CTR:2005f9][I,EID:646b,PROTO:1,OP:2]
IM::ReadRequest
=> Processing (new exchange)
I (65719) rs_matter::transport::core: Exchange 1::0 [SID:1,RSID:2182,EID:646b]: Accepted
I (65719) rs_matter::respond: Responder: Handler 0 / exchange 1::0: Starting
I (65739) rs_matter::transport::exchange:
!!!!!!!!!! ^^^ Here's our newly-negotiated PASE session (SID:1)
(skipping non-interesting stuff thast follows, like a bunch of IM::ReadRequest reads)
....................
I (66939) rs_matter::transport::network::btp::session:
>>>>> (BTP IO) 4E:19:47:80:FA:3A [A|B|E,ACTR:b,CTR:a,LEN:41]
READ 65B
I (66959) rs_matter::transport::session: New exchange: 1::0 [SID:1,RSID:2182,EID:6472] :: Responder(AcceptPending)
I (66959) rs_matter::transport::core:
>>>>> BTP 4E:19:47:80:FA:3A [SID:1,CTR:200600][I,EID:6472,PROTO:1,OP:8]
IM::InvokeRequest
=> Processing (new exchange)
I (66979) light: Light toggled
I (66979) rs_matter::transport::core: Exchange 1::0 [SID:1,RSID:2182,EID:6472]: Accepted
I (66989) rs_matter::respond: Responder: Handler 0 / exchange 1::0: Starting
I (66999) rs_matter::data_model::sdm::general_commissioning: Handling command ARM Fail Safe
I (67009) rs_matter::transport::exchange:
!!!!!!!!!! ^^^ Starts to get interesting, we got Arm fail safe over the PASE session
(again, skipping non-interesting stuff that follows, like regulatory config, cert chain request and so on)
....................
I (73149) rs_matter::transport::network::btp::session:
<<<<< (BTP IO) 4E:19:47:80:FA:3A [A|B|E,ACTR:11,CTR:19,LEN:3f]
WRITE 63B
I (73219) rs_matter::transport::network::btp::session:
>>>>> (BTP IO) 4E:19:47:80:FA:3A [A|B,ACTR:19,CTR:12,LEN:15b]
READ 239B
I (73279) rs_matter::transport::network::btp::session:
>>>>> (BTP IO) 4E:19:47:80:FA:3A [C|E,CTR:13]
READ 108B
I (73289) rs_matter::transport::session: New exchange: 1::0 [SID:1,RSID:2182,EID:6479] :: Responder(AcceptPending)
I (73289) rs_matter::transport::core:
>>>>> BTP 4E:19:47:80:FA:3A [SID:1,CTR:200607][I,EID:6479,PROTO:1,OP:8]
IM::InvokeRequest
=> Processing (new exchange)
I (73309) rs_matter::transport::core: Exchange 1::0 [SID:1,RSID:2182,EID:6479]: Accepted
I (73309) rs_matter::respond: Responder: Handler 0 / exchange 1::0: Starting
I (73319) rs_matter::data_model::sdm::noc: Handling command AddNOC
I (73329) rs_matter::data_model::sdm::noc: Received NOC as: Version: [0]
[2]
Serial Num: [1]
Signature Algorithm:
ECDSA with SHA256
Issuer: ... (let's skip the cert)
!!!!!!!!!! ^^^ OK, here we get the NOC over our PASE session
(again, I'll skip non-interesting stuff that follows, like configuring Wifi stuff (we are still running over BLE))
....................
I (73689) rs_matter::transport::network::btp::session:
<<<<< (BTP IO) 4E:19:47:80:FA:3A [A|B|E,ACTR:14,CTR:1b,LEN:3f]
WRITE 63B
I (73759) rs_matter::transport::network::btp::session:
>>>>> (BTP IO) 4E:19:47:80:FA:3A [A|B|E,ACTR:1b,CTR:15,LEN:41]
READ 65B
I (73769) rs_matter::transport::session: New exchange: 1::0 [SID:1,RSID:2182,EID:647b] :: Responder(AcceptPending)
I (73769) rs_matter::transport::core:
>>>>> BTP 4E:19:47:80:FA:3A [SID:1,CTR:200609][I,EID:647b,PROTO:1,OP:8]
IM::InvokeRequest
=> Processing (new exchange)
I (73789) rs_matter::transport::core: Exchange 1::0 [SID:1,RSID:2182,EID:647b]: Accepted
I (73799) rs_matter::respond: Responder: Handler 0 / exchange 1::0: Starting
I (73809) rs_matter::data_model::sdm::general_commissioning: Handling command ARM Fail Safe
I (73819) rs_matter::transport::exchange:
<<<<< BTP 4E:19:47:80:FA:3A [SID:2182,CTR:b790a91][EID:647b,PROTO:1,OP:9]
IM::InvokeResponse
!!!!!!!!!! ^^^ And... this is the problem! We just got a SECOND ArmFailSafe over... our PASE session! (SID:1) - we haven't seen a SINGLE Case session request (and won't see it in BLE)
!!!!!!!!!! So here, we'll error out if we have changed in the failsafe `session_type` to be SessionType::Case(...)
(below, non-interesting stuff I'll skip)
....................
I think this is now ready for review.
The reason why this PR has a bigger changeset is because I changed the representation of fabric_index
from u8
to NonZeroU8
in all places, where we do expect a valid fabric index (i.e. where 0 - which is "no fabric" or "invalid fabric" - is actually not expected and should not be accepted). For example:
While NonZeroU8
cannot protect us from fabric index 255 (which is also invalid), I think the fact that we can now differentiate between the two cases (no fabric is OK vs fabric should be there) is already beneficial for the readibility of the code.
Btw, Option<NonZeroU8>
is technically and representationally equivalent to u8
in that the None
value is modeled by the compiler with the value "0". So I preserved u8
to represent the case where "no fabric is OK", also because NonZeroU8::new(<u8>)
gives Option<NonZeroU8>
anyway i.e. converting between these two equivalent representations is trivial.
@kedars @andreilitvin This PR is in Draft status, as we need to get on the same page w.r.t. what is happening first, and then brainstorm a bit how to properly address the issue.
Background
I'm currently experiencing the following weird case with non-concurrent connection flow during commissioning (case seems to be described in section 5.5 of the Matter Core Spec):
ConnectNetwork
request on the Wifi cluster, it does switch from BLE to Wifi as the main transport...ConnectNetwork
even though we have not actually connected to the Wifi network, but oh well, that's what the Spec mandates the flow to be ... and this is what the C++ SDK does too)FailSafe Re-Arm
request on behalf of a CASE session (which was just negotiated)... and that's the problem. The existing code which is in place does not allow failsafe re-arms which are not coming from the same session type which has armed the FailSafe in the first place, so the re-arming (and the whole provisioning) fails!
PR solution
Now, the PR is a bit of a "patch" in that it expects:
rs-matter
to expect a re-arm from a CASE session (the app knows about that because it is the app which drives the Wifi cluster and the overall switch from BLE to Wifi in the "non-concurrent-connection" commissioning flow)CaseRearm
enum that I had to introduce because of thatAlternatives
Now, looking at the C++ SDK FailSafe codebase, I don't see any checks w.r.t. whether the re-arm is coming from the correct session type (i.e. from a PASE session for a PASE failsafe and from a CASE session for a CASE failsafe).
In fact, I don't even see the "session type" to be preserved in the C++ FailSafe state at all? Am I looking at the right spot? Or are we overly paranoid in
rs-matter
?Shall we just allow re-arms from ANY session type? Therefore, shall we change the
Failsafe
impl to not really keep track of the session type?