project-chip / rs-matter

Rust implementation of the Matter protocol. Status: Experimental
Apache License 2.0
303 stars 43 forks source link

CASE re-arm for a FailSafe armed over a PASE session initially #170

Closed ivmarkov closed 1 month ago

ivmarkov commented 2 months ago

@kedars @andreilitvin This PR is in Draft status, as we need to get on the same page w.r.t. what is happening first, and then brainstorm a bit how to properly address the issue.

Background

I'm currently experiencing the following weird case with non-concurrent connection flow during commissioning (case seems to be described in section 5.5 of the Matter Core Spec):

... and that's the problem. The existing code which is in place does not allow failsafe re-arms which are not coming from the same session type which has armed the FailSafe in the first place, so the re-arming (and the whole provisioning) fails!

PR solution

Now, the PR is a bit of a "patch" in that it expects:

Alternatives

Now, looking at the C++ SDK FailSafe codebase, I don't see any checks w.r.t. whether the re-arm is coming from the correct session type (i.e. from a PASE session for a PASE failsafe and from a CASE session for a CASE failsafe).

In fact, I don't even see the "session type" to be preserved in the C++ FailSafe state at all? Am I looking at the right spot? Or are we overly paranoid in rs-matter?

Shall we just allow re-arms from ANY session type? Therefore, shall we change the Failsafe impl to not really keep track of the session type?

ivmarkov commented 2 months ago

Hmmm, or shall we just clear the FailSafe when we switch from BLE to Wifi? That should also be app-induced though? And we'll lose the NOC state...

andy31415 commented 2 months ago

Hmmm, or shall we just clear the FailSafe when we switch from BLE to Wifi? That should also be app-induced though? And we'll lose the NOC state...

I think FailSafe is supposed to protect against any configuration errors ... even if you succeed to connect to WiFi, it may be that WiFi is not actually working (e.g. blocking mdns so devices cannot see each other). I believe you have to keep the failsafe on until you confirm that the other side can talk to you, which is what the case part does.

So without actually knowing much detail, I would be inclined to think we cannot clear failsafes at any point except when we are explicitly told "done" when the entire commissioning succeeds.

ivmarkov commented 2 months ago

Hmmm, or shall we just clear the FailSafe when we switch from BLE to Wifi? That should also be app-induced though? And we'll lose the NOC state...

I think FailSafe is supposed to protect against any configuration errors ... even if you succeed to connect to WiFi, it may be that WiFi is not actually working (e.g. blocking mdns so devices cannot see each other). I believe you have to keep the failsafe on until you confirm that the other side can talk to you, which is what the case part does.

So without actually knowing much detail, I would be inclined to think we cannot clear failsafes at any point except when we are explicitly told "done" when the entire commissioning succeeds.

I agree it makes sense! But then by implication we should really - somehow - support the case where the fail-safe is initially armed over PASE (on the BTP channel, not that it matters) - yet - the "re-arm" comes via CASE (on the wifi/udp channel).

kedars commented 2 months ago

In fact, I don't even see the "session type" to be preserved in the C++ FailSafe state at all? Am I looking at the right spot? Or are we overly paranoid in rs-matter?

IIRC, the session_mode was added to the context, for capturing the Accessing Fabric Index (which is maintained in the SessionMode::Case.

I think the problem is caused because we don't implement this part from the spec which says what should be part of the FaileSafe Context:

"A Fabric Index for the fabric-scoping of the context, starting at the accessing fabric index for the ArmFailSafe command, and updated with the Fabric Index associated with an AddNOC command or an UpdateNOC command being invoked successfully during the ongoing Fail-Safe timer period"

Once an AddNOC is received, this should update the mode to SessionMode::Case(new_fabric_index) thus fixing the problem in the current scenario?

ivmarkov commented 2 months ago

Once an AddNOC is received, this should update the mode to SessionMode::Case(new_fabric_index) thus fixing the problem in the current scenario?

Yes, but only if after AddNOC the comissioner does not send us Failsafe "ReArm" still over BLE.

Because if it does so, it will do it over the PASE session, while we have just changed our expectation to be, that any failsafe commands will come over a CASE session.

ivmarkov commented 2 months ago

In any case, let me try what you suggest.

kedars commented 2 months ago

Once an AddNOC is received, this should update the mode to SessionMode::Case(new_fabric_index) thus fixing the problem in the current scenario?

Yes, but only if after AddNOC the comissioner does not send us Failsafe "ReArm" still over BLE.

Because if it does so, it will do it over the PASE session, while we have just changed our expectation to be, that any failsafe commands will come over a CASE session.

The way I interpret the specification here is that the ownership of the context is now passed to the new Fabric Id. So doing a rearm from the previous Fabric (or a PASE session) should be against the spec.

ivmarkov commented 1 month ago

@kedars Finally back to this issue today, after a flurry of other bugfixes.

The bad news: no, your suggestion won't work :( ... and yes, we DO receive the second ArmFailSafe request STILL over the original PASE session, as I was afraid initially. (I'll provide the logs shortly, for your inspection - currently shortening them a bit so that you don't have to grep too much.)

The good news: I'm now sure that you are slightly misinterpreting the spec and missing an important paragraph in there. :) You said:

Once an AddNOC is received, this should update the mode to SessionMode::Case(new_fabric_index) thus fixing the problem in the current scenario?

Yes, but only if after AddNOC the comissioner does not send us Failsafe "ReArm" still over BLE. Because if it does so, it will do it over the PASE session, while we have just changed our expectation to be, that any failsafe commands will come over a CASE session.

The way I interpret the specification here is that the ownership of the context is now passed to the new Fabric Id. So doing a rearm from the previous Fabric (or a PASE session) should be against the spec.

Yes, the re-arm comes from the same old PASE session. :) BUT: it is not from the previous fabric! we should have upgraded our PASE session to the new fabric (and by the way, the fact that our current code does not even support associating a fabric index with a PASE session is wrong, will fix that). The spec says (the paragraph I mentioned you are missing):

" If the current secure session was established with PASE, the receiver SHALL: a. Augment the secure session context with the FabricIndex generated above, such that subse­quent interactions have the proper accessing fabric. "

The C++ SDK faithfully does just that. (And supports fabric index for PASE sessions too and supports upgrading the PASE session fabric index - but does not support upgrading the fabric index of a CASE session - which the spec prohibits explicitly further down in the paragraph I started quoting.)

The above by the way means that - within the Failsafe struct - we should not even care what was the session type that opened the failsafe originally - as we do now (which I'll remove), while the C++ SDK doesn't. We only need to know what is the current fabric index in the failsafe (if any), and whether add noc and/or update noc were called. Just like in the C++ SDK.

ivmarkov commented 1 month ago

Here, a shortened log for your entertainment (all over BLE btw). Grep the !!!!!!!!!! stuff I've embedded:

>>>>> BTP 4E:19:47:80:FA:3A [S,SID:0,CTR:a8ed7fe,SRC:6da43a0e9c970244][I,EID:646a,PROTO:0,OP:20]
SC::PBKDFParamRequest
!!!!!!!!!! ^^^ Plain text session starts (SID:0), with PASE negotiation
(skipping the nego that follows)
....................

I (65699) rs_matter::transport::session: New exchange: 1::0 [SID:1,RSID:2182,EID:646b] :: Responder(AcceptPending)
I (65699) rs_matter::transport::core: 
>>>>> BTP 4E:19:47:80:FA:3A [SID:1,CTR:2005f9][I,EID:646b,PROTO:1,OP:2]
IM::ReadRequest
 => Processing (new exchange)
I (65719) rs_matter::transport::core: Exchange 1::0 [SID:1,RSID:2182,EID:646b]: Accepted
I (65719) rs_matter::respond: Responder: Handler 0 / exchange 1::0: Starting
I (65739) rs_matter::transport::exchange: 
!!!!!!!!!! ^^^ Here's our newly-negotiated PASE session (SID:1)
(skipping non-interesting stuff thast follows, like a bunch of IM::ReadRequest reads)
....................

I (66939) rs_matter::transport::network::btp::session: 
>>>>> (BTP IO) 4E:19:47:80:FA:3A [A|B|E,ACTR:b,CTR:a,LEN:41]
READ 65B
I (66959) rs_matter::transport::session: New exchange: 1::0 [SID:1,RSID:2182,EID:6472] :: Responder(AcceptPending)
I (66959) rs_matter::transport::core: 
>>>>> BTP 4E:19:47:80:FA:3A [SID:1,CTR:200600][I,EID:6472,PROTO:1,OP:8]
IM::InvokeRequest
 => Processing (new exchange)
I (66979) light: Light toggled
I (66979) rs_matter::transport::core: Exchange 1::0 [SID:1,RSID:2182,EID:6472]: Accepted
I (66989) rs_matter::respond: Responder: Handler 0 / exchange 1::0: Starting
I (66999) rs_matter::data_model::sdm::general_commissioning: Handling command ARM Fail Safe
I (67009) rs_matter::transport::exchange: 
!!!!!!!!!! ^^^ Starts to get interesting, we got Arm fail safe over the PASE session
(again, skipping non-interesting stuff that follows, like regulatory config, cert chain request and so on)
....................

I (73149) rs_matter::transport::network::btp::session: 
<<<<< (BTP IO) 4E:19:47:80:FA:3A [A|B|E,ACTR:11,CTR:19,LEN:3f]
WRITE 63B
I (73219) rs_matter::transport::network::btp::session: 
>>>>> (BTP IO) 4E:19:47:80:FA:3A [A|B,ACTR:19,CTR:12,LEN:15b]
READ 239B
I (73279) rs_matter::transport::network::btp::session: 
>>>>> (BTP IO) 4E:19:47:80:FA:3A [C|E,CTR:13]
READ 108B
I (73289) rs_matter::transport::session: New exchange: 1::0 [SID:1,RSID:2182,EID:6479] :: Responder(AcceptPending)
I (73289) rs_matter::transport::core: 
>>>>> BTP 4E:19:47:80:FA:3A [SID:1,CTR:200607][I,EID:6479,PROTO:1,OP:8]
IM::InvokeRequest
 => Processing (new exchange)
I (73309) rs_matter::transport::core: Exchange 1::0 [SID:1,RSID:2182,EID:6479]: Accepted
I (73309) rs_matter::respond: Responder: Handler 0 / exchange 1::0: Starting
I (73319) rs_matter::data_model::sdm::noc: Handling command AddNOC
I (73329) rs_matter::data_model::sdm::noc: Received NOC as:  Version: [0]
      [2]
 Serial Num: [1]
 Signature Algorithm:
     ECDSA with SHA256
 Issuer: ... (let's skip the cert)
!!!!!!!!!! ^^^ OK, here we get the NOC over our PASE session
(again, I'll skip non-interesting stuff that follows, like configuring Wifi stuff (we are still running over BLE))
....................

I (73689) rs_matter::transport::network::btp::session: 
<<<<< (BTP IO) 4E:19:47:80:FA:3A [A|B|E,ACTR:14,CTR:1b,LEN:3f]
WRITE 63B
I (73759) rs_matter::transport::network::btp::session: 
>>>>> (BTP IO) 4E:19:47:80:FA:3A [A|B|E,ACTR:1b,CTR:15,LEN:41]
READ 65B
I (73769) rs_matter::transport::session: New exchange: 1::0 [SID:1,RSID:2182,EID:647b] :: Responder(AcceptPending)
I (73769) rs_matter::transport::core: 
>>>>> BTP 4E:19:47:80:FA:3A [SID:1,CTR:200609][I,EID:647b,PROTO:1,OP:8]
IM::InvokeRequest
 => Processing (new exchange)
I (73789) rs_matter::transport::core: Exchange 1::0 [SID:1,RSID:2182,EID:647b]: Accepted
I (73799) rs_matter::respond: Responder: Handler 0 / exchange 1::0: Starting
I (73809) rs_matter::data_model::sdm::general_commissioning: Handling command ARM Fail Safe
I (73819) rs_matter::transport::exchange: 
<<<<< BTP 4E:19:47:80:FA:3A [SID:2182,CTR:b790a91][EID:647b,PROTO:1,OP:9]
IM::InvokeResponse
!!!!!!!!!! ^^^ And... this is the problem! We just got a SECOND ArmFailSafe over... our PASE session! (SID:1) - we haven't seen a SINGLE Case session request (and won't see it in BLE)
!!!!!!!!!! So here, we'll error out if we have changed in the failsafe `session_type` to be SessionType::Case(...)

(below, non-interesting stuff I'll skip)
....................
ivmarkov commented 1 month ago

I think this is now ready for review.

The reason why this PR has a bigger changeset is because I changed the representation of fabric_index from u8 to NonZeroU8 in all places, where we do expect a valid fabric index (i.e. where 0 - which is "no fabric" or "invalid fabric" - is actually not expected and should not be accepted). For example:

While NonZeroU8 cannot protect us from fabric index 255 (which is also invalid), I think the fact that we can now differentiate between the two cases (no fabric is OK vs fabric should be there) is already beneficial for the readibility of the code.

Btw, Option<NonZeroU8> is technically and representationally equivalent to u8 in that the None value is modeled by the compiler with the value "0". So I preserved u8 to represent the case where "no fabric is OK", also because NonZeroU8::new(<u8>) gives Option<NonZeroU8> anyway i.e. converting between these two equivalent representations is trivial.