[CERT-TEST-FAILURE] Test_TC_ACL_2_3/4 are very (too?) project-chip implementation specific

Apollon77 commented 2 months ago

The test cases for ACL (stareting with ACL2_3 and 2_4, but also other, are build very implementation specific to "how project chip" has implemented functionalities like chunked list writes and also ACL list even processing and sending. The tests currently relying on the facts that:

writes are always process in exactly the chunks that chiptool sends in and each chunked array change is processed for it's alone inluding validation and event sending for the whole ACL list (like also resetting the event sends all former entries as deleted). The counter example would be a logic that collects all list chunks from one write request and processes them as one list change - but this would lead to completely different events in several cases
acl list change processing is validated implementation specific as chip has implemented it. (see https://github.com/project-chip/connectedhomeip/issues/33487#issuecomment-2118329316) ... so an other implementation, e.g. processing the deletions also "in increasing order" and not decreasing would also lead to a different event order
even if chip has support for "unchunked lists" in ACl (which would also behave differently as the chunked variant regarding sent events and such) this functionality is never really tested with these tests because chiptool always send list in a chunked way (which formally is against the specs recommendation - yes only "SHOULD", but would still be better regarding the amount of data to be transported over the network especially also with missing tag compression ;-) ...)

This implementation specific tests makes it difficult for alternative implementations, unless all these details would be part of the specification and so become requirements. Sure with the right argumentation vendors using alternative implementations could request exceptions from tests, but it adds a difficulty for alternative implementations of the standard.

So I will leave this open as a "Cert blocker" here and I'm available to chat about the topic.

If I missed specification details where some of the topics are defined then please provide details.

Original issue context

Feature Area

Other

Test Case

Test_TC_ACL_2_3/4

Reproduction steps

Test ACL 2.3: Formally the last successful write is in "step 9" of the test ... but Step 18 expects another value.

It seems Step 18 expects the "first of the second entries" from step 17 ... which gave an ConstraintError because tried to write 2 entries while only one is allowed.

I assume that this is because of the chunked write handling added in https://github.com/project-chip/connectedhomeip/blob/master/src/app/clusters/access-control-server/access-control-server.cpp#L353 which is very chip specific because formally there is no requiremenst that I know that any array needs to be transported in a chunked way ... and also not that writes needs to be handled chunked at all.

In fat it is really unexpected in case of acl (and especially in case of ACL where consistence should be the most important topic) that a write command like in step 17 returns an error, BUT had an effect by writing the first of the two chunked parts ...

It would be great to get some background information for this, thank you!

In general the test will also have issues as soon as a controller implementation is not sending the list "always chunked" as chip does, because the logic in the access-control-server.cpp would return an Constraint error without saving any value when it is >1 entry in the array. So this behavior is not consistent.

The same topic seems to be contained in Test 2.4 - here for ACL entries themself in Step 29/30 https://github.com/project-chip/connectedhomeip/blob/master/src/app/tests/suites/certification/Test_TC_ACL_2_4.yaml#L1068

Bug prevalence

always

GitHub hash of the SDK that was being used

master

Platform

core

Anything else?

PS: matter.js would be such an controller which is not automatically chunks every value. We try to get the array unchunked into the data report and only if this does not fit we fallback to chunk the array.

Follow up question aout behavior because the tests only do two entries. What should actually happen when someone wrtes 3 entries - first valid, second add invalid, third add valid ... Then I would assume with the current logic that the second entry in the target list will have the value from the third chunk, right? Whats the reponse? Still ConstraintError because the second entry had that? Or should the third entry "addition" not be processed because the second failed? Would be cool to get some insights here

cecille commented 2 months ago

@mlepage-google @hasty

Apollon77 commented 2 months ago

Ok the more I look into it the more it becomes clear to me that the expectation with chunked writes is to really process them also in their chunks one by one. Matter.js is currently not doing this that way. I can change that.

With this the question (that is unanswered for me also after consulting specs - or I am too blind) I added above is the one left: If on one chunk an error happens are more chunks for the same path also processed or not? I assume yes. In fact the result would be then one response per chunk, e.g. one ok, one error and one ok ... and it is up to the caller to decide whats now up ;-) Correct?

Thank you for your support.

Apollon77 commented 2 months ago

Another follow up question: how exactly data version protected writes work with chinked writes? On acl cluster the specs tell the administrator to ideally protect writes with data version filters. So… when each chunked part is handled as own write then dataversion changes after each. So after first write that will change unless the controller makes assumptions on how be re ion increases. Is that also correct? So data version protected writes should never be chunked ?!

mlepage-google commented 2 months ago

Hello Appollon.

Some clarifications...

First, the extension attribute is optional and if present, must be of length 1. That is why there is a constraint error when attempting to write a second element to that list.

The ACL attribute is mandatory, and must support at least 3 entries per fabric. Therefore, writing 3 entries will never result in a constraint error due to exhausting list length; writing 4 or more entries might (depending on actual supported length) result in such errors, but they would always be at the end, and never have a successful entry written after exhausting the supported length.

An ACL entry may fail other data validity checks, and fail to write for that reason.

In the SDK, when a list is written (say with 3 elements), it is done as follows:

List cleared via list replace (with empty list) action.
First element added via list item append action.
Second element added via list item append action.
Third element added via list item append action.

In that case, if the second item resulted in a failure (e.g. due to invalid item data), the list would still be of length 1 after having attempted to append the second item (and failed). Processing would continue, and the third item would be appended, resulting in a list of length 2. It is as if only the two valid items were sent, except that a write error occurred for the second item.

Apollon77 commented 2 months ago

All right. Thank you for confirmation. In fact this is then also no testing issue and could be closed or moved.

Nevertheless the specification lack details on chunked write operations and how this should work. It was completely unclear to me till now.

In my eyes this behavior is maximum inconsistent because if I send an unchunked array with the 3 items where second is invalid then nothing is written and the constraint error happens for all three. But unassigned that this ship has sailed ;-)

Would you also have an info how chunked writes work with dataversion check on the write? (Also because acl specs propose to use that). In my understanding each processed chunk increase the version so the data version check only works for the first chunk. I know that chip always send array full and so resets the array with the first chunk but the protocol does not dictate that. So someone could really just add items to an existing list. ;-)

Apollon77 commented 2 months ago

Ps: (and slightly off topic) is there a reason why Chip always chunks the data in data reports and writes? If the array would fit into the data report then sending it unchanged would need less bytes then always chunked. Would that not be important for low power devices?

But yes then the test here would get issues because they could behave differently ;-))

mlepage-google commented 2 months ago

The CHIP SDK always chunks writes primarily for implementation simplicity, if I recall correctly.

mlepage-google commented 2 months ago

Regarding the question of how exactly DataVersion works with chunked writes, it's likely the DataVersion increment happens when the last chunk is received, but I'm not sure I recall exactly.

Spec section 10.6.4.7. Collection Types (List) seems to have an example showing a list write as both one IB (interaction block) and multiple IBs (the multiple replace/append blocks I mentioned earlier), each with DataVersion=1, so that would match that behaviour.

Apollon77 commented 2 months ago

The CHIP SDK always chunks writes primarily for implementation simplicity, if I recall correctly.

Ok. Still … in regard to best possibly minimizing traffic that’s maybe nor the best choice. I remember matter 1.3 announcement also contained that less data are transferred as an optimization… that here could be another low hanging fruit ;-) and trust me it is not that complex as it feels. Matter.js does that. ;-)

Additionally now that I think about it - it has the consequence that the current tests might now test all relevant code paths … as example in acl code is there for both options but the „full array“ code path should always ever be executed with zero or one element …

…. Just as thoughts from my opinion.

Apollon77 commented 2 months ago

Re your infos on dataversion: ok thank you for the reference. Puuhh that makes things interesting.

So we have kind of a „transaction“ (summarize multiple changes and „commit“ at the end to increase version) but without having a real „transaction“ (aka rollback everything in case of an error).

And all this over multiple potentially chunked write messages ;-)

Good. Will be fun to implement that. ;-)

Apollon77 commented 2 months ago

When I think deeper then I also ask myself why the spec 1.3 in "10.6.4.3.1. Lists" just tals about "full list" or "clear the list and send appends" ... Combined with a dataVersion security approach it should be easiely possible to just send list modifications. Also the spec examples kind of contain all these examples, just the spec tells is "SHOULD" not be done ... This should be even less bytes needed and should be (because of the dataVersion security guard) end in the same final list ... or?! But yes thats more "spec and protocol feedback" and I have no idea if chip has implemented that ... I assume not becuse eg acl just handled additions beside "full array" and last time I checked in the c++ code I only saw places that handle these two cases, nothing else... Sorry to place it here, but I have no other ways to bring such thoughts/questions to any attention ;-)

mlepage-google commented 2 months ago

… as example in acl code is there for both options but the „full array“ code path should always ever be executed with zero or one element …

In access-control-server.cpp for writing the Extension attribute with an entire list, it only ever handles a list of length 0 or 1 because that attribute list can only ever be length 0 or 1.

https://github.com/project-chip/connectedhomeip/blob/22b9c96b691d3bb5a5d3db510db353b3b2bc2f74/src/app/clusters/access-control-server/access-control-server.cpp#L350

For the ACL attribute, the length of the list being written is not assumed.

https://github.com/project-chip/connectedhomeip/blob/22b9c96b691d3bb5a5d3db510db353b3b2bc2f74/src/app/clusters/access-control-server/access-control-server.cpp#L252

Apollon77 commented 2 months ago

I did most adjustments needed and are currently finalizing all the tests but I need to tell you that the tests here (especially ACE 2_5, 2_6 and such) are veeery chip implementation specific. I now need to mimic the event order based on the code how chip is managing the entries in the relevant cases and send out events. Also the events have the "latestValue" which is not really described in the specs, but in the end is a different value depending on if it is a Removal or other action. All this is nowhere mentioned in the specification. In the end thats all valid and ok to require it, but if such things like the exact event behaviour, orders and such are checked in certification in this depth then also the specification should contain these implementation details as requirements. Thats at least my opinion :-)

In fact I spentsome hours today to just run tests, analyze why it broke, adjust my code even if specs did not contain these details, and do this in loops ;) Same for the specific write request handling for "splitted/chunked list processing". The specification (at least not clear to me) is not containing all the details whats required and tested. I think all details in specification would have saved me a lot of this time.

Please see this as feedback from a developer that implements the protocol mainly after specs (as it should work out).

PS: just as example what I mean with implementation specific: see https://github.com/project-chip/connectedhomeip/blob/f4e02d4373e1ad01e8967d721112caf383b3c1f7/src/app/clusters/access-control-server/access-control-server.cpp#L251-L272

New values are added up-counting 0,1,2 ... old values then removed down-counting 5,4,3 ... and tests check exactly this behavior and order of events ;-)

PS: If someone is interested I could try to collect all the things I would love to get added to the specs and post here, (at least collecting the topics)

Apollon77 commented 1 month ago

In addition to above I also see a spec issue here ...

Spec tells in 6.6.4. Access Control Cluster update side-effects

Updates to the Access Control Cluster SHALL take immediate effect in the Access Control system.

and then having the example of "1. Action: Write (1st path)" ... beside the fact that I would not know any real way to do a write to "just a sub field" currently this is strange: So if i reframe that example to the current chunking-lists realitythen the first write path action is the complete reset of the array to be empty! so the first real value gets added in the second Write action. So if this would become effective really directly then all other acl actions will indeed fail because the acl list is empty.

I know that this example wants to show how a "when changing acl and doing another write action afterwards to an non acl attribute" scenario should work this is still leading me to the above question and more assumptions. Like "immediately effective" could now mean to just get effective as soon as the attribute path changes from acl to any other path - that would at least solve the above issue with "resetting becomes effective immediately" ...

So it would be great to know how this should be really handled with all chunked writes. Does ACL list changes really become effective directly, but "ignoring" the initial write action to clear the list? Or is the new entry just effective after all writes are done? In both cases specification would require an update in my pov.

Also in general this "proceed the writes entries one by one" also could have effects in e.g. bindings or groups. In both cases lists are used to handle things. If we also here clear the lists and then "write one by one with immediately becoming effective" means that if I have 5 groups/bindings and I add a sixths then the first write clears the list and then the first entry gets added and formally all others removed. if this becomes effective directly too that would mean we stop listening to all these groups or disconnect from all bound devices except the first and then we rebuild all these connections and listening sockets again newly and add one more. or am I wrong or miss a point?

Apollon77 commented 1 month ago

Ok, I verified one of the things by experimenting with chip: It seems to me that a "write to the acl attribute" is processed completely in the write message and becomes effective then. This is the reason that clearing the array as first write path action or adding acl entries in strange order is now screwing up the whole acl system. So with this assumption is that chunked list writes are processed as a sequence in chip until the "write path" (regarding endpoint/cluster/attribute) changes and are then being updated in the acl system. At least so it seems to me.

project-chip / connectedhomeip