project-chip / connectedhomeip

Matter (formerly Project CHIP) creates more connections between more objects, simplifying development for manufacturers and increasing compatibility for consumers, guided by the Connectivity Standards Alliance.
https://buildwithmatter.com
Apache License 2.0
7.53k stars 2.03k forks source link

`waitForReport` YAML instruction is racy #19861

Open mrjerryjohns opened 2 years ago

mrjerryjohns commented 2 years ago

Problem

CI Tests have been failing a lot more on Darwin recently. One example of that is here.

In that test, there is CPU starvation/context switching happening that's exposing some racy test logic.

That test in particular does a few things:

  1. Sends UpOrOpen command to get the window blind 'moving'. That should change the OperationalStatus value immediately to 0x21.
  2. Waits for 2s to get some initial movement.
  3. Subscribes to the OperationalStatus attribute with a min-interval of 4s, max of 5s, which returns back the value of 0x21. However, at this point, the test app side has a big, unexplained jump in timestamp for some reason:
2022-06-21T19:30:30.6266150Z 2022-06-21 19:30:30.626 ERROR   19:29:08.528 - TEST OUT  : [0;34m[1655839748528] [28482:124272] CHIP: [DMG]                DataVersion = 0x7da58729,[0m
2022-06-21T19:30:30.6267080Z 2022-06-21 19:30:30.626 ERROR   19:29:08.528 - TEST OUT  : [0;34m[1655839748528] [28482:124272] CHIP: [DMG]                AttributePathIB =[0m
2022-06-21T19:30:30.6267950Z 2022-06-21 19:30:30.626 ERROR   19:29:19.707 - TEST OUT  : [0;34m[1655839748528] [28482:124272] CHIP: [DMG]                {[0m
2022-06-21T19:30:30.6269390Z 2022-06-21 19:30:30.626 ERROR   19:29:19.707 - TEST OUT  : [0;34m[1655839748528] [28482:124272] CHIP: [DMG]                    Endpoint = 0x1,[0m
2022-06-21T19:30:30.6270010Z 2022-06-21 19:30:30.626 ERROR   19:29:19.707 - TEST OUT  : [0;34m[1655839748528] [28482:124272] CHIP: [DMG]                    Cluster = 0x102,[0m

This time jump of 10s happens right while it's parsing the data.

  1. Then, we call StopMotion on the target. That results in a report being generated immediately containing a value of 0 for the status back to the client, because on the server, the minimum interval requirement of 4s had already been satisfied.
  2. We wait for 3s for some reason on the test in Step 2b, and then check if we received the report. The report was already received earlier however, so we end up waiting and subsequently, timing out waiting for the change.

This test only passes when there is no CPU starvation happening, because it relies on the min interval of 4s holding off the report generation till we've gotten to the right test stage. That is really racy.

Proposal

waitForReport YAML command should really be checking for the report to have been received right from the previous call to subscribeAttribute, and not when the that test step is executed.

mrjerryjohns commented 2 years ago

FYI @vivien-apple @woody-apple @bzbarsky-apple

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

ericzijian1994 commented 1 year ago

Think the problem is still happening...

bzbarsky-apple commented 1 year ago

Yes, https://github.com/project-chip/connectedhomeip/pull/28257 is not merged yet.