project-chip / certification-tool

A test harness and tooling designed to simplify development, testing, and certification for devices, guided by the Connectivity Standards Alliance.
https://csa-iot.org/
Apache License 2.0

[Bug] Waiting time for UI Python Tests Script too long #431

Open angelus-liviu opened 2 weeks ago

angelus-liviu commented 2 weeks ago

Describe the bug

During Matter 1.4 SVE testing we discovered that the waiting time for the UI Python Tests Script is too long; we checked the logs, and nothing is happening in the backend during that period. One suggestion would be to add a configurable argument for the waiting time.
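Such a timeout could be exposed as a simple CLI option. The sketch below is purely illustrative (the flag name `--step-wait-timeout` and the polling helper are hypothetical, not part of the Test Harness today); it only shows the shape such an argument could take:

```python
# Hypothetical sketch only -- the Test Harness does not currently expose this flag.
import argparse
import time


def wait_for_backend(ready_check, timeout_s: float, poll_s: float = 1.0) -> bool:
    """Poll ready_check() until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if ready_check():
            return True
        time.sleep(poll_s)
    return False


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="UI Python test runner (sketch)")
    # Assumed flag name; the real harness would need to define its own option.
    parser.add_argument("--step-wait-timeout", type=float, default=120.0,
                        help="Maximum seconds to wait before starting a Python test step")
    args = parser.parse_args()
    ok = wait_for_backend(lambda: True, timeout_s=args.step_wait_timeout)
    print("backend ready" if ok else "timed out waiting for backend")
```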

Steps to reproduce the behavior

Run the UI Python Tests Script and check the time: it takes around 2 minutes until the test starts.

Expected behavior

The test should start more quickly to improve test efficiency.

Log files

No response

PICS file

No response

Screenshots

(screenshot attached)

Environment

TH v2.11-beta3.1+fall2024 sha: 9c1ef58

Additional Information

No response

bauerschwan commented 2 weeks ago

I cannot confirm this. For me the test has virtually no waiting time, except maybe the 5 seconds or so when showing Parsing SystemCommands...

What might be happening is that the whole commissioning process takes place without any output in the backend log or the TH UI. From the user's point of view it looks like everything is stalling, but in reality the backend is communicating quite a lot with the DUT.

Please check the log of the DUT as well to see if there really is idle time involved on both sides.

LarsOesten commented 2 weeks ago
[1727367517.982296][5882:5882] CHIP:SVR: Commissioning completed successfully
[1727367517.982349][5882:5882] CHIP:DIS: Updating services using commissioning mode 0
[1727367517.992885][5882:5882] CHIP:DIS: CHIP minimal mDNS started advertising.
[1727367518.033533][5882:5882] CHIP:DL: Using WiFi MAC for hostname
[1727367518.033591][5882:5882] CHIP:DIS: Advertise operational node 2139E3573089B50B-0000000012344321
[1727367518.033642][5882:5882] CHIP:DIS: Responding with _matter._tcp.local
[1727367518.033668][5882:5882] CHIP:DIS: Responding with 2139E3573089B50B-0000000012344321._matter._tcp.local
[1727367518.033692][5882:5882] CHIP:DIS: Responding with 2139E3573089B50B-0000000012344321._matter._tcp.local
[1727367518.033708][5882:5882] CHIP:DIS: Responding with D83ADD383AE20000.local
[1727367518.033723][5882:5882] CHIP:DIS: Responding with D83ADD383AE20000.local
[1727367518.033745][5882:5882] CHIP:DIS: Responding with _I2139E3573089B50B._sub._matter._tcp.local
[1727367518.033761][5882:5882] CHIP:DIS: CHIP minimal mDNS configured as 'Operational device'; instance name: 2139E3573089B50B-0000000012344321.
[1727367518.053793][5882:5882] CHIP:DIS: mDNS service published: _matter._tcp
[1727367518.053881][5882:5882] CHIP:IN: Expiring all PASE sessions
[1727367518.053902][5882:5882] CHIP:IN: SecureSession[0xaaaafb30df40]: MarkForEviction Type:1 LSID:7508
[1727367518.053922][5882:5882] CHIP:SC: SecureSession[0xaaaafb30df40, LSID:7508]: State change 'kActive' --> 'kPendingEviction'
[1727367518.053947][5882:5882] CHIP:IN: SecureSession[0xaaaafb30df40]: Released - Type:1 LSID:7508
[1727367518.054180][5882:5882] CHIP:EM: >>> [E:9565r S:0 M:157971648 (Ack:202372015)] (U) Msg RX from 0:AD43A96D275D1FB4 [0000] --- Type 0000:10 (SecureChannel:StandaloneAck) (B:26)
[1727367518.054377][5882:5882] CHIP:EM: Found matching exchange: 9565r, Delegate: (nil)
[1727367518.054438][5882:5882] CHIP:EM: Rxd Ack; Removing MessageCounter:202372015 from Retrans Table on exchange 9565r
[1727367518.055265][5882:5882] CHIP:EM: >>> [E:9566r S:7509 M:188517023 (Ack:258400641)] (S) Msg RX from 1:000000000001B669 [B50B] --- Type 0000:10 (SecureChannel:StandaloneAck) (B:34)
[1727367518.055307][5882:5882] CHIP:EM: Found matching exchange: 9566r, Delegate: (nil)
[1727367518.055337][5882:5882] CHIP:EM: Rxd Ack; Removing MessageCounter:258400641 from Retrans Table on exchange 9566r

[1727367574.826802][5882:5882] CHIP:EM: >>> [E:22679r S:0 M:35984329] (U) Msg RX from 0:A90E2895DB5E3F7C [0000] --- Type 0000:30 (SecureChannel:CASE_Sigma1) (B:196)
[1727367574.826880][5882:5882] CHIP:EM: Handling via exchange: 22679r, Delegate: 0xaaaad79c5048
[1727367574.826928][5882:5882] CHIP:IN: CASE Server received Sigma1 message . Starting handshake. EC 0xaaaafb30a940
[1727367574.826985][5882:5882] CHIP:SC: Received Sigma1 msg
[1727367574.827046][5882:5882] CHIP:SC: Found MRP parameters in the message
[1727367574.827086][5882:5882] CHIP:SC: Peer assigned session key ID 46897
[1727367574.827741][5882:5882] CHIP:SC: CASE matched destination ID: fabricIndex 1, NodeID 0x0000000012344321

There is a break of 56 seconds where nothing happens, even though commissioning is already complete.

The same break is also present in the TH log:

INFO       | 2024-09-26 16:18:38.691746 | Executing Test Step: Start Python test
INFO       | 2024-09-26 16:18:38.692760 | Running Python Test: TC_ACE_1_3
INFO       | 2024-09-26 16:19:25.985052 | Sending command to SDK container: python3 /root/python_testing/scripts/sdk/test_harness_client.py sdk/TC_ACE_1_3 TC_ACE_1_3 --tests test_TC_ACE_1_3 --trace-to json:log --discriminator 3840 --passcode 20202021
INFO       | 2024-09-26 16:19:33.946057 | Test Step Completed [PASSED]: Start Python test
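To make gaps like this easier to spot, one can scan the DUT log for idle periods. The short script below is a rough sketch; it assumes lines start with an epoch timestamp in the `[1727367518.033533][...]` form shown above and reports any silence longer than a threshold:

```python
# Sketch for spotting idle periods in a DUT (all-clusters-app) log like the one above.
import re
import sys

TS_RE = re.compile(r"^\[(\d+\.\d+)\]")  # matches the leading [epoch.seconds] field

def report_gaps(path: str, threshold_s: float = 5.0) -> None:
    prev = None
    with open(path) as f:
        for line in f:
            m = TS_RE.match(line)
            if not m:
                continue
            ts = float(m.group(1))
            if prev is not None and ts - prev > threshold_s:
                print(f"gap of {ts - prev:.1f}s before: {line.rstrip()}")
            prev = ts

if __name__ == "__main__":
    # e.g. python3 find_gaps.py dut.log
    report_gaps(sys.argv[1])
```

Applied to the excerpt above, the jump from 1727367518.055337 to 1727367574.826802 is the roughly 56-second gap in question.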
bauerschwan commented 1 week ago

Please have a look at the logs here.

TH log: th.log

DUT log: dut.log

(please note that the times in the TH log are in GMT+0, the times in the DUT log are in GMT+2. So there is a time difference of 2 hours between them)
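When lining up the two logs, the 2-hour offset has to be subtracted from the DUT timestamps. A minimal helper, assuming plain `HH:MM:SS.fff` timestamps as in the excerpts below:

```python
# Small helper for correlating the two logs: DUT timestamps (GMT+2) are shifted
# back two hours so they can be compared directly with the TH log (GMT+0).
from datetime import datetime, timedelta

DUT_UTC_OFFSET = timedelta(hours=2)  # as noted above; adjust if the DUT clock differs

def dut_to_th_time(dut_ts: str) -> datetime:
    """Convert an 'HH:MM:SS.fff' DUT timestamp to the TH log's time base."""
    t = datetime.strptime(dut_ts, "%H:%M:%S.%f")
    return t - DUT_UTC_OFFSET

# Example from the logs quoted below: 10:16:33.500 in the DUT log
# corresponds to 08:16:33.500 in the TH log.
print(dut_to_th_time("10:16:33.500").time())
```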

The only gaps I see in the TH log are

  1. 08:16:32.569 ... 08:16:48.603 (16 seconds): here the commissioning of the DUT happens (10:16:33.500 ... 10:16:48.875 in the DUT log)

     PYTHON_TEST | 2024-09-27 08:16:32.569 | test_collections.matter.sdk_tests.support.python_testing.models.utils:handle_logs:97 | Didn't find any CertificateAuthorities in storage -- creating a new CertificateAuthority + FabricAdmin...
     PYTHON_TEST | 2024-09-27 08:16:48.603 | test_collections.matter.sdk_tests.support.python_testing.models.utils:handle_logs:97 | Shutting down the stack...
     PYTHON_TEST | 2024-09-27 08:16:48.611 | test_collections.matter.sdk_tests.support.python_testing.models.utils:handle_logs:97 | INFO:root:Final result: PASS !

  2. 08:16:50.459 ... 08:16:55.741 (5 seconds): here the parsing of the XML files happens

     INFO | 2024-09-27 08:16:50.123 | matter_idl.zapxml:ParseXmls:115 | Parsing /app/test_collections/matter/sdk_tests/sdk_checkout/sdk_runner/specifications/chip/global-structs.xml...
     ...
     INFO | 2024-09-27 08:16:50.459 | matter_idl.zapxml:ParseXmls:115 | Parsing SystemCommands...
     INFO | 2024-09-27 08:16:55.741 | test_collections.matter.sdk_tests.support.sdk_container:send_command:207 | Sending command to SDK container: python3 /root/python_testing/scripts/sdk/test_harness_client.py sdk/TC_ACE_1_3 TC_ACE_1_3 --tests test_TC_ACE_1_3 --trace-to json:log --thread-dataset-hex 000300000f02081111111122222222051000112233445566778899aabbccddeeff01021234 --endpoint 0 --nodeId 10 --manual-code 20391736404 --int-arg PIXIT_ENDPOINT:0 --bool-arg allow_provisional:true --PICS /var/tmp/pics

I guess the difference between what you see and what I see is that I am running the TH on a Mac with an Ubuntu VM. My machine is pretty fast, whereas the RPi can face slowdowns when a lot of flash access happens, and that probably corresponds to my second gap.

Maybe someone should look into the performance of these steps on an RPi, because I do agree that a delay of over 5 seconds for each test case is not ideal.
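One way to do that profiling would be to wrap the suspected steps (XML parsing, sending the command to the SDK container, etc.) in a small timing helper and log the elapsed time. The sketch below is only an illustration, not code from the Test Harness, and the step names are placeholders:

```python
# Illustrative timing helper -- not part of the Test Harness code base.
import logging
import time
from contextlib import contextmanager

log = logging.getLogger("step-timing")

@contextmanager
def timed_step(name: str):
    """Log how long the wrapped block takes."""
    start = time.monotonic()
    try:
        yield
    finally:
        log.info("%s took %.2f s", name, time.monotonic() - start)

# Usage around a suspect step (names here are placeholders):
# with timed_step("Parsing cluster XML files"):
#     parse_xmls(xml_paths)
```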

antonio-amjr commented 1 week ago

I've been looking into this issue for a couple of days. I wasn't able to reproduce it in my environment at first, but now I know why.

Since I was using the development container for the backend (on the Raspberry Pi 4), the long delay was not occurring for some reason; the wait time was only about 12 seconds, far from the 56-second wait pointed out in the logs above. When I switched back to the "regular" mode instead of development, I managed to reproduce the problem 100% of the time.

I'll continue to look into why that's happening; maybe this difference between "dev" and "regular" modes will help find the source of the delay.

Regular Mode: log-all-clusters-regular-mode.txt th_regular_mode.log (added some prints to the log to verify the delay point)

Dev Mode: th_dev_mode.log log-all-clusters-dev-mode.txt (added some prints to the log to verify the delay point)