project-chip / connectedhomeip

Matter (formerly Project CHIP) creates more connections between more objects, simplifying development for manufacturers and increasing compatibility for consumers, guided by the Connectivity Standards Alliance.
https://buildwithmatter.com
Apache License 2.0
7.29k stars 1.95k forks source link

TestDiagnosticLogsDownloadCommand fails during darwin CI runs quite often #32636

Open nivi-apple opened 5 months ago

nivi-apple commented 5 months ago

The diagnostic log tests seem to fail quite frequently. Logs - https://github.com/project-chip/connectedhomeip/actions/runs/8347127817/job/22854062448?pr=32631 for an instance

Disabled in https://github.com/project-chip/connectedhomeip/pull/34260

Please re-enable in https://github.com/project-chip/connectedhomeip/blob/master/scripts/tests/chiptest/__init__.py#L170, once this is addressed.

bzbarsky-apple commented 4 months ago

This is failing on:

ERROR   21:13:51.686 - TEST OUT  :      ***** Test Step 12 : Read End User Support log intent after a failure to make sure that everything still works

Is it possible that the test is getting confused about which test the "failure" belongs with or something?

bzbarsky-apple commented 4 months ago

@vivien-apple Could you take a look, please?

harimau-qirex commented 3 months ago

I looked into this a bit, and here's a brief summary of what I believe is happening:

There are a couple things here. The immediate cause is this mystery internal error, as that causes the test to move on to the next step so the expected failure is inadvertently associated with the wrong step.

A broader issue is that the tests aren't quite detailed enough in what they're expecting. Some steps expect that an error occurs but doesn't specify what kinds of error are acceptable, so the test treats any error as the step passing rather than stop the test when the wrong errors occur.

I ran into a case where this broader issue caused the tests to erroneously succeed. One of the steps right before these transfers the network diagnostics logs normally. However, the test moved on before the all clusters app sent the final ACK for the BlockAckEof message. This meant the all clusters app didn't consider the BDX transfer completed. When the network diagnostics logs were requested again, the all clusters app responded with a busy status since the previous transfer was still ongoing. The test saw the error from the busy status and treated it as the expected timeout-related error, instead of stopping.

nivi-apple commented 2 months ago

Uploaded PR #33788 but that does not fix the issue. It looks like we might not be aborting the transfer for timeout or internal errors on the Initiator side. This needs more analysis to fix it properly. I am working on it.

nivi-apple commented 2 months ago

So I figured out why the test is failing intermittently. I am using logs from here as reference - https://github.com/project-chip/connectedhomeip/actions/runs/9372056858/job/25802710559?pr=33672.

It looks like for the test -***** Test Step 11 : Read Network Diagnostic log intent with a very short timeout and a very long log

We send a request to initiate BDX -

ERROR 19:23:46.404 - APP OUT : [1717529026404] [67942:190636] [EM] <<< [E:60809i S:61167 M:18707577] (S) Msg TX to 1:233CE82C33836403 [6D7B] [UDP:192.168.64.25:59528] --- Type 0002:01 (BDX:SendInit)

For some reason the Init fails and instead of a SendAccept we get a status report and that is not expected and hence the next test fails.

ERROR 19:23:47.411 - TEST OUT : ***** Test Step 12 : Read End User Support log intent after a failure to make sure that everything still works ERROR 19:23:47.450 - APP OUT : [1717529027450] [67942:190664] [BDX] OutputEvent type: StatusReceived ERROR 19:23:47.451 - APP OUT : [1717529027450] [67942:190664] [BDX] Diagnostic logs transfer: StatusReport Error 5f

To alleviate this, I guess we can make sure we get a sendaccept before we proceed with success tests. I will also corroborate with couple other logs to see if its the same issue

nivi-apple commented 2 months ago

also it looks like sometimes the status report gets through before reset happens. Other times the status report is not sent. I had observed this when initially writing the code for downloading logs. I need to look into the code and make sure we ensure the status report is sent out before calling reset.