microsoft / msquic

Cross-platform, C implementation of the IETF QUIC protocol, exposed to C, C++, C# and Rust.
MIT License
4.05k stars 530 forks source link

Misc.StreamReliableReset Test Failed #3891

Closed nibanks closed 1 year ago

nibanks commented 1 year ago

Describe the bug

During a unrelated PR run, this test failed once (in kernel mode) and 2 other kernel builds crashed ([1] [2]), which may be related (we don't know for sure).

From quic.log, I see:

[1]18E8.055C::2023/10/01-19:31:55.303198100 [Microsoft-Quic][test] File: D:\a\msquic\msquic\src\test\lib\DataTest.cpp, Function: QuicTestStreamReliableReset, Line: 3398
[1]18E8.055C::2023/10/01-19:31:55.303200300 [Microsoft-Quic][test] FAIL: Context.ClientStreamShutdownComplete.WaitTimeout(TestWaitTimeout) not true
[1]18E8.055C::2023/10/01-19:31:55.303271700 [Microsoft-Quic][test]<--- Try Reliably Shutting Down Stream 1

I noticed a few things from the logs:

  1. We shouldn't be trying to run any drop-based logic on Release mode (since it's not supported). I forgot about this when I did the review.
  2. In the first loop of the test, even though the app indicated a reliable offset of 5000, the send logic still sends all 10000 bytes. It should stop sending at the reliable offset to save space.
  3. QUIC_TRACE_API_TYPE is not in sync with map_QUIC_TRACE_API_TYPE in the ETW manifest.
  4. In the second loop, which eventually, times out, it looks to be because the STOP_SENDING frame was never sent from the client, even though it called the API to abort the receive. This is the real bug that caused the test failures.
  5. This event, [Microsoft-Quic][strm][0xFFFF9F021563E5E0] Reset Reliable ACKed in OnResetReliableAck. Send side. UnAckedOffset=7077, ReliableOffsetSend=5000, isn't necessary because (1) you log the reliable offset when it's originally set,a nd (2) we already log the unacked offset: [Microsoft-Quic][strm][0xFFFF9F021563E5E0] SF:0 FC:72613 QS:10000 MAX:9890 UNA:7077 NXT:9890 RECOV:0-0. Perhaps we should just update that second log to include reliable offset (though I'm not sure).

Affected OS

Additional OS information

Windows kernel mode

MsQuic version

main

Steps taken to reproduce bug

Run automation.

Expected behavior

All tests pass

Actual outcome

One test doesn't and we also see crashes.

Additional details

No response

ProjectsByJackHe commented 1 year ago

Is it plausible that the test failure D:\a\msquic\msquic\src\test\bin\quic_gtest.cpp(2123): error: Value of: DriverClient.Run(( ((0x00000012) << 16) | ((( 0x0002 )) << 14) | ((114) << 2) | (0) )) led to the chain reaction that gave the erroneous logs?

Also interesting how this same test passes in another PR, wondering if the DriverClient.run is the flaky part? https://github.com/microsoft/msquic/actions/runs/6366747961/job/17284890779?pr=3889

I am doubting whether its the core code that led to this, or the flaky test code (DriverClient.run error) that led to this.

Would be nice to get a test failure of JUST the datatest client timeout, without additional confounding variables like the DriverClient failure.