microsoft / msquic

Cross-platform, C implementation of the IETF QUIC protocol, exposed to C, C++, C# and Rust.

c0000005 (Access violation) in msquic!QuicPacketBuilderPrepare #2249

Closed Digiover closed 2 years ago

Digiover commented 2 years ago

Describe the bug

After enabling HTTP3 and AltSvc response headers in the Windows registry (per instructions), the server occasionally reboots because of a bugcheck.

Affected OS

Additional OS information

(Get-CimInstance Win32_OperatingSystem).version
10.0.20348

Also available are IIS, ASP.NET 4.8, .NET 5.0 & 6.0, .NET Core 3.1, PHP 8.1.1, 8.0.13 and 7.4.26 (all PHP is FastCgi)

MsQuic version

main

Steps taken to reproduce bug

  1. Enable Http3 and AltSvc in the Windows registry, per the instructions mentioned earlier:
     &reg.exe add "HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\services\HTTP\Parameters" /v EnableHttp3 /t REG_DWORD /d 1 /f
     &reg.exe add "HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\services\HTTP\Parameters" /v EnableAltSvc /t REG_DWORD /d 1 /f
  2. After a day or two, the server spontaneously reboots, leaving a MEMORY.DMP behind in C:\Windows.

Expected behavior

The server should not reboot.

Actual outcome

I'm no WinDbg pro, but this is what !analyze -v gave me:


Microsoft (R) Windows Debugger Version 10.0.22000.194 X86
Copyright (c) Microsoft Corporation. All rights reserved.

Loading Dump File [D:\dumps\MEMORY.DMP]
Kernel Bitmap Dump File: Kernel address space is available, User address space may not be available.

Symbol search path is: srv*
Executable search path is: 
Unable to load image \SystemRoot\system32\ntoskrnl.exe, Win32 error 0n2
Windows 10 Kernel Version 20348 MP (12 procs) Free x64
Product: Server, suite: TerminalServer SingleUserTS
Edition build lab: 20348.1.amd64fre.fe_release.210507-1500
Machine Name:
Kernel base = 0xfffff804`7ea00000 PsLoadedModuleList = 0xfffff804`7f6338f0
Debug session time: Thu Dec 23 08:26:34.213 2021 (UTC + 1:00)
System Uptime: 0 days 13:25:13.645
Unable to load image \SystemRoot\system32\ntoskrnl.exe, Win32 error 0n2
Loading Kernel Symbols
...............................................................
................................................................
...
Loading User Symbols

Loading unloaded module list
.....
For analysis of this file, run !analyze -v
0: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

SYSTEM_THREAD_EXCEPTION_NOT_HANDLED (7e)
This is a very common bugcheck.  Usually the exception address pinpoints
the driver/function that caused the problem.  Always note this address
as well as the link date of the driver/image that contains this address.
Arguments:
Arg1: ffffffffc0000005, The exception code that was not handled
Arg2: fffff8048499c8a8, The address that the exception occurred at
Arg3: ffffa30592aecd78, Exception Record Address
Arg4: ffffa30592aec590, Context Record Address

Debugging Details:
------------------

KEY_VALUES_STRING: 1

    Key  : AV.Dereference
    Value: NullClassPtr

    Key  : AV.Fault
    Value: Read

    Key  : Analysis.CPU.mSec
    Value: 3280

    Key  : Analysis.DebugAnalysisManager
    Value: Create

    Key  : Analysis.Elapsed.mSec
    Value: 17062

    Key  : Analysis.Init.CPU.mSec
    Value: 5952

    Key  : Analysis.Init.Elapsed.mSec
    Value: 119792

    Key  : Analysis.Memory.CommitPeak.Mb
    Value: 77

    Key  : WER.OS.Branch
    Value: fe_release

    Key  : WER.OS.Timestamp
    Value: 2021-05-07T15:00:00Z

    Key  : WER.OS.Version
    Value: 10.0.20348.1

BUGCHECK_CODE:  7e

BUGCHECK_P1: ffffffffc0000005

BUGCHECK_P2: fffff8048499c8a8

BUGCHECK_P3: ffffa30592aecd78

BUGCHECK_P4: ffffa30592aec590

EXCEPTION_RECORD:  ffffa30592aecd78 -- (.exr 0xffffa30592aecd78)
ExceptionAddress: fffff8048499c8a8 (msquic!QuicPacketBuilderPrepare+0x0000000000000668)
   ExceptionCode: c0000005 (Access violation)
  ExceptionFlags: 00000000
NumberParameters: 2
   Parameter[0]: 0000000000000000
   Parameter[1]: 0000000000000021
Attempt to read from address 0000000000000021

CONTEXT:  ffffa30592aec590 -- (.cxr 0xffffa30592aec590)
rax=ffffcb09b5d40628 rbx=ffffa30592aed0e0 rcx=0000000000000002
rdx=ffffa30592aed278 rsi=0000000000000000 rdi=ffffa30592aed259
rip=fffff8048499c8a8 rsp=ffffa30592aecfb0 rbp=0000000000000000
 r8=0000000000000000  r9=00000000000004d0 r10=ffffcb09afbc7120
r11=ffffcb09b21dd0d8 r12=ffffcb09b5d404ff r13=ffffcb09c331d160
r14=0000000000000000 r15=00000000000000a7
iopl=0         nv up ei pl nz ac pe nc
cs=0010  ss=0018  ds=002b  es=002b  fs=0053  gs=002b             efl=00050212
msquic!QuicPacketBuilderPrepare+0x668:
fffff804`8499c8a8 440fb67521      movzx   r14d,byte ptr [rbp+21h] ss:0018:00000000`00000021=??
Resetting default scope

BLACKBOXBSD: 1 (!blackboxbsd)

BLACKBOXNTFS: 1 (!blackboxntfs)

BLACKBOXPNP: 1 (!blackboxpnp)

BLACKBOXWINLOGON: 1

PROCESS_NAME:  System

READ_ADDRESS:  0000000000000021 

ERROR_CODE: (NTSTATUS) 0xc0000005 - The instruction at 0x%p referenced memory at 0x%p. The memory could not be %s.

EXCEPTION_CODE_STR:  c0000005

EXCEPTION_PARAMETER1:  0000000000000000

EXCEPTION_PARAMETER2:  0000000000000021

EXCEPTION_STR:  0xc0000005

STACK_TEXT:  
ffffa305`92aecfb0 fffff804`849c8a5c     : ffffcb09`b5d40c38 ffffcb09`000004d0 ffffcb09`b5d40500 ffffcb09`b5d40400 : msquic!QuicPacketBuilderPrepare+0x668
ffffa305`92aed070 fffff804`849c7a69     : 9dd90faf`08abbfcb ffffcb09`b5d40d01 9dd90faf`08abbfcb ffffa305`00000003 : msquic!QuicPacketBuilderPrepareForControlFrames+0x3c
ffffa305`92aed0a0 fffff804`8499b8d9     : 00000000`00000000 ffffcb09`b5d405a8 ffffcb09`b5d40de8 ffffcb09`b5d40de8 : msquic!QuicSendPathChallenges+0xad
ffffa305`92aed3f0 fffff804`8499a943     : fffff804`84990000 ffffcb09`afbee040 ffffcb09`00000000 00000000`00000006 : msquic!QuicSendFlush+0x189
ffffa305`92aed8e0 fffff804`849a62e6     : ffffcb09`b5d404c0 ffffffff`ffffffff 00000000`00000001 0624dd2f`1a9fbe77 : msquic!QuicConnDrainOperations+0x303
ffffa305`92aed9e0 fffff804`849a60ae     : 00000000`00000000 ffffcb09`afbee040 ffffcb09`afbee040 ffffcb09`b5d40400 : msquic!QuicWorkerProcessConnection+0x126
ffffa305`92aedae0 fffff804`7ece8375     : ffffcb09`afbc9100 00000000`00000080 fffff804`849a5be0 ffffcb09`afbee040 : msquic!QuicWorkerThread+0x4ce
ffffa305`92aedbf0 fffff804`7ee1a468     : fffff804`7bfb1180 ffffcb09`afbc9100 fffff804`7ece8320 00000000`00000000 : nt!PspSystemThreadStartup+0x55
ffffa305`92aedc40 00000000`00000000     : ffffa305`92aee000 ffffa305`92ae8000 00000000`00000000 00000000`00000000 : nt!KiStartSystemThread+0x28

SYMBOL_NAME:  msquic!QuicPacketBuilderPrepare+668

MODULE_NAME: msquic

IMAGE_NAME:  msquic.sys

STACK_COMMAND:  .cxr 0xffffa30592aec590 ; kb

BUCKET_ID_FUNC_OFFSET:  668

FAILURE_BUCKET_ID:  AV_msquic!QuicPacketBuilderPrepare

OS_VERSION:  10.0.20348.1

BUILDLAB_STR:  fe_release

OSPLATFORM_TYPE:  x64

OSNAME:  Windows 10

FAILURE_ID_HASH:  {7cbcdf2e-09bf-fd2a-a333-546ade1cd1a1}

Followup:     MachineOwner
---------

0: kd> lmvm msquic
Browse full module list
start             end                 module name
fffff804`84990000 fffff804`849f2000   msquic   # (pdb symbols)          D:\Windows Kits\10\Debuggers\x86\sym\msquic.pdb\366D19B0DF05481599A99B74432CF64E1\msquic.pdb
    Loaded symbol image file: msquic.sys
    Mapped memory image file: D:\Windows Kits\10\Debuggers\x86\sym\msquic.sys\60EF903662000\msquic.sys
    Image path: \SystemRoot\system32\drivers\msquic.sys
    Image name: msquic.sys
    Browse all global symbols  functions  data
    Timestamp:        Thu Jul 15 03:32:38 2021 (60EF9036)
    CheckSum:         00063FEF
    ImageSize:        00062000
    Translations:     0000.04b0 0000.04e4 0409.04b0 0409.04e4
    Information from resource tables:
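
As a rough cross-check of the dump above, the following Python sketch recomputes the faulting read address from the `[rbp+21h]` operand and the offsets of the faulting instruction inside msquic.sys. All constants are copied from the `.cxr`, `.exr` and `lmvm` output; the helper name `effective_address` is made up for illustration. A read a few bytes above zero is consistent with the `NullClassPtr` classification, i.e. a field being read through a null structure pointer, but the sketch does not prove which pointer was null.

```python
# Values copied from the debugger output above.
RBP = 0x0000000000000000            # rbp in the .cxr register dump
DISPLACEMENT = 0x21                 # from "movzx r14d, byte ptr [rbp+21h]"
MODULE_BASE = 0xFFFFF80484990000    # msquic start address reported by lmvm
FAULT_RIP = 0xFFFFF8048499C8A8      # ExceptionAddress / BUGCHECK_P2
FUNC_OFFSET = 0x668                 # QuicPacketBuilderPrepare+0x668
BUGCHECK_ARG1 = 0xFFFFFFFFC0000005  # sign-extended NTSTATUS


def effective_address(base: int, displacement: int) -> int:
    """Address touched by a [base+displacement] operand, wrapped to 64 bits."""
    return (base + displacement) & 0xFFFFFFFFFFFFFFFF


read_address = effective_address(RBP, DISPLACEMENT)
rip_rva = FAULT_RIP - MODULE_BASE      # offset of the faulting instruction in the image
function_rva = rip_rva - FUNC_OFFSET   # offset of QuicPacketBuilderPrepare in the image
ntstatus = BUGCHECK_ARG1 & 0xFFFFFFFF  # drop the sign extension

print(f"read address : {read_address:#018x}")
print(f"rip RVA      : {rip_rva:#x}")
print(f"function RVA : {function_rva:#x}")
print(f"NTSTATUS     : {ntstatus:#010x}")
```

The computed read address matches the `READ_ADDRESS` field, and the masked status matches `EXCEPTION_CODE_STR`.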

Additional details

There is not much HTTP3 web traffic on the web server yet. According to our Zabbix monitoring of the \quic performance diagnostics\quic connections connected and \quic performance diagnostics\quic streams active counters, "quic connections connected" peaked at 20 and "quic streams active" at 119.

Nuklon commented 2 years ago

Same problem for me: https://github.com/microsoft/msquic/issues/2188. I had reported this to MSRC but it was closed as it wasn't a security issue.

nibanks commented 2 years ago

Yes, we've decided it wasn't a big enough issue to be a security issue. It happened to already be fixed on the latest versions. We'll likely still push a fix to older versions soon enough. But if there's ever a chance of a crash being a security issue, please don't open it on GitHub and follow the security instructions to report it to MSRC.

Digiover commented 2 years ago

> Yes, we've decided it wasn't a big enough issue to be a security issue. It happened to already be fixed on the latest versions. We'll likely still push a fix to older versions soon enough. But if there's ever a chance of a crash being a security issue, please don't open it on GitHub and follow the security instructions to report it to MSRC.

Thanks. How do we know this is a security issue? Or are all access violations (c0000005) security issues?

(you may close this issue with a comment)

Nuklon commented 2 years ago

> Yes, we've decided it wasn't a big enough issue to be a security issue. It happened to already be fixed on the latest versions. We'll likely still push a fix to older versions soon enough. But if there's ever a chance of a crash being a security issue, please don't open it on GitHub and follow the security instructions to report it to MSRC.

Well, it renders quic completely useless on WS2022 as the machine will crash, so it's a significant issue I'd say. When you talk about "latest versions", do you mean WS2022 or msquic? I have fully updated WS2022 and the issue still exists. What would be the timeline to get this fixed on WS2022? If this is taking long, is there a way to manually update msquic driver? I only see the DLLs available on releases. And which commit has fixed this bug?

nibanks commented 2 years ago

> How do we know this is a security issue? Or are all access violations (c0000005) security issues?

Any crash has a good chance of being a security issue if we can figure out how it might be triggered remotely, or even if it can't but there's a good chance of it happening randomly (i.e. race condition).

Nuklon commented 2 years ago

It seems to me then that it is a security issue? As it can be triggered remotely with HTTP requests. Also: what about my questions above? Thank you.

nibanks commented 2 years ago

It can't be easily exploited on purpose by an attacker though. That's why it's not significant enough to be considered a security issue.

nibanks commented 2 years ago

Also, feel free to email me if you'd like me to look at any future issues to let you know if it could be a security issue, if you're unsure.

Nuklon commented 2 years ago

Alright, fair enough.

> When you talk about "latest versions", do you mean WS2022 or msquic? I have fully updated WS2022 and the issue still exists. What would be the timeline to get this fixed on WS2022? If this is taking long, is there a way to manually update msquic driver? I only see the DLLs available on releases.

nibanks commented 2 years ago

The latest MsQuic version. We still plan to service WS2022 with the fix. Just got to work out the logistics. There's no way for someone to update Windows drivers out of band right now.

Digiover commented 2 years ago

@nibanks Please keep us posted on the WS2022 roll-out for the fix.

nibanks commented 2 years ago

Will do.

Nuklon commented 2 years ago

So now that it's closed, any update on roll out date @nibanks?

nibanks commented 2 years ago

I'm working on it. 😄 I will definitely let you know when I have a roll out date.

nibanks commented 2 years ago

@digiover and @nuklon we got the fix approved for servicing, and it should hopefully be available in a month or two.

Nuklon commented 2 years ago

Great, thanks @nibanks. Will you provide an update when it's actually serviced? So we don't have to guess.

nibanks commented 2 years ago

Definitely.

nibanks commented 2 years ago

FYI @Digiover and @Nuklon the fix should be out with the Windows April monthly update, next month. It will require manual update, or will automatically happen with the following month's update (as I understand it). When it all goes live, we will share instructions.

Digiover commented 2 years ago

> FYI @Digiover and @Nuklon the fix should be out with the Windows April monthly update, next month. It will require manual update, or will automatically happen with the following month's update (as I understand it). When it all goes live, we will share instructions.

Thanks @nibanks! Any information on how to update, or is it wrapped in a Windows Update package for this month?

Nuklon commented 2 years ago

msquic.sys seems to be updated with latest KB, it's now signed March 8, and version 1.0.4. Some clarification from @nibanks on whether the fix is in place is welcome 👀

nibanks commented 2 years ago

Yes @nuklon, that should be correct. I believe the version should be 1.0.4.233914-official in the Product version of the .sys file.

Nuklon commented 2 years ago

Thanks @nibanks, 2 days stable now 👍

Nuklon commented 2 years ago

Not sure if this is related to this fix or not, so if you'd like me to open a new issue, I can do so.

Since running this new version, I've had to restart the server twice due to memory getting to 100%. I've looked into it with poolmon and one tag uses ~32GB of RAM:

 Memory:67018228K Avail:  270612K  PageFlts:  9355   InRam Krnl:47756K P:390536K
 Commit:72832896K Limit:84635012K Peak:80802964K            Pool N:37417736K P:635236K
 System pool information
 Tag  Type     Allocs            Frees            Diff       Bytes                  Per Alloc

 Qc1A Nonp   92404489 (   0)      6827 (   0) 92397662 34002339616 (          0)         368

Sorry for bad formatting, I couldn't get this formatted correctly. Here's the screenshot:

image
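
The figures in the poolmon row above can be sanity-checked with a little arithmetic. This is only a sketch using the numbers pasted in this comment; the variable names are chosen here for readability and are not poolmon fields.

```python
# Numbers copied from the Qc1A row of the poolmon output above.
allocs = 92_404_489
frees = 6_827
bytes_reported = 34_002_339_616
per_alloc = 368

outstanding = allocs - frees            # poolmon's "Diff" column
computed_bytes = outstanding * per_alloc
gib = computed_bytes / 2**30
freed_fraction = frees / allocs

print(f"outstanding allocations: {outstanding:,}")
print(f"bytes held by the tag  : {computed_bytes:,} ({gib:.2f} GiB)")
print(f"fraction ever freed    : {freed_fraction:.6%}")
```

The outstanding count times the per-allocation size reproduces the reported byte total exactly, which works out to a little under 32 GiB of nonpaged pool, and only a tiny fraction of the allocations under this tag were ever freed.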

Since the server worked correctly before without any problems, I suspected it was due to msquic, and looking at the code it seems this tag is indeed used by msquic: https://github.com/microsoft/msquic/blob/main/src/inc/quic_platform.h#L96
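
A note for anyone searching the source for this tag: Windows pool tags are usually written in code as reversed character literals, so a tag that poolmon displays as Qc1A would typically be spelled 'A1cQ' in a header. Whether the linked header follows that convention is an assumption here, and the sketch below only illustrates the byte-order relationship between the displayed tag, the 32-bit value, and the reversed spelling.

```python
displayed_tag = "Qc1A"  # as shown by poolmon in the output above

# Pool tags are 32-bit values; on little-endian x64 the bytes appear in memory
# in the order poolmon displays them.
tag_value = int.from_bytes(displayed_tag.encode("ascii"), "little")

# Under the usual convention the source literal is the displayed tag reversed
# (assumed spelling, not checked against the header).
source_literal = displayed_tag[::-1]

print(f"displayed tag : {displayed_tag}")
print(f"32-bit value  : {tag_value:#010x}")
print(f"source literal: '{source_literal}'")
```

Searching the header for the reversed spelling is therefore more likely to find the definition than searching for the tag as poolmon prints it.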

I do not have a memory dump. I tried creating one with notmyfault but the server crashed/rebooted immediately without writing any dump, and I have since disabled quic in IIS again.

My setup is still the same, ASP.NET Core 6.0, IIS, WS2022 latest updates. I have also disabled legacy TLS on most sites.

nibanks commented 2 years ago

@nuklon please open a new issue for this. I've never seen it before. This seems to indicate that sends at the UDP layer are getting leaked. Since the sends are unconditionally freed (at least to the pool) in the completion path (CxPlatDataPathSendComplete to CxPlatSendDataFree), it must mean the IRP isn't getting completed.

cc @thhous-msft

nibanks commented 2 years ago

@Nuklon also please include information about your NIC HW and driver version with the new issue. It could be a driver issue, not returning/completing the UDP send NBLs.

Nuklon commented 2 years ago

Thanks @nibanks, I opened #2658, let's continue there.