microsoft / windows-container-tools

Collection of tools to improve the Windows Containers experience
MIT License

Logmonitor possibly causing server to crash #195

Open SRJames opened 1 day ago

SRJames commented 1 day ago

Describe the bug
We have intermittent issues with Windows servers restarting. A memory dump created at the time of a restart, analyzed with WinDbg, had the contents below. There are two mentions of LogMonitor.exe.

***** Preparing the environment for Debugger Extensions Gallery repositories **
ExtensionRepository : Implicit
UseExperimentalFeatureForNugetShare : true
AllowNugetExeUpdate : true
NonInteractiveNuget : true
AllowNugetMSCredentialProviderInstall : true
AllowParallelInitializationOfLocalRepositories : true
EnableRedirectToV8JsProvider : false

-- Configuring repositories
----> Repository : LocalInstalled, Enabled: true
----> Repository : UserExtensions, Enabled: true

Preparing the environment for Debugger Extensions Gallery repositories completed, duration 0.016 seconds

***** Waiting for Debugger Extensions Gallery to Initialize **

Waiting for Debugger Extensions Gallery to Initialize completed, duration 0.015 seconds
----> Repository : UserExtensions, Enabled: true, Packages count: 0
----> Repository : LocalInstalled, Enabled: true, Packages count: 29

Microsoft (R) Windows Debugger Version 10.0.26100.1742 X86
Copyright (c) Microsoft Corporation. All rights reserved.

Loading Dump File [C:\Windows\MEMORY.DMP]
Kernel Bitmap Dump File: Kernel address space is available, User address space may not be available.

Primary dump contents written successfully

***** Path validation summary **
Response Time (ms) Location
Deferred srv*https://msdl.microsoft.com/download/symbols
Symbol search path is: srv*https://msdl.microsoft.com/download/symbols
Executable search path is:
Windows 10 Kernel Version 20348 MP (8 procs) Free x64
Product: Server, suite: TerminalServer DataCenter SingleUserTS
Edition build lab: 20348.859.amd64fre.fe_release_svc_prod2.220707-1832
Kernel base = 0xfffff80342000000 PsLoadedModuleList = 0xfffff80342c33a10
Debug session time: Mon Sep 30 12:12:52.474 2024 (UTC + 0:00)
System Uptime: 0 days 5:02:02.125
Loading Kernel Symbols
...............................................................
................................................................
.........................Page 106689 not present in the dump file. Type ".hh dbgerr004" for details
.......................................
.........................................
Loading User Symbols

Loading unloaded module list
.......................
For analysis of this file, run !analyze -v
5: kd> !analyze -v


PROCESS_HAS_LOCKED_PAGES (76)
Caused by a driver not cleaning up correctly after an I/O.
Arguments:
Arg1: 0000000000000000, Locked memory pages found in process being terminated.
Arg2: ffff900b45fd73c0, Process address.
Arg3: 0000000000000003, Number of locked pages.
Arg4: 0000000000000000, Pointer to driver stacks (if enabled) or 0 if not.
Issue a !search over all of physical memory for the current process pointer. This will yield at least one MDL which points to it. Then do another !search for each MDL found, this will yield the IRP(s) that point to it, revealing which driver is leaking the pages.
Otherwise, set HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management\TrackLockedPages to a DWORD 1 value and reboot. Then the system will save stack traces so the guilty driver can be easily identified. When you enable this flag, if the driver commits the error again you will see a different BugCheck - DRIVER_LEFT_LOCKED_PAGES_IN_PROCESS (0xCB) - which can identify the offending driver(s).

Debugging Details:

KEY_VALUES_STRING: 1

Key  : Analysis.CPU.mSec
Value: 6765

Key  : Analysis.Elapsed.mSec
Value: 6744

Key  : Analysis.IO.Other.Mb
Value: 0

Key  : Analysis.IO.Read.Mb
Value: 0

Key  : Analysis.IO.Write.Mb
Value: 0

Key  : Analysis.Init.CPU.mSec
Value: 2312

Key  : Analysis.Init.Elapsed.mSec
Value: 59991

Key  : Analysis.Memory.CommitPeak.Mb
Value: 83

Key  : Bugcheck.Code.KiBugCheckData
Value: 0x76

Key  : Bugcheck.Code.LegacyAPI
Value: 0x76

Key  : Dump.Attributes.AsUlong
Value: 1000

Key  : Dump.Attributes.DiagDataWrittenToHeader
Value: 1

Key  : Dump.Attributes.ErrorCode
Value: 0

Key  : Dump.Attributes.LastLine
Value: Dump completed successfully.

Key  : Dump.Attributes.ProgressPercentage
Value: 100

Key  : Failure.Bucket
Value: 0x76_LogMonitor.exe_nt!MmDeleteProcessAddressSpace

Key  : Failure.Hash
Value: {4dc10a00-52b6-9324-e14f-5fd73286e667}

Key  : Hypervisor.Enlightenments.Value
Value: 8496

Key  : Hypervisor.Enlightenments.ValueHex
Value: 2130

Key  : Hypervisor.Flags.AnyHypervisorPresent
Value: 1

Key  : Hypervisor.Flags.ApicEnlightened
Value: 1

Key  : Hypervisor.Flags.ApicVirtualizationAvailable
Value: 0

Key  : Hypervisor.Flags.AsyncMemoryHint
Value: 0

Key  : Hypervisor.Flags.CoreSchedulerRequested
Value: 0

Key  : Hypervisor.Flags.CpuManager
Value: 0

Key  : Hypervisor.Flags.DeprecateAutoEoi
Value: 0

Key  : Hypervisor.Flags.DynamicCpuDisabled
Value: 1

Key  : Hypervisor.Flags.Epf
Value: 0

Key  : Hypervisor.Flags.ExtendedProcessorMasks
Value: 0

Key  : Hypervisor.Flags.HardwareMbecAvailable
Value: 0

Key  : Hypervisor.Flags.MaxBankNumber
Value: 0

Key  : Hypervisor.Flags.MemoryZeroingControl
Value: 0

Key  : Hypervisor.Flags.NoExtendedRangeFlush
Value: 1

Key  : Hypervisor.Flags.NoNonArchCoreSharing
Value: 1

Key  : Hypervisor.Flags.Phase0InitDone
Value: 1

Key  : Hypervisor.Flags.PowerSchedulerQos
Value: 0

Key  : Hypervisor.Flags.RootScheduler
Value: 0

Key  : Hypervisor.Flags.SynicAvailable
Value: 0

Key  : Hypervisor.Flags.UseQpcBias
Value: 0

Key  : Hypervisor.Flags.Value
Value: 4730893

Key  : Hypervisor.Flags.ValueHex
Value: 48300d

Key  : Hypervisor.Flags.VpAssistPage
Value: 1

Key  : Hypervisor.Flags.VsmAvailable
Value: 0

Key  : Hypervisor.RootFlags.AccessStats
Value: 0

Key  : Hypervisor.RootFlags.CrashdumpEnlightened
Value: 0

Key  : Hypervisor.RootFlags.CreateVirtualProcessor
Value: 0

Key  : Hypervisor.RootFlags.DisableHyperthreading
Value: 0

Key  : Hypervisor.RootFlags.HostTimelineSync
Value: 0

Key  : Hypervisor.RootFlags.HypervisorDebuggingEnabled
Value: 0

Key  : Hypervisor.RootFlags.IsHyperV
Value: 0

Key  : Hypervisor.RootFlags.LivedumpEnlightened
Value: 0

Key  : Hypervisor.RootFlags.MapDeviceInterrupt
Value: 0

Key  : Hypervisor.RootFlags.MceEnlightened
Value: 0

Key  : Hypervisor.RootFlags.Nested
Value: 0

Key  : Hypervisor.RootFlags.StartLogicalProcessor
Value: 0

Key  : Hypervisor.RootFlags.Value
Value: 0

Key  : Hypervisor.RootFlags.ValueHex
Value: 0

Key  : SecureKernel.HalpHvciEnabled
Value: 0

Key  : WER.OS.Branch
Value: fe_release_svc_prod2

Key  : WER.OS.Version
Value: 10.0.20348.859

BUGCHECK_P1: 0

BUGCHECK_P2: ffff900b45fd73c0

BUGCHECK_P3: 3

BUGCHECK_P4: 0

FILE_IN_CAB: MEMORY.DMP

DUMP_FILE_ATTRIBUTES: 0x1000

PROCESS_NAME: LogMonitor.exe

BLACKBOXBSD: 1 (!blackboxbsd)

BLACKBOXNTFS: 1 (!blackboxntfs)

BLACKBOXPNP: 1 (!blackboxpnp)

BLACKBOXWINLOGON: 1

STACK_TEXT:
ffffc283613b20c8 fffff803428be6b5 : 0000000000000076 0000000000000000 ffff900b45fd73c0 0000000000000003 : nt!KeBugCheckEx
ffffc283613b20d0 fffff80342798f01 : ffff900b45fd7808 ffffc283613b2190 ffff900b2945e040 ffff900b45fd73c0 : nt!MmDeleteProcessAddressSpace+0x126845
ffffc283613b2120 fffff803427c9b50 : ffff900b45fd7390 ffff900b45fd7390 0000000000000000 0000000000000000 : nt!PspProcessDelete+0x171
ffffc283613b21c0 fffff80342376c67 : 0000000000000000 0000000000000000 ffff900b45fd77f8 ffff900b45fd73c0 : nt!ObpRemoveObjectRoutine+0x80
ffffc283613b2220 fffff803427a6e3c : 0000000000000000 ffff900b526c55f8 0000000000000000 ffff900b526c55f8 : nt!ObfDereferenceObjectWithTag+0xc7
ffffc283613b2260 fffff803427c9b50 : ffff900b526c5090 ffff900b526c5090 fffff80342c263c0 0000000000000000 : nt!PspThreadDelete+0x33c
ffffc283613b22d0 fffff80342376c67 : 0000000000000000 0000000000000000 fffff80342c263c0 ffff900b526c50c0 : nt!ObpRemoveObjectRoutine+0x80
ffffc283613b2330 fffff803422ad482 : 0000000000000000 0000000000000000 0000000000000000 fffff80342c2f0e0 : nt!ObfDereferenceObjectWithTag+0xc7
ffffc283613b2370 fffff803422f5151 : ffff900b2945e040 fffff80342d3d6c0 fffff80300000000 fffff80300000000 : nt!PspReaper+0x72
ffffc283613b23a0 fffff803422757d5 : ffff900b2945e040 0000000000000000 ffff900b2945e040 0000000000000080 : nt!ExpWorkerThread+0x161
ffffc283613b25b0 fffff80342425458 : ffffb100b9929180 ffff900b2945e040 fffff80342275780 0000000000000000 : nt!PspSystemThreadStartup+0x55
ffffc283613b2600 0000000000000000 : ffffc283613b3000 ffffc283613ac000 0000000000000000 0000000000000000 : nt!KiStartSystemThread+0x28

SYMBOL_NAME: nt!MmDeleteProcessAddressSpace+126845

MODULE_NAME: nt

STACK_COMMAND: .cxr; .ecxr ; kb

IMAGE_NAME: ntkrnlmp.exe

BUCKET_ID_FUNC_OFFSET: 126845

FAILURE_BUCKET_ID: 0x76_LogMonitor.exe_nt!MmDeleteProcessAddressSpace

OS_VERSION: 10.0.20348.859

BUILDLAB_STR: fe_release_svc_prod2

OSPLATFORM_TYPE: x64

OSNAME: Windows 10

FAILURE_ID_HASH: {4dc10a00-52b6-9324-e14f-5fd73286e667}

Configuration
- Tool: LogMonitor
- Version: 2.0.2

Additional context
LogMonitor is used as the entrypoint for some of our containers. These are running in AWS EKS on this AMI: Windows_Server-2022-English-Full-EKS_Optimized-1.30-2024.09.10. When this issue occurs, the EC2 instance restarts but becomes unavailable in the cluster.

iankingori commented 1 day ago

Looking into this @SRJames, mind sharing the memory dump through this email address: WindowsContainerGitHubIssues@service.microsoft.com?

SRJames commented 1 day ago

Will try. It is 2 GB.

SRJames commented 1 day ago

Email sent.

SRJames commented 1 day ago

Some additional info. In the System Event log that reported the reboot there was this Error entry.

The computer has rebooted from a bugcheck. The bugcheck was: 0x00000076 (0x0000000000000000, 0xffff900b45fd73c0, 0x0000000000000003, 0x0000000000000000). A dump was saved in: C:\Windows\MEMORY.DMP. Report Id: eb5eb746-1c40-4df9-b275-94ff106ea2a3.
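
For anyone correlating several of these reboots, the bugcheck code and its four parameters can be pulled out of that event text mechanically. Below is a minimal Python sketch, assuming the message keeps the format shown above. `parse_bugcheck` is a hypothetical helper name and is not part of LogMonitor or WinDbg. The meaning of the second and third parameters (process address, number of locked pages) is taken from the `!analyze -v` output earlier in this issue.

```python
import re

# Event text copied from the System event log entry above.
EVENT_TEXT = (
    "The computer has rebooted from a bugcheck. The bugcheck was: 0x00000076 "
    "(0x0000000000000000, 0xffff900b45fd73c0, 0x0000000000000003, "
    "0x0000000000000000). A dump was saved in: C:\\Windows\\MEMORY.DMP."
)

# One hex code followed by four comma-separated hex parameters in parentheses.
PATTERN = re.compile(
    r"bugcheck was:\s*(0x[0-9a-fA-F]+)\s*\(\s*"
    r"(0x[0-9a-fA-F]+),\s*(0x[0-9a-fA-F]+),\s*(0x[0-9a-fA-F]+),\s*(0x[0-9a-fA-F]+)\s*\)"
)


def parse_bugcheck(text):
    """Return (code, [p1, p2, p3, p4]) as integers, or None if no match."""
    match = PATTERN.search(text)
    if match is None:
        return None
    values = [int(group, 16) for group in match.groups()]
    return values[0], values[1:]


code, params = parse_bugcheck(EVENT_TEXT)
# For bugcheck 0x76, parameter 2 is the process address and parameter 3
# is the number of locked pages (per the !analyze -v text in this issue).
print(f"bugcheck 0x{code:X}, process 0x{params[1]:x}, locked pages {params[2]}")
```

If every reboot shows code 0x76 with a different process address, comparing `PROCESS_NAME` across the dumps would show whether LogMonitor.exe is always the terminating process.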

profnandaa commented 22 hours ago

Caused by a driver not cleaning up correctly after an I/O.

Do you know which driver this is? Inbox driver or installed?
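
If the driver is not known yet, the bugcheck text quoted in the report suggests setting `TrackLockedPages` to a DWORD value of 1 and rebooting, so that the system records stack traces for locked pages. Below is a minimal Python sketch that only builds the text of a `.reg` file for that value. The registry path and value name come from the WinDbg output above; `build_reg_file` is a hypothetical helper name, and this has not been tried on the affected EKS AMI.

```python
# Key path and value name as printed in the !analyze -v output
# (HKLM written out in full, as .reg files require).
KEY = (
    r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control"
    r"\Session Manager\Memory Management"
)
VALUE_NAME = "TrackLockedPages"


def build_reg_file(key, name, dword):
    """Build .reg file text that sets a single DWORD value under `key`."""
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{key}]",
        f'"{name}"=dword:{dword:08x}',
        "",
    ]
    return "\r\n".join(lines)


reg_text = build_reg_file(KEY, VALUE_NAME, 1)
print(reg_text)
```

Saving the printed text to a `.reg` file and importing it requires administrator rights on the node, followed by a reboot. According to the same bugcheck text, a recurrence with the flag enabled should then surface as DRIVER_LEFT_LOCKED_PAGES_IN_PROCESS (0xCB), which can identify the offending driver.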