virtio-win / kvm-guest-drivers-windows

Windows paravirtualized drivers for QEMU\KVM
https://www.linux-kvm.org/page/WindowsGuestDrivers
BSD 3-Clause "New" or "Revised" License

netkvm with lots of parallel network activity hanging Win10 VM #1058

Open dasoussan opened 6 months ago

dasoussan commented 6 months ago

HangCallStacks-Full.txt

Scenario summary:

Ubuntu 22.04 running Oracle VirtualBox with a Win10 VM underneath; 4 cores and 16 GB RAM allocated to the VM, 32 GB total in the machine. Based on crash-dump analysis, we believe the NetKVM / VirtIO driver is hanging: one vCPU is inside virtqueue_get_buf_split while the other vCPUs spin waiting for the same spinlock.
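
For context on the function at the top of the netkvm frames below: virtqueue_get_buf_split is the split-ring dequeue path. Here is a minimal sketch of what such a dequeue does, based on the virtio 1.x split-ring layout rather than the driver's actual VirtIORing.c (the struct and helper names are mine):

#include <stddef.h>
#include <stdint.h>

/* Minimal split-ring pieces per the virtio 1.x spec (sketch only). */
struct vring_used_elem { uint32_t id; uint32_t len; };
struct vring_used { uint16_t flags; uint16_t idx; struct vring_used_elem ring[]; };

struct virtqueue_split {
    unsigned num;               /* ring size */
    uint16_t last_used_idx;     /* last used entry the driver consumed */
    struct vring_used *used;    /* used ring, written by the device/host */
    void **cookie;              /* per-descriptor cookies saved at add time */
};

/* Hypothetical helper: returns descriptor chain 'id' to the free list. */
static void detach_descriptor_chain(struct virtqueue_split *vq, uint32_t id)
{
    (void)vq; (void)id;
}

/* Sketch of a split-ring dequeue: NULL means the device has published
 * nothing new; otherwise pop one entry from the used ring. */
void *virtqueue_get_buf_sketch(struct virtqueue_split *vq, unsigned *len)
{
    if (vq->last_used_idx == vq->used->idx)     /* no new used entries */
        return NULL;
    struct vring_used_elem *e = &vq->used->ring[vq->last_used_idx % vq->num];
    *len = e->len;                              /* bytes the device wrote */
    void *buf = vq->cookie[e->id];
    detach_descriptor_chain(vq, e->id);
    vq->last_used_idx++;
    return buf;
}

What matters for this hang: the driver's TX completion path calls this in a loop and, as the stacks below show, does so while holding the TX spinlock.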

Details (sorry, this is long - I'm summarizing as best I can):

The VM runs our own code that previously ran on Win7 and runs fine on standalone Win10. We are migrating portions of it (eventually all of it) to Linux, but for now we are in a hybrid environment. The emulated hardware NICs weren't giving the performance we wanted (this main system has 2 x 10G fiber interfaces), and when we tried the NetKVM / VirtIO drivers things performed so much better that that is where we went.

With ~30 devices on the network we would occasionally see some kind of VM hang. VirtualBox would say:

VMMDev: vmmDevHeartbeatFlatlinedTimer: Guest seems to be unresponsive. Last heartbeat received 4 seconds ago.

We thought it was a VirtualBox or file-system problem, since we saw the hang when 30-50 devices were all trying to pull files from the main server.

On a system with 200+ devices on the network it hangs more often, and now not when devices are pulling files but when the system is "idle". I put "idle" in quotes because even then the server queries every device every few seconds with a ping and/or a 'How are you doing today?' message, and devices constantly announce their status via multicast messages that are picked up by any interested listener.

So even an idle system, while not moving a high GB/s volume of traffic, generates a lot of messages per unit time; a quick sniff counted > 350 packets per second.

With the hang happening more often I've instrumented the Win10 VM to be able to generate a crash dump on NMI.
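
(For anyone reproducing this: the NMI-to-crash-dump path is standard Windows plumbing rather than custom instrumentation. My understanding, with the version specifics being an assumption on my part, is that Windows 8 and later bugcheck on NMI automatically once a kernel dump is configured, while older versions needed the classic opt-in value:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\CrashControl
    NMICrashDump (REG_DWORD) = 1

On the VirtualBox side, the NMI itself can be injected with "VBoxManage debugvm <vmname> injectnmi".)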

Hang: the VM appears to be running but is completely unresponsive to any keyboard or mouse input, the clock has stopped updating at the bottom right of the Win10 screen, the network stack does not respond to any pings, and we see no data egressing from the VM when checked with Wireshark. The VM does respond to the NMI, so I've got that going for me.

A lot of hoops and spelunking later, I managed to find a symbol file for the version of netkvm we were running, got it recognized by WinDbg, and can finally show call stacks for the 4 vCPUs with actual function names and line numbers instead of the raw hex I was seeing before.

Here is the call stack of the CPU that received the NMI. I filtered out the uninteresting hex columns, so the first 8 stack entries all relate to the NMI and generating the crash dump. Right under those it is in netkvm:

0: kd> !thread
THREAD fffff80369727a00  Cid 0000.0000  Teb: 0000000000000000 Win32Thread: 0000000000000000 RUNNING on processor 0
Not impersonating
DeviceMap                 ffffb20b8ee36540
Owning Process            fffff80369724a00       Image:         Idle
Attached Process          ffffde034267d040       Image:         System
Wait Start TickCount      128224         Ticks: 334059 (0:01:26:59.671)
Context Switch Count      528260         IdealProcessor: 0             
UserTime                  00:00:00.000
KernelTime                00:29:47.765
Win32 Start Address nt!KiIdleLoop (0xfffff80368e01f10)
Stack Init fffff8036c874c90 Current fffff8036c874c20
Base fffff8036c875000 Limit fffff8036c86f000 Call 0000000000000000
Priority 0 BasePriority 0 PriorityDecrement 0 IoPriority 0 PagePriority 5
: Call Site
: nt!KeBugCheckEx
: nt!HalBugCheckSystem+0x7a
: PSHED!PshedBugCheckSystem+0x10
: nt!WheaReportHwError+0x46e
: nt!HalHandleNMI+0x142
: nt!KiProcessNMI+0x134
: nt!KxNmiInterrupt+0x82
: nt!KiNmiInterrupt+0x212 (TrapFrame @ fffff803`6c897e50)
: netkvm!virtqueue_get_buf_split+0x9d [C:\cygwin64\tmp\build\source\internal-kvm-guest-drivers-windows\VirtIO\VirtIORing.c @ 325] 
: netkvm!virtqueue_get_buf+0x5f (Inline Function @ fffff803`6ea01c26) [C:\cygwin64\tmp\build\source\internal-kvm-guest-drivers-windows\Virtio\virtio.h @ 87] 
: netkvm!CVirtQueue::GetBuf+0x63 (Inline Function @ fffff803`6ea01c26) [C:\cygwin64\tmp\build\source\internal-kvm-guest-drivers-windows\NetKVM\Common\ParaNdis-VirtQueue.h @ 208] 
: netkvm!CTXVirtQueue::ReleaseTransmitBuffers+0xae [C:\cygwin64\tmp\build\source\internal-kvm-guest-drivers-windows\NetKVM\Common\ParaNdis_VirtQueue.cpp @ 294] 
: netkvm!CTXVirtQueue::ProcessTXCompletions+0x38 [C:\cygwin64\tmp\build\source\internal-kvm-guest-drivers-windows\NetKVM\Common\ParaNdis_VirtQueue.cpp @ 327] 
: netkvm!CParaNdisTX::DoPendingTasks::__l2::<lambda_2bb8197fb2982050d71283ed06b8d199>::operator()+0x10 (Inline Function @ fffff803`6e9fee61) [C:\cygwin64\tmp\build\source\internal-kvm-guest-drivers-windows\NetKVM\Common\ParaNdis_TX.cpp @ 725] 
: netkvm!CParaNdisTX::DoWithTXLock+0x1a (Inline Function @ fffff803`6e9fee61) [C:\cygwin64\tmp\build\source\internal-kvm-guest-drivers-windows\NetKVM\Common\ParaNdis-TX.h @ 237] 
: netkvm!CParaNdisTX::DoPendingTasks+0x75 [C:\cygwin64\tmp\build\source\internal-kvm-guest-drivers-windows\NetKVM\Common\ParaNdis_TX.cpp @ 723] 
: netkvm!CParaNdisTX::NBLMappingDone+0x3f [C:\cygwin64\tmp\build\source\internal-kvm-guest-drivers-windows\NetKVM\Common\ParaNdis_TX.cpp @ 432] 
: netkvm!CNBL::RegisterMappedNB+0x27 [C:\cygwin64\tmp\build\source\internal-kvm-guest-drivers-windows\NetKVM\Common\ParaNdis_TX.cpp @ 93] 
: nt!HalpAllocateAdapterCallbackV3+0xa9
: nt!HalpBuildScatterGatherList+0x1af
: nt!HalBuildScatterGatherListV3+0x73
: ndis!NdisMAllocateNetBufferSGList+0x299
: netkvm!CNB::ScheduleBuildSGListForTx+0x2f [C:\cygwin64\tmp\build\source\internal-kvm-guest-drivers-windows\NetKVM\Common\ParaNdis_TX.cpp @ 784] 
: netkvm!CNBL::StartMapping::__l2::<lambda_3029174f739c49f49e8fda0852d91fd3>::operator()+0x8 (Inline Function @ fffff803`6ea004d8) [C:\cygwin64\tmp\build\source\internal-kvm-guest-drivers-windows\NetKVM\Common\ParaNdis_TX.cpp @ 285] 
: netkvm!CNdisList<CNB,CRawAccess,CNonCountingObject>::ForEachPrepareIf+0x1e (Inline Function @ fffff803`6ea004d8) [C:\cygwin64\tmp\build\source\internal-kvm-guest-drivers-windows\NetKVM\Common\ParaNdis-Util.h @ 465] 
: netkvm!CNdisList<CNB,CRawAccess,CNonCountingObject>::ForEach+0x1e (Inline Function @ fffff803`6ea004d8) [C:\cygwin64\tmp\build\source\internal-kvm-guest-drivers-windows\NetKVM\Common\ParaNdis-Util.h @ 446] 
: netkvm!CNBL::StartMapping+0x48 [C:\cygwin64\tmp\build\source\internal-kvm-guest-drivers-windows\NetKVM\Common\ParaNdis_TX.cpp @ 283] 
: netkvm!CParaNdisTX::Send+0x13c [C:\cygwin64\tmp\build\source\internal-kvm-guest-drivers-windows\NetKVM\Common\ParaNdis_TX.cpp @ 411] 
: netkvm!ParaNdis6_SendNetBufferLists+0x83 [C:\cygwin64\tmp\build\source\internal-kvm-guest-drivers-windows\NetKVM\wlh\ParaNdis6_Driver.cpp @ 460] 
: ndis!ndisMSendNBLToMiniportInternal+0xff
: ndis!ndisMSendNBLToMiniport+0xe
: ndis!ndisCallSendHandler+0x59
: ndis!ndisInvokeNextSendHandler+0x191
: ndis!NdisSendNetBufferLists+0x17d
: tcpip!FlpSendPacketsHelper+0x2bd
: tcpip!Fl48pSendArpPacket+0xcb
: tcpip!Fl48SendNeighborSolicitation+0x22
: tcpip!Ipv4pSendNeighborSolicitation+0x60
: tcpip!IppSendNeighborSolicitation+0xf3
: tcpip!IppNeighborSetTimeout+0x3c4

The other 3 call stacks all look similar:

1: kd> !thread
THREAD ffffc901089cb140  Cid 0000.0000  Teb: 0000000000000000 Win32Thread: 0000000000000000 RUNNING on processor 1
Not impersonating
DeviceMap                 ffffb20b8ee36540
Owning Process            fffff80369724a00       Image:         Idle
Attached Process          ffffde034267d040       Image:         System
Wait Start TickCount      125774         Ticks: 336509 (0:01:27:37.953)
Context Switch Count      497827         IdealProcessor: 1             
UserTime                  00:00:00.000
KernelTime                00:29:53.109
Win32 Start Address nt!KiIdleLoop (0xfffff80368e01f10)
Stack Init ffffb00fd4e29c90 Current ffffb00fd4e29c20
Base ffffb00fd4e2a000 Limit ffffb00fd4e24000 Call 0000000000000000
Priority 0 BasePriority 0 PriorityDecrement 0 IoPriority 0 PagePriority 0
Child-SP          RetAddr           : Args to Child                                                           : Call Site
: nt!KxWaitForSpinLockAndAcquire+0x2f
: nt!KeAcquireSpinLockAtDpcLevel+0x5c
: netkvm!CNdisSpinLock::LockDPR+0xa (Inline Function @ fffff803`6e9fee51) [C:\cygwin64\tmp\build\source\internal-kvm-guest-drivers-windows\NetKVM\Common\ParaNdis-Util.h @ 137] 
: netkvm!CDPCSpinLockedContext::{ctor}+0xa (Inline Function @ fffff803`6e9fee51) [C:\cygwin64\tmp\build\source\internal-kvm-guest-drivers-windows\NetKVM\Common\ParaNdis-Util.h @ 226] 
: netkvm!CParaNdisTX::DoWithTXLock+0xa (Inline Function @ fffff803`6e9fee51) [C:\cygwin64\tmp\build\source\internal-kvm-guest-drivers-windows\NetKVM\Common\ParaNdis-TX.h @ 236] 
: netkvm!CParaNdisTX::DoPendingTasks+0x65 [C:\cygwin64\tmp\build\source\internal-kvm-guest-drivers-windows\NetKVM\Common\ParaNdis_TX.cpp @ 723] 
: netkvm!ParaNdis_RXTXDPCWorkBody+0xd1 [C:\cygwin64\tmp\build\source\internal-kvm-guest-drivers-windows\NetKVM\Common\ParaNdis_Common.cpp @ 1811] 
: netkvm!MiniportInterruptDPC+0x5f [C:\cygwin64\tmp\build\source\internal-kvm-guest-drivers-windows\NetKVM\wlh\ParaNdis6_Impl.cpp @ 380] 
: ndis!ndisInterruptDpc+0x3c58a
: nt!KiExecuteAllDpcs+0x30e
: nt!KiRetireDpcList+0x1f4
: nt!KiIdleLoop+0x9e
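
Reading the two stacks together: CPU 0 took the TX lock in DoWithTXLock and is inside virtqueue_get_buf_split, while CPUs 1-3 sit in nt!KxWaitForSpinLockAndAcquire trying to take that same lock from their DPCs. A schematic of the pattern; the names follow the stack frames, but the bodies are my sketch, not the actual ParaNdis sources:

#include <stdatomic.h>
#include <stddef.h>

struct virtqueue_split;   /* from the earlier sketch */
void *virtqueue_get_buf_sketch(struct virtqueue_split *vq, unsigned *len);

typedef struct { atomic_flag held; } DPC_SPIN_LOCK;

static void AcquireAtDpcLevel(DPC_SPIN_LOCK *l)    /* KeAcquireSpinLockAtDpcLevel */
{
    while (atomic_flag_test_and_set(&l->held))
        ;                                          /* CPUs 1-3 are parked here */
}

static void ReleaseFromDpcLevel(DPC_SPIN_LOCK *l)
{
    atomic_flag_clear(&l->held);
}

static DPC_SPIN_LOCK TXLock = { ATOMIC_FLAG_INIT };

void DoPendingTasks(struct virtqueue_split *txq)   /* CParaNdisTX::DoPendingTasks */
{
    unsigned len;
    AcquireAtDpcLevel(&TXLock);                    /* CDPCSpinLockedContext ctor */
    /* ProcessTXCompletions -> ReleaseTransmitBuffers: drain TX completions */
    while (virtqueue_get_buf_sketch(txq, &len) != NULL)
        ;                                          /* CPU 0's NMI landed in here */
    ReleaseFromDpcLevel(&TXLock);                  /* dtor releases the lock */
}

So whatever keeps CPU 0 in the dequeue path, the other three vCPUs can do nothing but spin at DISPATCH_LEVEL on the same lock, which is consistent with the frozen clock and the totally unresponsive guest.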

I'm attaching the full kernel-mode call stacks; the above is included as a teaser. I don't have a complete memory dump of a hang, as that would mean increasing the pagefile size by a non-trivial amount, and that could make the bug disappear - so I'm being as minimally invasive as possible right now.

This all happened with the Red Hat NDIS Miniport Driver, version 100.92.104.22900 (as reported by the netkvm.sys file properties in Windows 10), which I believe corresponds to version 229 of the driver and the associated source / executable files I found here:

https://fedorapeople.org/groups/virt/virtio-win/direct-downloads/archive-virtio/

I saw that 229 dates back to when I was looking for better performance, and I see there is a v248 driver out now. I took both code bases and did some side-by-side comparisons: I didn't see any differences in VirtIORing.c, but I did see some in ParaNdis_TX.cpp near where we hang, though I didn't drill into the code in detail. There were enough differences that I'm forking off one thread of effort to reproduce the problem with the v248 driver from the same site; it hasn't hung yet, but it also hasn't been cooking very long, and this is quite an elusive intermittent issue.

Another thread of effort was to detail what I've seen here, on the off chance this hang condition has already been detected and fixed and I just couldn't find such a scenario in the various bugs I've perused.

I'll post back if I get another hang.

Thank you! David Soussan

dasoussan commented 6 months ago

Over the weekend, I created a stress test scenario and can now replicate the problem on multiple machines with both the older and newer drivers. I haven't done the crash dump analysis on the .dmp files yet, but wanted to get this info into your hands sooner rather than later.

The stress test consists of a batch file spawning 50-70 iperf3 instances on different ports in Windows 10, all in server mode.

From a different Linux box (a normal Ubuntu 22.04 install) I run 50-70 iperf3 clients, half in normal mode and half in -R (reverse) mode, each doing UDP at 1 Mb/s. The hang happens anywhere from 2 to 8 hours into this test.

Batch file on the client: KillMe.bat
Script on the server: KillYou.sh

I copy out the appropriate column of the attached spreadsheet to a script file.

KillMeKillYou.xlsx

YanVugenfirer commented 6 months ago

@dasoussan I suggest testing the scenario on QEMU as well.

It might be a "familiar" case of TX packets getting stuck on the host side. We have encountered such issues before, and in the vast majority of cases they were due to the host network stack. It is also worth sending the NMI a couple more times to check whether the driver is spending its time in the same place.

YanVugenfirer commented 6 months ago

@dasoussan can you please post the scripts in the comment?

dasoussan commented 6 months ago

Batch file KillMe (runs on the Windows 10 side):

for /l %%p in (5001,1,5071) do start f:\das\iperf321\iperf3.exe -s -p %%p

Shell script KillYou.sh (runs on the Linux 22.04 lab machine):

#!/bin/sh

for p in $(seq 5001 5071); do
    if [ $((p % 2)) -eq 0 ]; then
        iperf3 -u -R -c 10.192.0.2 -b 1M -t 3600 -p $p &
    else
        iperf3 -u -c 10.192.0.2 -b 1M -t 3600 -p $p &
    fi
done

dasoussan commented 6 months ago

Here is the traffic graph (attached: TrafficGraph) - it isn't a traffic-volume issue.

In the blue-braced area I'm developing and tweaking the script; after that it runs a solid 3.0 MB/s up and down for a few hours and hangs at the red lines. This was running the newer drivers, so I need to get the symbol file and convince WinDbg to digest it. More as I know it...

dasoussan commented 6 months ago

Attached are 3 more snips of crash dumps - all from ~50-70 iperf3 UDP runs on 2 different physical machines. The first 2 are running the newer v248 version of netkvm, the latest I could find.

The file titled semi-hang was weird: the time still updated on the Windows VM taskbar at the bottom right, no windows responded to any mouse or keyboard events, yet network traffic was still flowing, so stuff was still running.

The other two were the solid lock-ups, time not updating, etc.

All three had the VBox log message

VMMDev: vmmDevHeartbeatFlatlinedTimer: Guest seems to be unresponsive. Last heartbeat received 4 seconds ago.

Switching everything to QEMU isn't an option for many reasons.

I restarted the tests after switching both test machines to the "Intel PRO/1000 MT Server (82545EM)" NIC, restarted both virtual machines, and have been running the KillMe / KillYou scripts for ~7 hours on each; so far no crashes. I'll keep them cooking at least overnight and report results in the morning.
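
(For the record, the NIC swap is a VirtualBox-level change; "Win10" below is a placeholder VM name and adapter 1 is assumed. --nictype1 virtio puts the paravirtualized NIC back.)

VBoxManage modifyvm "Win10" --nictype1 82545EM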

Semi-Hang20240311-1.txt Hang20240311-2.txt Hang20240311-3.txt

YanVugenfirer commented 6 months ago

It looks like the driver is spending its time retrieving descriptors for packets the host has already transmitted, and at the same time it is presumably being flooded with RX packets.

How many NICs do you have in the guest? How many VMs are running simultaneously on the host?

Try to do the following (each is a separate experiment):

  1. Disable RSS via the Advanced tab in Device Manager.
  2. In the registry, under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Class\{4D36E972-E325-11CE-BFC1-08002BE10318}\<number depending on your NICs and network config; check that it has a reference to NetKVM>, add a "NumberOfHandledRXPacketsInDPC" DWORD value and set it to 100. (This parameter is not added to the INF by default, so you will have to create the registry value manually; a concrete example follows this list.) Then disable/enable the device in Device Manager.
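
A concrete rendering of step 2 (the 0007 subkey number is a placeholder; it is machine-specific, so pick the subkey under that GUID whose values reference netkvm):

reg add "HKLM\SYSTEM\CurrentControlSet\Control\Class\{4D36E972-E325-11CE-BFC1-08002BE10318}\0007" /v NumberOfHandledRXPacketsInDPC /t REG_DWORD /d 100
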
YanVugenfirer commented 6 months ago

BTW: while you are in Device Manager, please check in the Resources tab what kind of interrupts the device uses (legacy or MSI). You can take a screenshot. In general, MSI interrupts show up as "negative" numbers.

dasoussan commented 6 months ago

I've now got > 24 hours running the same test on the same two machines, but using the "Intel PRO/1000 MT Server (82545EM)" NIC instead of the netkvm driver: no crashes, no hangs, just lots of iperf packets flying around.

All settings were the default.

Right now there is only one NIC in the guest Win10 VM. The host has 3 NICs: one 1G copper RJ45, plus either two 10G SFP+ fiber or one 1G SFP+ copper (it differs between the two testing units, as they are in different projects). I say "right now" because originally we mapped all of the host's NICs into the guest, but months ago we went to a single-NIC-in-the-guest approach. Eventually the two SFP+ ports will be bonded on the host side, active-standby failover, and presented as one NIC to the guest machine. Not yet, but soon.

Only one VM running on the host.

Since I've got both machines in the Intel NIC configuration, I looked and didn't see an RSS setting in that NIC.

I converted one machine back to VirtIO; RSS was enabled by default on the VirtIO adapter. I disabled it on that test machine and am re-running the iperf3 stress tests. Cooking for an hour so far; nothing to report.

Looks like a legacy IRQ (10); screenshot is attached: IRQ

Thanks for looking into this! David

dasoussan commented 6 months ago

A slight correction: I'm only seeing ~30 iperf3 sessions between the lab machine and the guest VM. I dug into it this morning and found the error "fatal error - console device allocation failure - too many consoles in use, max consoles is 32" when I tried to spawn one manually. The error message wasn't visible earlier; the script presumably displayed it, but so briefly that I never saw it. So I'm really doing ~15 x 1 Mb/s iperf3 sessions to the VM and ~15 from it.

dasoussan commented 6 months ago

Got a hang an hour ago, pulled a crash dump, analysis in the morning.

Applied the registry key. (I had 3 entries in the registry from when we had 3 VirtIO NICs; not knowing which was current, I put the key/value in all of them.)

Starting up another stress test overnight. The last one hit after ~15 hours. I only have one test machine for now; I gave the others back to the crew to use in development after changing them back to Intel's virtual NIC.

dasoussan commented 6 months ago

Both thread stacks are here:

memory-hang20240312NoRSS.txt memory-Hang20240312-2-RegistryMods.txt

These cover the no-RSS test case and the registry mods, as requested. I fresh-booted the VM before each test.

VM hangs look the same to me.

Thanks!

dasoussan commented 6 months ago

I've had 3 more hangs today. Not much point in posting the crash dumps - nothing new to see in them. Same fault in the same place.

Any ideas?

YanVugenfirer commented 6 months ago

Can you test your setup on QEMU with MSI interrupts (they will be the default)?
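
Something along these lines would be enough; treat the disk path, sizes, and tap setup as placeholders - the point is that virtio-net-pci uses MSI-X by default:

qemu-system-x86_64 -enable-kvm -m 16384 -smp 4 \
    -drive file=win10.qcow2,format=qcow2 \
    -netdev tap,id=net0,ifname=tap0,script=no,downscript=no \
    -device virtio-net-pci,netdev=net0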

dasoussan commented 6 months ago

@YanVugenfirer, moving everything over to QEMU would only be doable if QEMU already supported all the same commands and machinery we use to manipulate VirtualBox - so it is not a practical path forward.

Curiously #1: I set up everything on completely different hardware - a Dell OptiPlex micro-form-factor PC with the same Linux image - and was able to replicate the hang condition with the NetKVM driver.

We have switched to the Intel driver and are running fine, so our immediate need is solved by not using these drivers.

Curiously #2: we weren't running fine initially; it turns out the VBox Intel driver on the hardware we have does not support TCP large segment offload. Lots of debugging later we discovered this, and things got better once we disabled that in the Windows 10 VM.

Curiously #3: I've tried running the iperf3 stress test with Driver Verifier watching the NetKVM driver and have yet to replicate a failure in that scenario. No doubt the quantum-physics observer effect.

YanVugenfirer commented 6 months ago

Driver Verifier adds delays to the system, so timing bugs might disappear.
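
For reference, the standard way to enable that kind of watch from an elevated prompt is the stock Driver Verifier invocation, followed by a reboot ("verifier /reset" turns it off again):

verifier /standard /driver netkvm.sys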

I am still puzzled as to why VirtualBox does not support MSI interrupts. It could well be a bug in the driver, triggered by a scenario we have not encountered on QEMU for the last ±8 years.

dasoussan commented 6 months ago

The VM was initially set up with the PIIX3 virtual system board instead of the ICH9 board. Some light reading indicates that might be why MSI interrupts weren't supported. That is something I can try on one of my test boxes soon.
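
For reference, assuming the stock VBoxManage CLI and a placeholder VM name, the chipset switch is a one-liner with the VM powered off:

VBoxManage modifyvm "Win10" --chipset ich9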

dasoussanATRO commented 5 months ago

After switching the Win10 VM to the ICH9 chipset instead of PIIX3 and putting the VirtIO / netkvm NIC back in place, I checked the properties of the NIC - I don't even see a Resources tab anymore (screenshot attached).

I'm guessing that means it is using MSI interrupts?

In this configuration I started up a stress test to see what happens. Stay tuned, more to come...

YanVugenfirer commented 5 months ago

Strange... you should still be able to see the Resources tab with MSI interrupts.

dasoussan commented 5 months ago

Changing to the ICH9 chipset does improve the VM's CPU usage when it is sitting mostly idle, but it does not resolve the NetKVM spinlock hang.

ybendito commented 5 months ago

@dasoussan It would be good to look at the crash dump itself; we can get more information from it than from the stack dumps.

dasoussanATRO commented 5 months ago

I've asked whether I can publish that, but I don't think I'll be allowed to. I can run whatever you want against the dump file and publish the results, assuming nothing inside is something I can't share.