virtio-win / kvm-guest-drivers-windows

Windows paravirtualized drivers for QEMU\KVM
https://www.linux-kvm.org/page/WindowsGuestDrivers
BSD 3-Clause "New" or "Revised" License
2.04k stars 387 forks source link

Windows with NetKVM has a little chance hang during shutdown #350

Closed npes87184 closed 5 years ago

npes87184 commented 5 years ago

If the NetKVM has not initialized yet (or reset), QEMU will ignore the control message. It will cause driver hanging in infinite while loop for waiting response (which leads to windows hang in shutdown). In our test, there is a little chance to receive an OID for setting multicast list after device reset. To make symptom more clear, I have added some log. Below is the log and call stack when this situation happens:

Logs:

[ParaNdis_OnShutdown]=> Common [ParaNdis_ResetVirtIONetDevice] Done [ParaNdis_OnOidSetMulticastList] Go [ParaNdis_UpdateDeviceFilters] Do [CParaNdisCX::SendControlMessage] enter: 4277 - cls: 0, cmd: 0 CParaNdisCX::SendControlMessage - ERROR: in while

Assertion: *** DPC watchdog timeout This is NOT a break in update time This is most likely a BUG in an ISR Perform a stack trace to find the culprit The period will be doubled on continuation Use gh to continue!!

Call stack:

kd> k ChildEBP RetAddr
8b582530 83ab50be nt!KeAccumulateTicks+0x3c5 8b582570 83ab4f6b nt!KeUpdateRunTime+0x145 8b5825c8 83ab9c17 nt!KeUpdateSystemTime+0x613 8b5825c8 83a10122 nt!KeUpdateSystemTimeAssist+0x13 8b582680 8e16db70 hal!HalpTscStallExecutionProcessor+0xc0 WARNING: Stack unwind information not available. Following frames may be wrong. 8b582700 8e16b216 netkvm+0x4b70 8b582730 8e16c418 netkvm+0x2216 8b582740 8e16f219 netkvm+0x3418 8b582750 8e17740f netkvm+0x6219 8b5827a8 880fc32b netkvm+0xe40f 8b5827f0 880a8ce3 ndis!ndisMDoOidRequest+0x3a1 8b582818 880a87ff ndis!ndisFQueueRequestOnNext+0x21e 8b582880 8df21086 ndis!NdisFOidRequest+0x181 8b5828a0 880fc961 pacer!PcFilterRequest+0x78 8b582900 880a8c5c ndis!ndisFDoOidRequest+0x1a2 8b582928 880a87ff ndis!ndisFQueueRequestOnNext+0x197 8b582990 8df15b22 ndis!NdisFOidRequest+0x181 8b5829b0 880fc961 wfplwf!FilterOidRequest+0x50 8b582a10 880a7799 ndis!ndisFDoOidRequest+0x1a2 8b582a3c 8810d602 ndis!ndisQueueRequestOnTop+0x224 8b582ab0 8810d686 ndis!ndisMOidRequest+0x1dd 8b582ac0 882a4646 ndis!NdisOidRequest+0xf 8b582b7c 88254d1d tcpip!FlpNdisRequestUnderReference+0x7d 8b582bb8 8825523b tcpip!FlpMulticastListSet+0xa1 8b582bc4 88251a2e tcpip!FlpSerializedNdisDeleteGroupWorker+0xd 8b582bec 83c4f460 tcpip!FlpSerializedNdisRequestWorkerRoutine+0x41 8b582c00 83ab7aab nt!IopProcessWorkItem+0x23 8b582c50 83c43f5e nt!ExpWorkerThread+0x10d 8b582c90 83aeb219 nt!PspSystemThreadStartup+0x9e 00000000 00000000 nt!KiThreadStartup+0x19

I proposed two approaches to solve this problem:

  1. apply the following patch to simply prevent infinite waiting.
commit bf5ba3319594cef72d279a0d82137141fcff6581
Author: yuchenlin <yuchenlin@synology.com>
Date:   Fri Nov 30 13:50:34 2018 +0800

    NetKVM: Prevent hanging in SendControlMessage

    Windows with NetKVM has a little chance hang during shutdown.

    If the NetKVM has not initialized yet (or reset), QEMU will ignore
    the control message. It will cause driver hanging in infinite while loop
    for waiting response (which leads to windows hang in shutdown). In our
    test, there is a little chance to receive an OID for setting multicast list
    after device reset. To make symptom more clear, I have added some log.
    Below is the log and call stack when this situation happens:

    Logs:

    [ParaNdis_OnShutdown]=>
    Common [ParaNdis_ResetVirtIONetDevice] Done
    [ParaNdis_OnOidSetMulticastList] Go
    [ParaNdis_UpdateDeviceFilters] Do
    [CParaNdisCX::SendControlMessage] enter: 4277 - cls: 0, cmd: 0
    CParaNdisCX::SendControlMessage - ERROR: in while
    Assertion: *** DPC watchdog timeout
        This is NOT a break in update time
        This is most likely a BUG in an ISR
        Perform a stack trace to find the culprit
        The period will be doubled on continuation
        Use gh to continue!!

    Call stack:

    kd> k
    ChildEBP RetAddr
    8b582530 83ab50be nt!KeAccumulateTicks+0x3c5
    8b582570 83ab4f6b nt!KeUpdateRunTime+0x145
    8b5825c8 83ab9c17 nt!KeUpdateSystemTime+0x613
    8b5825c8 83a10122 nt!KeUpdateSystemTimeAssist+0x13
    8b582680 8e16db70 hal!HalpTscStallExecutionProcessor+0xc0
    WARNING: Stack unwind information not available. Following frames may be
    wrong.
    8b582700 8e16b216 netkvm+0x4b70
    8b582730 8e16c418 netkvm+0x2216
    8b582740 8e16f219 netkvm+0x3418
    8b582750 8e17740f netkvm+0x6219
    8b5827a8 880fc32b netkvm+0xe40f
    8b5827f0 880a8ce3 ndis!ndisMDoOidRequest+0x3a1
    8b582818 880a87ff ndis!ndisFQueueRequestOnNext+0x21e
    8b582880 8df21086 ndis!NdisFOidRequest+0x181
    8b5828a0 880fc961 pacer!PcFilterRequest+0x78
    8b582900 880a8c5c ndis!ndisFDoOidRequest+0x1a2
    8b582928 880a87ff ndis!ndisFQueueRequestOnNext+0x197
    8b582990 8df15b22 ndis!NdisFOidRequest+0x181
    8b5829b0 880fc961 wfplwf!FilterOidRequest+0x50
    8b582a10 880a7799 ndis!ndisFDoOidRequest+0x1a2
    8b582a3c 8810d602 ndis!ndisQueueRequestOnTop+0x224
    8b582ab0 8810d686 ndis!ndisMOidRequest+0x1dd
    8b582ac0 882a4646 ndis!NdisOidRequest+0xf
    8b582b7c 88254d1d tcpip!FlpNdisRequestUnderReference+0x7d
    8b582bb8 8825523b tcpip!FlpMulticastListSet+0xa1
    8b582bc4 88251a2e tcpip!FlpSerializedNdisDeleteGroupWorker+0xd
    8b582bec 83c4f460 tcpip!FlpSerializedNdisRequestWorkerRoutine+0x41
    8b582c00 83ab7aab nt!IopProcessWorkItem+0x23
    8b582c50 83c43f5e nt!ExpWorkerThread+0x10d
    8b582c90 83aeb219 nt!PspSystemThreadStartup+0x9e
    00000000 00000000 nt!KiThreadStartup+0x19

    Signed-off-by: yuchenlin <yuchenlin@synology.com>

diff --git a/NetKVM/Common/ParaNdis-CX.cpp b/NetKVM/Common/ParaNdis-CX.cpp
index 69516dbb..68f98435 100644
--- a/NetKVM/Common/ParaNdis-CX.cpp
+++ b/NetKVM/Common/ParaNdis-CX.cpp
@@ -102,17 +102,24 @@ BOOLEAN CParaNdisCX::SendControlMessage(
         if (0 <= m_VirtQueue.AddBuf(sg, nOut, 1, (PVOID)1, NULL, 0))
         {
             UINT len;
+            void *p;
             m_Context->m_CxStateMachine.RegisterOutstandingItem();

             m_VirtQueue.Kick();
-            while (!m_VirtQueue.GetBuf(&len))
+            p = m_VirtQueue.GetBuf(&len);
+            for (int i = 0; i < 10 && !p; ++i)
             {
                 UINT interval = 1;
                 NdisStallExecution(interval);
+                p = m_VirtQueue.GetBuf(&len);
             }
             m_Context->m_CxStateMachine.UnregisterOutstandingItem();

-            if (len != sizeof(virtio_net_ctrl_ack))
+            if (!p)
+            {
+                DPrintf(0, "%s - ERROR: get_buf failed\n", __FUNCTION__);
+            }
+            else if (len != sizeof(virtio_net_ctrl_ack))
             {
                 DPrintf(0, "%s - ERROR: wrong len %d\n", __FUNCTION__, len);
             }
  1. Implement a fine-tuned synchronous mechanism to prevent sending any event to QEMU when the device is not ready. For example: before SendControlMessage, we should check driver state by checking pContext->bDeviceInitialized. However, if it does not have a fine-tuned synchronous mechanism, checking and setting bDeviceInitialized may run in any order. It still cause the problem. But, I don't know how many effort to do this.

I want to discuss this problem and the solution.

Thanks, Yu-Chen Lin

npes87184 commented 5 years ago

Related #206 Include @sameehj @ladipro

Thanks, Yu-Chen Lin

YanVugenfirer commented 5 years ago

Hello Yu-Chen,

Can you please send pull request with your patch?

Thanks, Yan.

Sent from my iPhone

On Nov 30, 2018, at 07:08, yuchenlin notifications@github.com wrote:

Related #206 Include @sameehj @ladipro

Thanks, Yu-Chen Lin

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or mute the thread.

npes87184 commented 5 years ago

Hi, Yan

Sure, see #351

Thanks, Yu-Chen Lin

npes87184 commented 5 years ago

Close due to #351 merged. Thanks all!