virtio-win / kvm-guest-drivers-windows

Windows paravirtualized drivers for QEMU/KVM
https://www.linux-kvm.org/page/WindowsGuestDrivers
BSD 3-Clause "New" or "Revised" License

NetKVM driver failure on Windows Server 2019 #396

Closed: mle-ii closed this issue 5 years ago

mle-ii commented 5 years ago

I'm still in the research phase, but we're currently having issues with the NetKVM drivers.

Some details: we're using bhyve for virtualization, the guest is Windows Server 2019, and the drivers are the NetKVM 141 "stable" build obtained from here: https://fedorapeople.org/groups/virt/virtio-win/direct-downloads/archive-virtio/virtio-win-0.1.141-1/ The download doesn't mention 2019 or have a folder for it, so we used the Windows Server 2016 drivers; from what I've read in other resolved issues here, those should work, as they are the same, and running a file hash confirmed they are identical.

The issue is that the network stopped working two nights in a row. The first night we were using the latest 164 drivers, so we rolled back to the 141 stable version, since we had issues with the newer drivers on Windows 2016 servers. The second night, the same thing: the network just stopped.

When I looked at the device in Device Manager, the status said "No drivers are installed for this device," yet the driver files were still present. The device also showed as disabled both there and in the network panel. One other thing: it reported power state D3 rather than D0, though I wasn't sure whether that was simply because it was disabled. I didn't see any power-management setting for this device, so I figured sleep wasn't the cause, but I'm not certain about that.

Other than the errors related to the network being disabled, I couldn't find anything in Event Viewer indicating why the network drivers were disabled.

I also tried the TraceView tool, using the PDB from the ISO we downloaded with the drivers above, as per your readme. But when I tried to create a new session with that PDB, it said "PDB file does not contain provider information," so I was unable to run it. https://github.com/virtio-win/kvm-guest-drivers-windows/blob/master/NetKVM/Documentation/Tracing.md

Do you have any recommendations for how we can go about figuring out the issue with this driver? I'm not very familiar with gathering data for driver issues, so perhaps I'm missing key information in Event Viewer, either by not looking in the right spot or by not having an appropriate log enabled. Or perhaps there is a log file for device drivers with more detail that I'm unaware of.

Please let me know what information you might need to help you or me investigate this issue.

ybendito commented 5 years ago

@mle-ii You can find the tracing GUID for NetKVM in the Trace.h file under the NetKVM directory. Use it to collect trace information with logman, then decode the trace with tracefmt and the matching PDB file. At the default logging level the driver emits logs only when it is enabled or disabled, so disable/enable it via Device Manager to verify you can collect driver logs. There is no reason to use old builds (like 141); build 160 is stable and passed all the required HLK tests. I'd suggest running a batch overnight that prints pings with timestamps; this will show exactly when the network adapter stops working, and then you can check the driver's log files and the system event log around that time.
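
A minimal sketch of such an overnight batch (the gateway address 192.168.1.1 is a placeholder; point it at whatever you normally reach):

    @echo off
    rem Ping once every 10 seconds and record a timestamp with each attempt,
    rem so the log shows exactly when the network died.
    :loop
    echo %date% %time% >> ping_log.txt
    ping -n 1 192.168.1.1 >> ping_log.txt
    timeout /t 10 /nobreak > nul
    goto loop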

ybendito commented 5 years ago

@mle-ii I think the problem might originate in the hypervisor, so it makes sense to check its logs as well. Note that our Windows drivers are aligned with QEMU's virtio device support and tested with QEMU.

mle-ii commented 5 years ago

Thank you, I'll do more research to see if we can figure out what's going on. FWIW, I tried another night and the behaviour was a bit different this time: the driver remained enabled and the power state stayed at D0, but the network did indeed stop working overnight. And it only appears to happen under load; on other nights, when I didn't send web traffic to the node, it stayed up.

I haven't been able to get logging data from the driver yet for some reason. I tried using the GUID here with TraceView and it didn't seem to capture any real data. https://github.com/virtio-win/kvm-guest-drivers-windows/blob/master/NetKVM/Common/Trace.h#L29

Which parameters would you use for logman? I've never used the tool before and couldn't figure out which ones to specify for tracing your driver.

ybendito commented 5 years ago

@mle-ii Trace.zip - I use this batch for logging, with good results. Place tracefmt.exe and the proper PDB file in the directory with the batch, and run it as admin.
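
For anyone who can't grab the attachment, a hedged sketch of what such a batch typically does (the GUID must be replaced with the control GUID from NetKVM\Common\Trace.h; tracepdb and tracefmt ship with the WDK; session and file names are placeholders, not necessarily what Trace.zip uses):

    @echo off
    set GUID={00000000-0000-0000-0000-000000000000}

    rem Start an ETW session for the NetKVM WPP provider with all flags
    rem and the maximum level enabled.
    logman create trace netkvm-trace -p %GUID% 0xFFFFFFFF 0xFF -o netkvm.etl -ets

    rem ... reproduce the problem, then stop the session ...
    pause
    logman stop netkvm-trace -ets

    rem Generate TMF files from the PDB, then decode the binary trace with them.
    tracepdb -f netkvm.pdb -p .\tmfs
    tracefmt netkvm.etl -p .\tmfs -o netkvm.log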

ybendito commented 5 years ago

@mle-ii If the network stops working while the driver seems alive, I'd try disabling/enabling it. I see 3 possibilities:

  1. The driver receives an IP address and the network works (the problem is probably on the driver side)
  2. The driver does not receive an IP address (the problem is probably on the hypervisor/host side)
  3. The driver gets stuck and never finishes the 'disable' (the hypervisor does not return sent packets)

You may change 'Log.Level' on the 'Advanced' page to 1 to get more debug output.

mle-ii commented 5 years ago

Thank you for the trace batch file; I should have noticed that before. I'll give tracing another try; hopefully your script works with the PDB supplied in the fedorapeople build.

Given the current behavior I'm seeing, I'm leaning towards possibility 2 now. Unfortunately I already rebooted the box to get the network back; I should have thought to try that option first. I tried a few other things, but not that one.

And good reminder on the log level; I realized after the fact that when I completely removed the driver and reinstalled the 141 version, the log level went back to the default.

Also, regarding stable/latest: how can I tell from this repo which version is stable and which is latest? The location you recommend in the readme for getting the binaries links 141 as stable and 164 as latest.

https://fedorapeople.org/groups/virt/virtio-win/direct-downloads/

Stable -> https://fedorapeople.org/groups/virt/virtio-win/direct-downloads/archive-virtio/virtio-win-0.1.141-1/
Latest -> https://fedorapeople.org/groups/virt/virtio-win/direct-downloads/archive-virtio/virtio-win-0.1.164-2/

ybendito commented 5 years ago

@mle-ii If 141 gives you ETW logs (I don't remember whether it does), no problem, it should be OK. 164 should also be OK, and it definitely does ETW. For example, here is one build that I personally like: https://fedorapeople.org/groups/virt/virtio-win/repo/srpms/virtio-win-0.1.160-1.src.rpm

mle-ii commented 5 years ago

After looking further, it turns out we're running an older version of bhyve (bhyve is the hypervisor we're trying to move to). I also found this issue, fixed in a newer version, which might very well be related:
https://smartos.org/bugview/OS-7613
https://github.com/joyent/illumos-joyent/commit/e393062f0aebf8081aed83fd67670d9094d2a2a3#diff-31a2eb9b40daf744ed2a78f6cd933005

YanVugenfirer commented 5 years ago

@mle-ii Sorry for inquiring, but we've never seen the Windows virtio drivers used on the bhyve hypervisor before - can you tell us about your use case?

Do you see issues with other drivers/devices? Which virtio devices do you use?

Thanks, Yan.

mle-ii commented 5 years ago

No need to apologize; you've done all the great work to even make this driver available to us. :)

I'm not up on all the use cases, as I'm not that familiar with our setup in this regard, but I'll gather more feedback from those who can provide context.

But I think the biggest reason is the much-improved performance with bhyve in our environment. We noted a gain in network perf (it supports faster ethernet speeds) and a pretty good increase in CPU perf. Disk didn't improve as much as we would have liked, and on older hardware it was significantly slower.

We haven't noticed issues with other drivers/devices beyond what I mentioned about disk on older hardware. As for which virtio drivers, I believe it's only disk and network, but I'd have to check to be sure.

I'll follow up on whether this newer update solves the problem. It seems highly likely; I'm not familiar with the underlying code of either the drivers or bhyve, but the discussion in that bug makes me think it's related.

YanVugenfirer commented 5 years ago

Thanks for sharing!

mle-ii commented 5 years ago

Update: we upgraded bhyve and its components to a newer version that includes the network fix I pointed out. The first night running with the 164 drivers found here, it behaved the same way; the network failed at around 3:37 AM yesterday morning. https://fedorapeople.org/groups/virt/virtio-win/direct-downloads/archive-virtio/virtio-win-0.1.164-2/ The driver was still enabled, but the network was not working. I tried disabling it, which hung the driver control UI; I waited a very long time but it never returned, so I force-closed the UI, after which it was unresponsive. I then tried to reboot the VM; it hung at restarting for a long time and eventually blue-screened with stop code DRIVER_POWER_STATE_FAILURE. After that it finally rebooted successfully.

We decided to downgrade to a driver version that works on Windows 2012 R2 and 2016, since anything newer didn't work there on qemu either. We downgraded to this version: https://fedorapeople.org/groups/virt/virtio-win/direct-downloads/archive-virtio/virtio-win-0.1.141-1/

It finally made it through the night and the network didn't hang.

So the bhyve update appears to have fixed one problem, but there still seems to be an issue for us with drivers newer than 141; the problem we had with 149 on Windows 2012 R2 and 2016 under qemu looks similar to what we're seeing with 164.

Also, is there a way to tell whether the GUID was changed by the folks who built the package we downloaded above? I ask because it seems like no data is collected; even with no failure we should get some data from logman, correct?

For example, this is part of that batch file's output:

Processing completed Buffers: 1, Events: 2, EventsLost: 0 :: Format Errors: 0, Unknowns: 0

And the netkvm.log file was empty.

mle-ii commented 5 years ago

An additional note: we also ran into the same issue with the 100.76.104.16000 drivers on Windows 2016 + qemu and had to revert those to the 141 drivers.

mle-ii commented 5 years ago

I probably won't find anything, but which SHAs went into the changes after the 141 build? https://fedorapeople.org/groups/virt/virtio-win/CHANGELOG

So perhaps just the differences between the 141 and 149 releases? It'll probably be a lot, as that spans almost 7 months.

mle-ii commented 5 years ago

I think this list gets us close to the changes between 141 and 149: https://github.com/virtio-win/kvm-guest-drivers-windows/compare/1680088a478b6fdfa947e6c12bc2db4020e064e6...2fcc4b40d4f595ae9931b1a6d9dd82bba79e78a4

I have no idea whether we're hitting a hang, but the changes here, and the hang that was fixed, look suspicious given the somewhat random nature of our failures and the fact that they require load. https://github.com/virtio-win/kvm-guest-drivers-windows/pull/197 https://github.com/virtio-win/kvm-guest-drivers-windows/pull/201 https://github.com/virtio-win/kvm-guest-drivers-windows/pull/202
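
Equivalently, from a clone of this repo, something like the following narrows the list to NetKVM-only changes (the two SHAs are the endpoints of the compare link above):

    rem List only the NetKVM commits between the two build SHAs.
    git log --oneline 1680088a478b6fdfa947e6c12bc2db4020e064e6..2fcc4b40d4f595ae9931b1a6d9dd82bba79e78a4 -- NetKVM/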

mle-ii commented 5 years ago

And... after going through the changes, I found out why I couldn't get the tracing tool to work: that change isn't in 141, but it is in 149 and later. I'll try to get it running again on a box that I can allow to fail.

Apparently I didn't read the more generic tracing doc carefully enough; I see now that I could have figured out my TraceView problem earlier with this information:
https://github.com/virtio-win/kvm-guest-drivers-windows/blob/master/Documentation/Tracing.md#obtaining-the-providers-control-guid-enabled-flags-and-level
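
As a hedged aside for anyone else hitting this: once a session is running, you can sanity-check that events are actually flowing before waiting overnight (netkvm-trace is the placeholder session name from the earlier sketch; the statistics logman prints should move when you disable/enable the adapter):

    rem Query the live session's status and buffer statistics; numbers that
    rem never move suggest the control GUID doesn't match the installed driver.
    logman query netkvm-trace -ets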

YanVugenfirer commented 5 years ago

Hi,

First, I'm missing something in your description. In your first comment you wrote that you reverted the driver to version 141, but still "The second night the same thing, network just stopped." So I want to be sure you had an issue with 141 as well. If you did, looking at the commits might be the wrong direction.

Regarding the symptom of the hung UI in Device Manager when you tried to disable the driver - it means the driver failed to unload (you could confirm this with traces as well). In 99% of cases it means we have packets submitted to the host on the transmit path that were never completed.

So first of all, if possible, please upload a dump file (kernel dump) so we can analyze it and check the internal driver state. To create a crash dump, refer to https://www.slideshare.net/YanVugenfirer/windows-guestdebugging-kvmforum2012?qid=86d63495-1094-46a9-a8e1-9e76d14d2ea2&v=&b=&from_search=4 starting from pages 31-37 (configure crash dump generation on the guest and issue the "NMI" command in the QEMU monitor). Also, can you describe the parameters and features of the virtio-net device you are using? The most interesting are multi-queue, published indices, and vhost (or the similar mechanism in the bhyve hypervisor, if one exists). Can you also post a screenshot of the Resources tab in Device Manager (the interrupts are the interesting part)?
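
A hedged sketch of that crash-dump setup (the registry path is the standard Windows CrashControl key; the bhyvectl command is an assumption on my part, so check your bhyve version's documentation):

    rem On the guest: request a kernel (not full) memory dump; 2 = kernel dump.
    rem A reboot is required before the setting takes effect.
    reg add HKLM\SYSTEM\CurrentControlSet\Control\CrashControl /v CrashDumpEnabled /t REG_DWORD /d 2 /f

    rem On a QEMU host, the NMI is then issued from the monitor:
    rem   (qemu) nmi
    rem On a bhyve host the equivalent appears to be (assumption):
    rem   bhyvectl --vm=<vmname> --inject-nmi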

Now let's go over possible reasons for the hang:

  1. A packet is submitted to the host, and the hypervisor doesn't complete it (does not issue the completion interrupt to the guest).

  2. There is a bug in the published-indices mechanism. This is an optimization to reduce the number of interrupts and VM exits.

  3. The guest receives the completion interrupt, but a race condition in handling it leaves one of the submitted buffers unhandled.

  4. If legacy interrupts are used, bugs in interrupt handling by other devices.

What we can check now (a QEMU-style sketch of these toggles follows the list):

  1. Disable multiqueue on the host
  2. Disable vhost on the host
  3. Disable RSS on the guest (through Device Manager on the guest)
  4. Disable published indices on the host or on the guest (through Device Manager on the guest)
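
On a QEMU-based host, the host-side toggles map roughly to the following device options (a sketch for orientation only; bhyve's syntax differs, and the netdev/interface names are placeholders):

    # Multiqueue off, event index ("published indices") off, vhost off.
    qemu-system-x86_64 \
        -netdev tap,id=net0,ifname=tap0,vhost=off \
        -device virtio-net-pci,netdev=net0,mq=off,event_idx=off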

Best regards, Yan.

mle-ii commented 5 years ago

I'll answer your questions in a bit, but a quick note: I meant that we went back to 141 and it appeared to run through the night.

But last night, after leaving it running longer, we hit it again: the network failed at around 8:21 PM our time with the updated bhyve version and the 141 version of the driver. So either something in bhyve is still causing this, or we're hitting a different driver issue here. We have a case open with Joyent as well.

Again, I'll try to answer what I can from your questions, but right now I'm going to upgrade back to 164, since it has the tracing you built into versions 149 and later, and try to get more data from the driver.

mle-ii commented 5 years ago

> First, I'm missing something in your description. In your first comment you wrote that you reverted the driver to version 141, but still "The second night the same thing, network just stopped." So I want to be sure you had an issue with 141 as well. If you did, looking at the commits might be the wrong direction.

The other night it was 141 and it ran fine all night; last night it failed with those same 141 drivers.

> Regarding the symptom of the hung UI in Device Manager when you tried to disable the driver - it means the driver failed to unload (you could confirm this with traces as well). In 99% of cases it means we have packets submitted to the host on the transmit path that were never completed.

> So first of all, if possible, please upload a dump file (kernel dump) so we can analyze it and check the internal driver state. To create a crash dump, refer to https://www.slideshare.net/YanVugenfirer/windows-guestdebugging-kvmforum2012?qid=86d63495-1094-46a9-a8e1-9e76d14d2ea2&v=&b=&from_search=4 starting from pages 31-37 (configure crash dump generation on the guest and issue the "NMI" command in the QEMU monitor).

I will try to get that information for you. The previous dump was 21 GB; I got a bit of info from it but had to delete it to reclaim the space, sorry. If/when it occurs again, I'll get it to you.

This was the only information I was able to get from it.

In NDIS-20190508-0935.dmp the instruction at nt!IopLiveDumpEndMirroringCallback+0x7e caused a kernel BugCheck 15E with the following details:

BugCheck Code 0x15E
Arg1: 37
Arg2: 2
Arg3: 18446675380079110232
Arg4: 8

> Also, can you describe the parameters and features of the virtio-net device you are using? The most interesting are multi-queue, published indices, and vhost (or the similar mechanism in the bhyve hypervisor, if one exists).

I'm fairly certain we're just using the defaults, though I may not understand enough to answer your question. I've never debugged a kernel driver before, or if I have, I've since forgotten it.

> Can you also post a screenshot of the Resources tab in Device Manager (the interrupts are the interesting part)?

Should be defaults here as well. [screenshots of the Resources tab attached]

mle-ii commented 5 years ago

That is the 141 driver's information. I'm going to go through the slides you shared to see if I can get more data before I upgrade the drivers to 164, reboot, and set up the tracing.

mle-ii commented 5 years ago

In your slides you mention setting the i8042prt and kbdhid parameters to crash on demand. Is that correct? I don't set this for netkvm instead? Never mind, I read up on what this does. And this is only for getting the debug info for the hang in Device Manager, correct?

Also, turning off the "Automatically restart" option is only for this hang as well, right? Having it on seems like it would get me stuck later when it blue-screens on boot, as it has in the past when I was trying to use Device Manager in this failed-network state.

mle-ii commented 5 years ago

As an update: I went to set NetKVM to disabled again in the UI and it hung. Unfortunately I couldn't do the crash-on-demand, as it requires a reboot for those registry values to take effect (you might want to add that note to your slides). So I was unable to capture a kernel dump while it was hung.

I restarted the computer; it sat at restarting for a long time and then blue-screened with this: [screenshot attached]

mle-ii commented 5 years ago

I set auto-restart back on, as I didn't have access to restart the VM from the blue-screen state, since it occurred on reboot. I do have a kernel dump, though it's 16 GB. :/

It had similar data to the last bug check.

In NDIS-20190516-1019.dmp the instruction at nt!IopLiveDumpEndMirroringCallback+0x7e caused a kernel BugCheck 15E with the following details:

BugCheck Code 0x15E
Arg1: 37
Arg2: 2
Arg3: 18446675964652919640
Arg4: 8

I'm going to update the driver now to 164 and set up the tracing again.

Also, I set the LogLevel to 4, as I wasn't certain which log level you needed.

mle-ii commented 5 years ago

The driver is updated, and I verified that the logger is actually working. I'll likely stop and restart it several times today, as I'm unsure how much data this might generate and I don't want to fill up the disk.
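
(For the disk-space worry: logman can log to a circular buffer capped at a fixed size, so an overnight run can't fill the disk. A hedged variant of the earlier sketch, using the same GUID variable:)

    rem -f bincirc overwrites the oldest events once the file reaches -max MB.
    logman create trace netkvm-trace -p %GUID% 0xFFFFFFFF 0xFF -o netkvm.etl -f bincirc -max 256 -ets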

mle-ii commented 5 years ago

Looking through the data now, but the network failed on the 164 drivers again yesterday at around 6:30 PM. The trace appears to have captured data, and I'm copying it to a computer where I can go through it in more detail.

I went into the UI and this time just changed another NetKVM property, which shouldn't have been as destructive as disabling the driver. It hung. I tried to force a kernel dump, but the keyboard commands didn't work for some reason. :( The kernel dump has similar data to the previous ones.

In NDIS-20190517-0910.dmp the instruction at nt!IopLiveDumpEndMirroringCallback+0x7e caused a kernel BugCheck 15E with the following details:

BugCheck Code 0x15E
Arg1: 37
Arg2: 2
Arg3: 18446706191387152104
Arg4: 8

YanVugenfirer commented 5 years ago

Quick reply on the last comment - when you change parameters in Device Manager, Windows reloads the driver; that's why it hung. Can you generate MSI from the host side?

YanVugenfirer commented 5 years ago

Sorry, I meant to generate NMI.

mle-ii commented 5 years ago

I'm unsure; I'll need to check with the Ops folks who have access, as I don't have access to the host.

What might I be looking for in the .etl or netkvm.log file? I can see a change in the trace output around the time of the failure, but nothing that looks like an error state. The change is a noticeable drop in the number of frames per second, yet I don't see anything in the log resembling an error code.

Also, here is the call stack from the kernel dump I was able to get, opened in WinDbg:

 # RetAddr           : Args to Child                                                           : Call Site
00 fffff806`4317cf62 : ffffffff`ffffffff 00000000`00000011 00000000`00000000 00000000`00000011 : nt!IopLiveDumpEndMirroringCallback+0x7e
01 fffff806`43188ec7 : 00000000`00000000 fffff806`00000000 ffffdd8b`00000001 00000000`00000001 : nt!MmDuplicateMemory+0x26e
02 fffff806`4343060d : ffffdd8b`d4d5cd40 ffffdd8b`d4d5cd40 fffff98a`878e2248 fffff98a`878e2248 : nt!IopLiveDumpCaptureMemoryPages+0x7f
03 fffff806`43423ed2 : 00000000`00000000 ffffb204`d9644650 ffffb204`eff407f0 ffffb204`d9644650 : nt!IoCaptureLiveDump+0x289
04 fffff806`434245f0 : ffffffff`80001774 00000000`00000000 00000000`00000000 00000000`0000015e : nt!DbgkpWerCaptureLiveFullDump+0x13a
05 fffff806`43423d21 : 00000000`00000002 00000000`00000000 fffff800`6b8f6270 00000000`0000003f : nt!DbgkpWerProcessPolicyResult+0x30
06 fffff800`6b8bfbd1 : ffffdd8b`cecdd1a0 00000000`00000024 00000000`00000002 ffffdd8b`d5a14ee8 : nt!DbgkWerCaptureLiveKernelDump+0x1a1
07 fffff800`6b974480 : 00000000`00000000 fffff800`6b931d02 fffff98a`878e2380 00000000`00000000 : NDIS!ndisMLiveBugCheck+0x45
08 fffff800`6b97431d : ffffdd8b`fdc83040 ffffdd8b`d5a14e20 fffff800`6b9108d0 00000000`00000000 : NDIS!ndisReportTimeoutWaitingForExternalDriver+0x108
09 fffff800`6b9740f3 : 00000000`00000007 ffffdd8b`d5a14ee8 fffff800`6b974740 00000000`00000000 : NDIS!ndisFindSomeoneToBlame+0x125
0a fffff800`6b974749 : ffffdd8b`fdc83040 ffffdd8b`c9a84cd0 fffff800`6bfcf1f8 ffffdd8b`c9a50ab0 : NDIS!NdisWatchdogState::ReportTimeout+0x9f
0b fffff806`42c85afa : ffffdd8b`c9a84cd0 ffffdd8b`fdc83040 fffff800`6bfcf100 ffffdd8b`00000000 : NDIS!ndisWatchdogTimeoutWorkerRoutine+0x9
0c fffff806`42c4ea45 : ffffdd8b`fdc83040 ffffdd8b`c9a4e040 ffffdd8b`fdc83040 00000000`00000000 : nt!ExpWorkerThread+0x16a
0d fffff806`42dcbb8c : ffff9c00`abf00180 ffffdd8b`fdc83040 fffff806`42c4e9f0 00000000`00000000 : nt!PspSystemThreadStartup+0x55
0e 00000000`00000000 : fffff98a`878e3000 fffff98a`878dc000 00000000`00000000 00000000`00000000 : nt!KiStartSystemThread+0x1c

mle-ii commented 5 years ago

The bug check info from WinDbg appears to match the one I pasted earlier.

BUGCODE_NDIS_DRIVER_LIVE_DUMP (15e)
The operating system recovered from an error in a networking driver.
NDIS has detected and recovered from a serious problem in another network
driver. Although the system was not halted, this problem may later cause
connectivity problems or a fatal bugcheck.
Arguments:
Arg1: 0000000000000025, NDIS_BUGCHECK_WATCHDOG
    An attempt to manage the network stack has taken too
    long. When NDIS calls out into other drivers, NDIS
    starts a watchdog timer to ensure the call completes
    promptly. If the call takes too long, NDIS injects a
    bugcheck.
    This can be caused by a simple deadlock -- look with
    "!stacks 2 ndis!" or similar to see if any threads
    look suspicious.  Pay special attention to the
    PrimaryThread from the NDIS_WATCHDOG_TRIAGE_BLOCK.
    This can be caused by lost NBLs, in which case
    !ndiskd.pendingnbls may help. Check for OIDs that are
    stuck using !ndiskd.oid.
Arg2: 0000000000000002, NDIS_BUGCHECK_WATCHDOG_PROTOCOL_NETPNPEVENT
    There was a timeout while delivering a
    NET_PNP_EVENT_NOTIFICATION to a protocol driver.
Arg3: ffffdd8bd5a14ee8, Cast to ndis!_NDIS_WATCHDOG_TRIAGE_BLOCK. Interesting fields:
    * StartTime shows what time the operation started,
    in 100ns units, as returned by KeQueryInterruptTime.
    * TimeoutMilliseconds shows how long NDIS waited, at a
    minimum, before triggering this bugcheck.
    Measured in milliseconds.
    * TargetObject is a handle to the protocol, filter,
    or miniport that NDIS is waiting on.  Use with
    !ndiskd.protocol, !ndiskd.filter, or !ndiskd.miniport.
    * PrimaryThread is the thread on which NDIS initiated
    the operation.  Usually this is the first place to
    look, although the thread may have gone elsewhere
    if the operation is being handled asynchronously.
Arg4: 0000000000000008, Net PnP event is NetEventPause
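
Following the hints in that description, these are the WinDbg commands I plan to run against the dump (the triage-block address is the Arg3 value above):

    $$ Summarize the bugcheck and let the debugger auto-triage it.
    !analyze -v
    $$ Look for suspicious threads stuck in NDIS.
    !stacks 2 ndis!
    $$ Dump the watchdog triage block from Arg3 (StartTime, TargetObject, PrimaryThread).
    dt ndis!_NDIS_WATCHDOG_TRIAGE_BLOCK ffffdd8bd5a14ee8
    $$ Check for lost NBLs and stuck OIDs.
    !ndiskd.pendingnbls
    !ndiskd.oid
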
YanVugenfirer commented 5 years ago

> ~In your slides you mention setting the i8042prt and kbdhid parameters to crash on demand. Is that correct? I don't set this for netkvm instead?~ Never mind, I read up on what this does. And this is only for getting the debug info for the hang in Device Manager, correct?

This is for triggering an NMI using the keyboard.
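
For reference, a hedged sketch of the registry setup from the slides (i8042prt covers PS/2 keyboards, kbdhid USB keyboards; a reboot is required, as you found, and the trigger is holding the right CTRL key and pressing SCROLL LOCK twice):

    rem Enable keyboard-initiated crash (right CTRL + SCROLL LOCK x2).
    reg add HKLM\SYSTEM\CurrentControlSet\Services\i8042prt\Parameters /v CrashOnCtrlScroll /t REG_DWORD /d 1 /f
    reg add HKLM\SYSTEM\CurrentControlSet\Services\kbdhid\Parameters /v CrashOnCtrlScroll /t REG_DWORD /d 1 /f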

> Also, turning off the "Automatically restart" option is only for this hang as well, right? Having it on seems like it would get me stuck later when it blue-screens on boot, as it has in the past when I was trying to use Device Manager in this failed-network state.

No need to turn off "Automatically restart" in your case. You know the BSOD happened, because you are the one generating it.

YanVugenfirer commented 5 years ago

Regarding the dump size: we need a "kernel dump"; there's no need for the full dump, which can run to several gigabytes.

mle-ii commented 5 years ago

Thanks. I had set the UI to do kernel dumps per the slides you shared earlier, and it still seems to have produced a full memory dump. :/ Great slides, by the way; thank you!

[screenshot of the crash dump settings attached]

What might I be looking for in the .etl or netkvm.log file? I can see a change in the trace output around the time of the failure, but nothing that looks like an error state; the change is a noticeable drop in frames per second, with nothing resembling an error code. I set Logging.Level to 4 in the driver; should I set it higher?

Also, we have folks from Joyent taking a look at bhyve to see if they can provide more information from that side of things. I'll update this issue if we find anything there that's useful for you.

mle-ii commented 5 years ago

They did find an issue, which they're looking to provide a fix for, though they're not yet certain how it got into that state. I'll provide more details when I find out more.

mle-ii commented 5 years ago

Here is the issue they opened: https://smartos.org/bugview/OS-7804

I'm still thinking there's a driver issue as well, since I'd guess the driver UI shouldn't hang no matter what the underlying issue is. That, and we saw a similar issue with drivers newer than 141 on qemu.

YanVugenfirer commented 5 years ago

Actually, the bug description explains why the UI is hanging.

Here is what's happening: the driver submits a packet that is divided into more descriptors than the host can handle. The host doesn't handle the packet correctly and doesn't send the completion interrupt (and, of course, doesn't advance the indices of the virt-ring).

When you disable the driver or change its parameters (in that case Device Manager disables/enables the driver), the driver waits for the transmitted packets to complete, because the assumption is that this memory is still on the host "side"; we cannot return it to the OS for reuse.

Best regards, Yan.

mle-ii commented 5 years ago

We're still up and running, so I'll close this issue. I might open a separate one for QEMU + 2012 R2/2016 + newer drivers if we keep reproducing a similar issue there and it turns out not to be the same problem.

YanVugenfirer commented 5 years ago

Great news!
