tianocore / edk2

EDK II
https://github.com/tianocore/tianocore.github.io/wiki/EDK-II

random bsod of windows10 #94

Closed webczat closed 8 years ago

webczat commented 8 years ago

Hello.

When I install Windows 10 on QEMU with KVM enabled and the OVMF firmware on my dedicated server, the installed guest experiences many seemingly random blue screens. Depending on the current device configuration of the machine, the blue screen can be provoked by installing the network driver (only if the link is connected) or by connecting the network link after the driver is installed. Sometimes the blue screens happen randomly during Windows usage, sometimes at regular intervals, and sometimes it is enough to wait a minute after oob to see one. For a given configuration they always happen at nearly the same point; once the machine is reconfigured, the BSODs may happen in a different way. There is no good configuration. I tried using the virtio-net driver and changing the virtio net card to e1000, tried virtio blk and virtio scsi for hard drives, added or removed pcie ports (yes, it made a difference for some reason), and tried changing the number of cpus or the cpu type, all without success. The only thing I didn't try is changing the machine type from q35 to something else. So, a Windows 10 guest just does not work in my environment when using OVMF. None of these things seem to happen at all when using SeaBIOS, and I have tested that with a few configurations. This is every piece of information I can give about my setup and host/guest hardware/software:

host processor: Intel(R) Core(TM) i5 CPU 750 @ 2.67GHz
memory: 16 gigabytes of ram
linux kernel version: 4.5.4
qemu version: 2.5.2
edk2 commit: 3ab7066e8d8ae43d9cdee76600b90918f8bee5d9
ovmf build command: build -DSECURE_BOOT_ENABLE -DNETWORK_IP6_ENABLE -DHTTP_BOOT_ENABLE
guest windows version: 10 professional n, build 1511

This is one of the worst configurations of the vm that actually causes the bsod almost immediately after showing oob:

#!/bin/sh

umask 0077
cd ~/vms/vm
/bin/qemu-system-x86_64 \
  -name vm -D vm.log \
  -spice tls-port=25000,x509-cacert-file=../ca/cacert.pem,x509-cert-file=cert.pem,x509-key-file=key.pem,x509-dh-key-file=dh.pem,password=mypass \
  -machine q35,accel=kvm,iommu=on -nodefaults -cpu host \
  -chardev stdio,id=interface,mux=on,signal=off \
  -chardev socket,id=control,server,nowait,path=control.sock \
  -mon chardev=interface,mode=readline -mon chardev=control,mode=control \
  -smp cores=4 -m size=2048M \
  -device ib700 -device pvpanic -device isa-serial,chardev=interface \
  -device ioh3420,port=1,chassis=1,id=pcieport1 \
  -device ioh3420,id=pcieport2,port=2,chassis=2 \
  -device virtio-balloon-pci,disable-modern=false \
  -device virtio-rng-pci,disable-modern=false \
  -device virtio-serial-pci,disable-modern=false \
  -chardev socket,id=agent,server,nowait,path=agent.sock \
  -device virtserialport,chardev=agent,name=org.qemu.guest_agent.0 \
  -chardev spicevmc,name=vdagent,id=vdagent \
  -device virtserialport,chardev=vdagent,name=com.redhat.spice.0 \
  -device qxl-vga \
  -device ich9-intel-hda -device hda-micro \
  -netdev bridge,br=vmnet,id=net \
  -device virtio-net-pci,disable-modern=false,netdev=net \
  -device virtio-scsi-pci,disable-modern=false \
  -device nec-usb-xhci -device nec-usb-xhci \
  -chardev spicevmc,name=usbredir,id=usbredir1 \
  -device usb-redir,chardev=usbredir1 \
  -chardev spicevmc,name=usbredir,id=usbredir2 \
  -device usb-redir,chardev=usbredir2 \
  -device usb-tablet \
  -drive id=cd1,media=cdrom,if=none \
  -device ide-cd,drive=cd1,bus=ide.0 \
  -drive id=cd2,if=none,media=cdrom \
  -device ide-cd,drive=cd2,bus=ide.1 \
  -drive id=disk,if=none,media=disk,format=qcow2,file=disk.img \
  -device scsi-hd,drive=disk \
  -drive if=pflash,media=disk,format=raw,file=firmware.bin \
  -drive if=pflash,media=disk,format=raw,file=vars.bin \
  -watchdog-action reset -rtc base=localtime \
  -boot menu=on \
  $@

lersek commented 8 years ago

I also have a windows 10 guest (installed from en_windows_10_enterprise_2015_ltsb_n_x64_dvd_6848316.iso), and I have not experienced such problems.

Your build flags don't include -D SMM_REQUIRE which is both good and bad. Good because SMM is always much harder to debug, and bad because things should really just work.

Since it reproduces easily on your side, and I've never encountered it on my side, can you please (a) capture a screenshot of the BSOD, (b) perhaps even try a debug/checked build (for those the BSOD tends to give more info), (c) capture a KVM trace?

Also, what is "oob"? Thanks.

lersek commented 8 years ago

BTW, have you tried the ignore_msrs=1 parameter with the kvm module?

lersek commented 8 years ago

(You can verify the current setting with cat /sys/module/kvm/parameters/ignore_msrs.)
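
For reference, a minimal sketch of doing that check from a script. The function name `kvm_ignore_msrs_state` and its optional path argument are hypothetical, added only so the check can be pointed at a different file. The sysfs path is the one quoted above, and the sketch assumes the parameter reads back as `Y`/`N` (it also accepts `1`).

```shell
#!/bin/sh
# Report whether KVM ignores accesses to MSRs it does not implement.
# $1 (optional, hypothetical): parameter file to read instead of the default.
kvm_ignore_msrs_state() {
  param=${1:-/sys/module/kvm/parameters/ignore_msrs}
  if [ ! -r "$param" ]; then
    echo unavailable   # kvm module not loaded, or file not readable
    return 1
  fi
  case $(cat "$param") in
    Y|1) echo enabled ;;
    *)   echo disabled ;;
  esac
}
```

Enabling the setting is typically done as root with `echo 1 > /sys/module/kvm/parameters/ignore_msrs`, or persistently with an `options kvm ignore_msrs=1` line in a file under `/etc/modprobe.d/`.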

webczat commented 8 years ago

oob = out of box experience, this last config screen just before login. Those bsods are... somehow random although it was easy to reproduce them for me. Not sure why they happen, really.

No, I am not running with ignore_msrs set. I am using a Polish system and am not sure if you will be able to read the bsod in this case, although the real body is still English, hmmm.

lersek commented 8 years ago

Please try with ignore_msrs set; Windows 10 has been known (especially development / beta versions) to massage a huge variety of MSRs, some (most?) of which are not implemented by KVM.

IIRC, ignore_msrs will cause KVM to ignore writes to MSRs unknown to it, and return 0 when MSRs unknown to it are read, as opposed to injecting a fault at once. Of course, a blanket 0 return value is not guaranteed to be safe either, but it works surprisingly often.

Please see the following references:

webczat commented 8 years ago

That msr usage is operating system, not firmware specific right? Remember that I said that everything works very well on seabios, and so windows probably does not have any reason to read some msrs only when run in uefi mode. In seabios I did not experience any bluescreens, only when run with ovmf. I even tried rebuilding ovmf in case it was somehow damaged, no luck.

lersek commented 8 years ago

> That msr usage is operating system, not firmware specific right?

That's not guaranteed at all, to my knowledge. We can't say for sure that Windows 10 will not base various hardware decisions on the firmware's characteristics. For example, in the SVVP test suite, a bunch of tests behave differently, and require different things from OS-level drivers, when they are executed in a VM with SeaBIOS vs. in a VM with OVMF. Here's an example.

> Remember that I said that everything works very well on seabios,

Yes, I remember. People are very quick to point out "it works in SeaBIOS", but they never list their reasons why they don't use SeaBIOS for their workload then.

You appear to run your guest on a dedicated server, and not do VFIO device assignment. So what is your use case for insisting on OVMF? (There are other valid uses for OVMF, I'm just curious if you actually have one.)

> and so windows probably does not have any reason to read some msrs only when run in uefi mode.

That's exactly what cannot be guaranteed. The item on answers.microsoft.com that I referenced above is about OVMF only too.

Here's another example. Windows 8 and Windows 10 installer ISOs exist for the following computer types:

Meaning, if you grab a 32-bit Windows 8 or 10 Client installer ISO, it is expected to boot the installer in both SeaBIOS and 32-bit OVMF guests. Except the latter case doesn't work at all, and months of on-and-off correspondence with a Microsoft developer (who has elected to discuss the issue with me) have not produced results. Microsoft doesn't openly publish everything that they require from a UEFI platform, and SeaBIOS vs. OVMF may very well influence the behavior of the runtime OS.

Look, we can go back and forth on this. I've never encountered the issue you describe (I booted my Win10 VM yesterday to check again -- no issues). Also, I don't have the capacity to do exploratory testing, to see what "might" break, in the myriads of QEMU command line and Windows combinations. If you have one specific configuration that you would like to work, and it doesn't, you'll have to put in the effort to support our debugging and development.

In this item I have thus far recommended ignore_msrs, asked for a screenshot, and asked for a KVM trace. Instead of those, I'm seeing arguments why I'm supposedly wrong. Do you think that's productive? Or should I just ignore issue reports in this upstream tracker, like most other TianoCore maintainers do?

lersek commented 8 years ago

Also, for any OVMF-on-KVM issue report, the OVMF debug log and the host dmesg should be included as well.

(BTW, please don't paste all these data into the issue tracker -- please host them somewhere external, and only reference them here. The GitHub issue tracker sucks extremely because it doesn't allow for attachments, unfortunately. Once we move to Bugzilla, attachments should become possible.)

webczat commented 8 years ago

My server is not really a production thing; it is partly used for hosting a few things and partly a... playground. I am using my server because my laptop is too weak to really hold a vm, especially if I want to use other tools at the same time, so I want such a vm for things I cannot use linux for, like windows development/testing. Why ovmf? Well, I do not have any really good reason. I was just trying to understand why things like ignore_msrs can matter with different firmwares, nothing else. As for things like screenshots and a kvm trace, I have not yet had enough time to install windows again, but I should be able to do it shortly. The only thing I am wondering about is whether I will be able to get a screenshot in time, but we'll see.

webczat commented 8 years ago

Ah, about logs: dmesg does not contain anything specific. If I am wrong I will correct myself when getting the screenshot. As for the ovmf debug log, I never know how to generate it, only that I use isa debugcon.

lersek commented 8 years ago

Not sure if it helps, but a screenshot can be taken from the QEMU monitor as well. The HMP command is called screendump.

Second, your QEMU command line seems to have a bit of cruft. What do you need:

for?

Can you reproduce the issue with a minimal command line?

webczat commented 8 years ago

the problem is that I could. Or rather I could reproduce a different form of this issue without root ports. As I said, it is partially a playground, so I do not have a specific reason for adding some things; for example, iommu is added just because I wanted to see if windows can use it for anything, and pcie root ports and xhci are there for increased hotplug capability / windows hotplug testing maybe. I can try removing some things further, like iommu etc, sure, but first I will test the configuration I have pasted here for reference. About screendump, the problem is the timing. I cannot disable automatic restart of blue screen if the blue screen happens on the last setup screens. Unless checked builds are known to have it disabled by default, in which case it will help me a lot.

lersek commented 8 years ago

> Or rather I could reproduce a different form of this issue without root ports

This kind of experimentation can actually be valuable, but only if you isolate the test cases. What command line produces exactly what kind of behavior, what cmdline differences cause what differences in behavior, and so on.

> About screendump, the problem is the timing. I cannot disable automatic restart of blue screen if the blue screen happens on the last setup screens.

Can you perhaps write a shell script that connects to the monitor and dumps the screen into dump_$((COUNTER++)).ppm every five seconds? If you convert the PPM files to JPG and upload them somewhere in a tarball, I could flip through them quickly I guess.

Edit: corrected typo.
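
The periodic screendump idea above could be sketched roughly as follows. This is only a sketch under assumptions: it drives the QMP socket (`control.sock` in the reporter's command line) rather than the HMP monitor, it assumes `socat` is installed on the host, and the function name `qmp_screendump_session` is made up for illustration. The `dir` argument is assumed to be a plain path without characters that need JSON escaping.

```shell
#!/bin/sh
# Emit a QMP session that takes COUNT screendumps, INTERVAL seconds apart.
# QMP requires the qmp_capabilities handshake before any other command.
# $1 = COUNT, $2 = INTERVAL in seconds, $3 = output directory (as seen by QEMU)
qmp_screendump_session() {
  count=$1 interval=$2 dir=$3
  i=0
  printf '{"execute":"qmp_capabilities"}\n'
  while [ "$i" -lt "$count" ]; do
    printf '{"execute":"screendump","arguments":{"filename":"%s/dump_%d.ppm"}}\n' "$dir" "$i"
    sleep "$interval"
    i=$((i + 1))
  done
}

# Usage (one dump every five seconds for ten minutes):
#   qmp_screendump_session 120 5 "$PWD" | socat - UNIX-CONNECT:control.sock
```

The files are written by the QEMU process, so an absolute directory avoids surprises about QEMU's working directory.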

webczat commented 8 years ago

well: if you remove, from the configuration I gave, all root ports, and replace virtio scsi with virtio blk, you will get blue screens on network driver installs. usually, that is the problem, sometimes it is unpredictable. As for screenshots, I will find a way to do that, for example using gnome's screenshotting. Could you please give me instructions or link on how to enable ovmf debugging?

lersek commented 8 years ago

> Could you please give me instructions or link on how to enable ovmf debugging?

Ah, certainly. Sorry for not providing those bits earlier.

There are two parts. The first part is setting the edk2 debug level (it is actually a debug mask). The second part is capturing the debug output.

The first part is optional, but I really like to get verbose messages, for which the DEBUG_VERBOSE bit has to be set in the debug mask. It can only be done at build time, so please apply the following patch, and then rebuild OVMF:

diff --git a/OvmfPkg/OvmfPkgIa32.dsc b/OvmfPkg/OvmfPkgIa32.dsc
index aa1a6193d19c..71567b8f0439 100644
--- a/OvmfPkg/OvmfPkgIa32.dsc
+++ b/OvmfPkg/OvmfPkgIa32.dsc
@@ -397,7 +397,7 @@ [PcdsFixedAtBuild]
   # DEBUG_VERBOSE   0x00400000  // Detailed debug messages that may
   #                             // significantly impact boot performance
   # DEBUG_ERROR     0x80000000  // Error
-  gEfiMdePkgTokenSpaceGuid.PcdDebugPrintErrorLevel|0x8000004F
+  gEfiMdePkgTokenSpaceGuid.PcdDebugPrintErrorLevel|0x8040004F

 !ifdef $(SOURCE_DEBUG_ENABLE)
   gEfiMdePkgTokenSpaceGuid.PcdDebugPropertyMask|0x17
diff --git a/OvmfPkg/OvmfPkgIa32X64.dsc b/OvmfPkg/OvmfPkgIa32X64.dsc
index f4b55c56e012..13693af9d29b 100644
--- a/OvmfPkg/OvmfPkgIa32X64.dsc
+++ b/OvmfPkg/OvmfPkgIa32X64.dsc
@@ -402,7 +402,7 @@ [PcdsFixedAtBuild]
   # DEBUG_VERBOSE   0x00400000  // Detailed debug messages that may
   #                             // significantly impact boot performance
   # DEBUG_ERROR     0x80000000  // Error
-  gEfiMdePkgTokenSpaceGuid.PcdDebugPrintErrorLevel|0x8000004F
+  gEfiMdePkgTokenSpaceGuid.PcdDebugPrintErrorLevel|0x8040004F

 !ifdef $(SOURCE_DEBUG_ENABLE)
   gEfiMdePkgTokenSpaceGuid.PcdDebugPropertyMask|0x17
diff --git a/OvmfPkg/OvmfPkgX64.dsc b/OvmfPkg/OvmfPkgX64.dsc
index 9c1118557012..81d28a1f51eb 100644
--- a/OvmfPkg/OvmfPkgX64.dsc
+++ b/OvmfPkg/OvmfPkgX64.dsc
@@ -402,7 +402,7 @@ [PcdsFixedAtBuild]
   # DEBUG_VERBOSE   0x00400000  // Detailed debug messages that may
   #                             // significantly impact boot performance
   # DEBUG_ERROR     0x80000000  // Error
-  gEfiMdePkgTokenSpaceGuid.PcdDebugPrintErrorLevel|0x8000004F
+  gEfiMdePkgTokenSpaceGuid.PcdDebugPrintErrorLevel|0x8040004F

 !ifdef $(SOURCE_DEBUG_ENABLE)
   gEfiMdePkgTokenSpaceGuid.PcdDebugPropertyMask|0x17
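
The new mask in the patch is simply the old value with the DEBUG_VERBOSE bit OR-ed in. A small sketch of that arithmetic, using only the constants quoted in the patch (the helper name `with_debug_verbose` is hypothetical):

```shell
#!/bin/sh
DEBUG_VERBOSE=0x00400000   # bit value quoted in the .dsc comment

# Print the given debug mask with the DEBUG_VERBOSE bit set.
with_debug_verbose() {
  printf '0x%08X\n' "$(( $1 | $DEBUG_VERBOSE ))"
}

with_debug_verbose 0x8000004F   # prints 0x8040004F
```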

For the second part (it's also documented in OvmfPkg/README), please just add the following two options to the QEMU command line:

-debugcon file:ovmf.debug.log -global isa-debugcon.iobase=0x402

Thanks.

lersek commented 8 years ago

BTW, after logging in to MSDN, and filtering for Windows 10 build 1511 / 64-bit / ISO, I don't see a Professional edition. Can you please give me the exact file name of your installer ISO image?

webczat commented 8 years ago

hey, it is actually the multiple editions cd. my filename is: en_windows_10_multiple_editions_debug-checked_version_1511_x64_dvd_7226317.iso Explanation: I am downloading it. the previous one was the pl one without debug things, but I just deleted it and will test on the above.

lersek commented 8 years ago

I can't find this file in MSDN :(

lersek commented 8 years ago

I cannot reproduce the problem. I downloaded the installer ISO (see it below in the QEMU command line), and installed a Windows 10 N, Pro, build 1511 guest. This is the script I'm using. Relative to yours, it's only been minimally customized.

#!/bin/bash
set -e -u -C
ISO=/mnt/data/isos/iso-windows/en_windows_10_n_multiple_editions_version_1511_updated_apr_2016_x64_dvd_8710592.iso
VIRTIO_WIN_ISO=/usr/share/virtio-win/virtio-win.iso
CODE=/home/virt-images/OVMF_CODE.fd
TMPL=/home/virt-images/OVMF_VARS.fd

if test ! -e issue94.fd; then
  cp $TMPL issue94.fd
fi

if test ! -e issue94.qcow2; then
  qemu-img create \
    -f qcow2 \
    -o compat=1.1 \
    -o cluster_size=65536 \
    -o preallocation=metadata \
    -o lazy_refcounts=on \
    issue94.qcow2 \
    40G
fi

/opt/qemu-installed/bin/qemu-system-x86_64 \
  -nodefaults \
  \
  -machine q35,accel=kvm,iommu=on \
  -m size=2048M \
  -cpu host \
  -smp cores=4 \
  \
  -watchdog-action reset \
  -rtc base=localtime \
  -boot menu=on \
  \
  -device qxl-vga \
  -drive if=pflash,media=disk,format=raw,readonly,file=$CODE \
  -drive if=pflash,media=disk,format=raw,file=issue94.fd \
  \
  -debugcon file:issue94.log \
  -global isa-debugcon.iobase=0x402 \
  \
  -chardev stdio,id=interface,mux=on,signal=off \
  -mon chardev=interface,mode=readline \
  -device isa-serial,chardev=interface \
  \
  -device ib700 \
  -device pvpanic \
  \
  -device ich9-intel-hda \
  -device hda-micro \
  \
  -device nec-usb-xhci \
  -device nec-usb-xhci \
  \
  -device usb-tablet \
  \
  -device ioh3420,id=pcieport1,port=1,chassis=1 \
  -device ioh3420,id=pcieport2,port=2,chassis=2 \
  \
  -device virtio-balloon-pci,disable-modern=false \
  -device virtio-rng-pci,disable-modern=false \
  \
  -device virtio-serial-pci,disable-modern=false \
  \
  -chardev socket,id=agent,server,nowait,path=issue94.agent.sock \
  -device virtserialport,chardev=agent,name=org.qemu.guest_agent.0 \
  \
  -chardev spicevmc,name=vdagent,id=vdagent \
  -device virtserialport,chardev=vdagent,name=com.redhat.spice.0 \
  \
  -chardev spicevmc,name=usbredir,id=usbredir1 \
  -device usb-redir,chardev=usbredir1 \
  \
  -chardev spicevmc,name=usbredir,id=usbredir2 \
  -device usb-redir,chardev=usbredir2 \
  \
  -device virtio-scsi-pci,disable-modern=false \
  -drive id=disk,if=none,media=disk,format=qcow2,file=issue94.qcow2 \
  -device scsi-hd,drive=disk,bootindex=0 \
  \
  -drive id=cd1,media=cdrom,if=none,format=raw,readonly,file=$ISO \
  -device ide-cd,drive=cd1,bus=ide.0,bootindex=1 \
  \
  -drive id=cd2,media=cdrom,if=none,format=raw,readonly,file=$VIRTIO_WIN_ISO \
  -device ide-cd,drive=cd2,bus=ide.1 \
  \
  -netdev user,id=net,hostfwd=tcp:127.0.0.1:2294-:22 \
  -device virtio-net-pci,disable-modern=false,netdev=net \

I installed all of the virtio drivers (+QXL +PVPANIC) -- no yellow triangles in Device Manager. This includes virtio-net.

During installation, when I was asked "Get going fast" or some such, I clicked the "Use Express Settings" button.

As far as I can tell, everything works flawlessly and stably. For applications, I installed Firefox and Cygwin in the guest. I even tested S3 suspend / resume (three finger salute --> power icon --> Sleep --> wait 15 seconds --> press Enter). No BSOD, no crashes.

Here's my hardware and software stack:

Concerning the host kernel, ATM I don't think I'll again experiment with upstream kernels. That's kind of where I draw the line for now. If you have a KVM trace covering the BSOD, that might help (we could ask Paolo to take a look, perhaps).

Regarding the guest drivers (VirtIO and QXL WDDM), you might not be using the same builds as listed above. If you have installed the latest upstream binaries instead, I'm willing to try them. (Although, if they do evoke a crash, they will also get the blame immediately, rather than OVMF...)

lersek commented 8 years ago

For the record, I didn't install guest agents (QEMU guest agent or Spice guest agent).

webczat commented 8 years ago

Those cannot be drivers because except disk drivers I did not have any other ones. and because changing network card to something supported by windows itself also broke for me when booting. That means that as you said I have to provide more data. I hope I will be able to do this, I can do everything that does not involve rebooting (or often rebooting) the server, but that should not be a problem unless the kernel must be compiled with special options to use tracing?

lersek commented 8 years ago

Right, rebooting the host should not be necessary, I expect.

webczat commented 8 years ago

I know that the bug does not appear on my laptop with a similar configuration, so this may not be the kernel, or it may depend on hardware somehow.

lersek commented 8 years ago

Perhaps on your i5-750 processor (Nehalem microarchitecture, code name Lynnfield), the -cpu host option (documented as "KVM processor with all supported host features (only available in KVM mode)") tickles something in Windows 10 that Windows 10 doesn't like.

How about trying another -cpu option that exposes a well-defined set of CPU features to the guest, while those features remain a subset of your host's features? I'm thinking of -cpu Penryn,enforce=true.

If QEMU rejects it, you can try -cpu Conroe,enforce=true.

webczat commented 8 years ago

It was one of the... known things to change the cpu model. and, I tried that once. with absolutely no result. And I actually changed to core2duo. It crashed like I did nothing. I am now trying to install the system. one unknown bsod during disk driver loading install, and ... the one on oob somehow does not happen or rather probably happened once when i was not looking. it may be somehow complicated, it seems, to reproduce.

webczat commented 8 years ago

I currently have a big problem. On this build, it always goes to the first configuration screen, then I press use express settings, then it... not sure what, I do not see any blue screen. but it restarts. the problem is that I do not really know if this time my cd was damaged or rather it is the same bug. and kvm traces are extra large.

lersek commented 8 years ago

Answering in reverse order:

> kvm traces are extra large.

Yes, they always are, but it can be mitigated if you can localize (or interactively trigger) a crash. In that case start trace-cmd record -b 20000 -e kvm and await the Hit Ctrl^C to stop recording message just before the crash, and interrupt tracing right after. (But, I guess you already know this.)

> if this time my cd was damaged

If you open the details for any given ISO in MSDN, it will tell you the SHA1 of the image, and you can verify it after download.
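
Checking a downloaded image against a published SHA1 could look like the sketch below. It assumes GNU coreutils' `sha1sum` is available on the host; the function name `verify_sha1` is made up for illustration, and the expected value is whatever the download page lists (lower-case hex).

```shell
#!/bin/sh
# Succeed only if the SHA-1 of file $1 equals the expected hex digest $2.
verify_sha1() {
  actual=$(sha1sum "$1" 2>/dev/null | cut -d' ' -f1)
  [ -n "$actual" ] && [ "$actual" = "$2" ]
}

# Usage (EXPECTED_SHA1 is a placeholder for the published value):
#   verify_sha1 installer.iso "$EXPECTED_SHA1" && echo "image OK"
```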

> On this build, it always goes to the first configuration screen, then I press use express settings, then it... not sure what, I do not see any blue screen. but it restarts.

I repeated the installation on my laptop (= same host as above), with special attention to reboots. There are two reboots in total, after the initial (normal) boot:

Then the "express settings" button appears (on the "get going fast" screen). After clicking it, there is no reboot. What follows are: "just a moment" animation, account creation, "hi we're happy you're here" animation, and that's it -- it proceeds to the normal desktop GUI.

So, two restarts in total, and none of those should be occurring after clicking "express settings". If you have an OVMF debug log, you should find three instances of the string SEC: Normal boot in it.
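
Counting those boots in a captured log is a one-liner. A sketch, assuming the debug log was written to a file via `-debugcon file:...`; the function name `count_ovmf_boots` is hypothetical:

```shell
#!/bin/sh
# Print how many times the firmware started, according to an OVMF debug log.
count_ovmf_boots() {
  grep -c 'SEC: Normal boot' "$1"
}

# Usage: a clean installation as described above should report 3.
#   count_ovmf_boots ovmf.debug.log
```

A higher count than expected would point at an extra, unplanned guest reset.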

If the guest reboots for you after clicking "express settings", hopefully you can start and finish KVM tracing just around that button click.

... I've checked in our internal shared machine pool whether we have an i5-750 -- we don't appear to have one. The closest I've found is an i5-650 (same microarchitecture, i.e., Nehalem, but a Clarkdale desktop processor rather than your Lynnfield). I queued a reservation for it. Perhaps I can reproduce a BSOD on it.

webczat commented 8 years ago

honestly, after clicking this "use express settings" button, it restarts, and then returns to the screen with this button. something like a fatal error without blue screen? effect of a damaged cd? what it was, I am not sure. and thanks for pointing for sha1 checksums, although I am not sure if they are available on onthehub (I have my win10 from dreamspark subscription). This thing is hard to localize to make a small trace, unfortunately! I tried by removing few things that usually caused only bsod after network driver install, I would even have a chance to disable autorestarting.

zwei4 commented 8 years ago

I met the same “restart” issue after clicking the "use express settings" button while I was verifying Windows 10 installation on my board. In my case, it was caused by an invalid “century” value of the RTC (real time clock). I am not sure if this is the reason for your failure, but I suggest you double-check the “century” value of the RTC.

Thanks, David

Intel SSG BIOS


webczat commented 8 years ago

If I only knew what is a century value...

lersek commented 8 years ago

@zwei4 that's very helpful, thank you for chiming in!

@webczat I think @zwei4 means one of these commits: commit 41628cbc7cb or commit e38ab18a49ed. However, I don't think your OVMF build lacks these commits, because in the issue report you identified commit 3ab7066 as your build basis, and that one contains both of the former two.

webczat commented 8 years ago

well, but actually that problem with pressing use express settings appeared now, and I already have a newer commit. so it may be something new, or specific to this windows build, or whatever.

lersek commented 8 years ago

@webczat

> honestly, after clicking this "use express settings" button, it restarts, and then returns to the screen with this button. something like a fatal error without blue screen

Again, can you start KVM tracing when the button first appears, just before clicking the button? And, stop tracing right after the VM reboots and the button appears for the second time?

> I have my win10 from dreamspark subscription

AHA! That's very interesting for a separate reason. As I mentioned above, I couldn't find a debug/checked ISO for Windows 10 in MSDN, only a separate debug symbols MSI (that one has to install inside the already running guest). I didn't understand this, so I googled it. The most interesting result is:

https://www.osronline.com/showthread.cfm?link=269544

It suggests that the debug/checked ISO is available only on the DreamSpark website. (Am I correct to think that DreamSpark serves student subscriptions only?) It's strange.

Anyway, if you give me the name of the ISO file you have, I can tell you the SHA1 for it (assuming I downloaded the same file already, or if it is at least available in MSDN too -- which is apparently not guaranteed, according to the above).

I can give you the following checksums right now:

dbe728416545ea3e47fba05575e81ad0f595871f  en_windows_10_enterprise_2015_ltsb_n_x64_dvd_6848316.iso
e9e214f128ed325cba8782ce1946727807340c8b  en_windows_10_enterprise_2015_ltsb_n_x86_dvd_6848317.iso
edc8b56279baef8f9a9c97744426dbce3083cac6  en_windows_10_n_multiple_editions_version_1511_updated_apr_2016_x64_dvd_8710592.iso

webczat commented 8 years ago

The name of the iso is the one I gave you previously: en_windows_10_multiple_editions_debug-checked_version_1511_x64_dvd_7226317.iso I have downloaded it then put on server. not sure if my downloaded copy was valid, but I am sure it did not get damaged during upload to server. Yes, I am going to try installation again and do what you suggest, unless this will not happen again, of course. The thing is why this cd behaves differently from the multiple editions cd without debug symbols? Unless, of course, it was damaged.

webczat commented 8 years ago

My cd seems to be older than the normal one. not sure if I already said that?

webczat commented 8 years ago

During the last install I had two bsods in the first pass, probably when it said it was getting ready, so I got impatient. I will not send a trace of the event for now; I will redownload the non-debug version of windows and try with it. It should be more predictable and I can possibly get the original bsods again. I can maybe even install the pvpanic driver that will nicely pause the guest for me?

webczat commented 8 years ago

and now it became weird because I cannot reproduce the issue any longer! I just redownloaded the cd directly on server and it works. but I am not so sure the first one was damaged. it worked on seabios, it worked on ovmf when run on my laptop, and the sha1 checksum was correct between the laptop's copy and the server's copy when I tried to upload it. I didn't check checksum the first time, but I did it the second time when uploading. I do not remember if I ever redownloaded this cd though. And now I installed 2 times, once with virtio blk and no pcie ports, once with virtio scsi and pcie ports, and for now it did not bsod. As you know, even if everything is because of a damaged iso image, I was unable to verify the image due to no checksums from msdn. What do you think?

lersek commented 8 years ago

@webczat: I propose the following:

If you get a panic in a week, we can investigate it. Otherwise, we can close this item as insufficient data end of next week or so.

@stefanha: does upstream QEMU support the flight recorder with systemtap? I couldn't find any more upstream documentation than the comments in scripts/kvm/kvm_flightrecorder. Thanks.

(@webczat: If we figure out how to enable the KVM flight recorder, then you could also utilize it for a week. It saves trace events into a ring buffer in host memory, and when you get a crash, you can dump out the trace. The ring buffer size is configurable, so you can influence memory consumption and how far the log will reach back.)

webczat commented 8 years ago

Oh, I can now tell you when this bug happens and when it does not. I cannot tell you why, however. I can probably confirm that it is a bug of ovmf. The reason is that the bug only happens when using a release build, even of the latest commit. When I build a debug build, even without the verbose flag, it works. Maybe the bug exists there too, but it may be at least harder to trigger. I am sorry for writing so many comments, I am just providing updates. This time, I had the idea to roll ovmf back to the state it was in before: revert the debug flag changes, check out the commit I gave in the bug, then build a release build, not a debug build. And all the previous bsods appeared again. Then I rebuilt ovmf from the latest commit, and the bsods were still there! I switched from a release to a debug build, and it stopped bsoding. It sounds weird, but seems to be true. What should I provide now? As said, I cannot provide an ovmf debug log, but can probably do everything else. The above even explains one of the possible reasons why you could not reproduce this issue.

lersek commented 8 years ago

Very interesting, indeed I never build RELEASE binaries. Let me see if it reproduces on my end.

webczat commented 8 years ago

Even if it does not, then because the bug is pretty much always triggered by connected network cable, I could probably install pvpanic driver now, then trigger it manually. I believe I also had a release build on my laptop and it worked, but laptop is far slower. Could it be some kind of a timing issue, or something similar, maybe? If so, it is not guaranteed that you will be able to reproduce it.

lersek commented 8 years ago

Okay, since you apparently have a good, reliable reproducer, would you mind please testing the following two builds (no other changes to your environment, and please don't combine the following two points):

lersek commented 8 years ago

I would like to ask for the following info as well:

thanks

lersek commented 8 years ago

It does not reproduce for me. I've built a RELEASE binary for OVMF, and when I installed the NetKVM driver in the guest, nothing bad happened.

lersek commented 8 years ago

I've also done a disable/enable cycle on the virtio-net NIC in Windows, no problems.

webczat commented 8 years ago

If you had my configuration with the PCIe ports, the bug would actually trigger without the virtio-net drivers: after the second restart there would be a tcpip.sys BSOD. If you removed the PCIe ports, the bug would be triggered by installing the virtio network driver, or rather by connecting the link. By connecting the link I mean the set_link command (the HMP command, of course), which connects or disconnects the virtual network cable.

Now, the other info you asked for: the virtio drivers are at version 0.1.117, and I got them from ... Red Hat? Fedora? I am not sure exactly.

I have done the tests you proposed. I am not sure I did them correctly, but setting rombar to 0 on virtio-net-pci had no visible effect. Removing NETWORK_IP6_ENABLE and HTTP_BOOT_ENABLE did, however. I did not combine the two changes, as you asked. Also, I did not try to log in. The only working configuration of those, the RELEASE build without HTTP boot and IPv6 support, did not cause any problems. The build with IPv6, both with rombar at its default and with rombar set to 0, caused a BSOD on boot; I am not sure it even managed to show the login screen.

lersek commented 8 years ago

> If you had my configuration with pcie ports, then actually the bug would trigger without virtio net drivers, after second restart there would be a tcpip.sys bsod

My test script (see a bit higher up) does specify the two ioh3420 devices, and I haven't seen the BSOD yet. BTW, since you mention tcpip.sys, can you upload a screenshot of the BSOD?

> set_link [...] hmp command

okay, I was missing this
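For anyone scripting the reproducer: on the HMP monitor the command takes the form `set_link <name> on|off`. The same toggle is available over QMP as `set_link` with `name` and `up` arguments. The sketch below only builds the QMP request strings; the device id `net0` is a hypothetical placeholder (use the `id=` of the NIC from your own command line), and the helper `set_link_command` is made up for illustration.

```python
import json


def set_link_command(name, up):
    """Build a QMP 'set_link' request for the NIC with the given device id."""
    return json.dumps(
        {"execute": "set_link", "arguments": {"name": name, "up": up}}
    )


# "net0" is a hypothetical device id, not taken from the reported setup.
link_down = set_link_command("net0", False)  # unplug the virtual cable
link_up = set_link_command("net0", True)     # plug it back in
```

Each string would be written as one line to the QMP socket after the usual `qmp_capabilities` handshake; the socket handling is left out here to keep the sketch self-contained.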

Version 0.1.117 of NetKVM very likely came from https://fedoraproject.org/wiki/Windows_Virtio_Drivers (their latest version is indeed 0.1.117).

So, if I understand your results correctly, with -D HTTP_BOOT_ENABLE -D NETWORK_IP6_ENABLE, you see the BSOD regardless of setting rombar=0. And, with those build options removed, you don't see the BSOD, again regardless of setting rombar=0.

Could you please determine which one of HTTP_BOOT_ENABLE and NETWORK_IP6_ENABLE makes the difference? Thanks.
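To keep the test runs organized, the four flag combinations can be enumerated mechanically. This is only a sketch: it mirrors the `build -D ...` form quoted at the top of this issue, keeps SECURE_BOOT_ENABLE constant, and leaves out the platform, architecture, toolchain, and target options, which depend on the local setup. The helper `build_matrix` is hypothetical.

```python
from itertools import product

# Constant part of the command, as quoted in the original report.
BASE = ["build", "-D", "SECURE_BOOT_ENABLE"]
# The two flags whose effect needs to be separated.
FLAGS = ["HTTP_BOOT_ENABLE", "NETWORK_IP6_ENABLE"]


def build_matrix(flags):
    """Return one build command line per on/off combination of `flags`."""
    commands = []
    for enabled in product([False, True], repeat=len(flags)):
        cmd = list(BASE)
        for flag, on in zip(flags, enabled):
            if on:
                cmd += ["-D", flag]
        commands.append(" ".join(cmd))
    return commands


commands = build_matrix(FLAGS)
for line in commands:
    print(line)
```

Two of the four combinations (both flags on, both flags off) have already been tested, so only the two single-flag builds remain.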

lersek commented 8 years ago

BTW, you mentioned QEMU 2.5.2 -- that version doesn't exist.

webczat commented 8 years ago

Maybe it was 2.5-2? I am not sure yet. That would be version 2.5, build 2 (the second packaging in Arch); maybe I just made a mistake. Assume version 2.5. I have tested what you asked. The result: the BSOD persists without HTTP boot support (with IPv6 still enabled), but vanishes with HTTP boot support and without IPv6. So it seems that IPv6 support causes it all. If you want a screenshot of the BSOD, here it is: http://webczatnet.pl/webczat/public/screenshot.png

I am not sure whether I captured the whole screen or just the VM/SPICE window, but you should see the BSOD. It is, however, a Polish-language system, so if you have any trouble reading it, please ask.

stefanha commented 8 years ago

@lersek Flight recorder tracing is supported upstream. The QEMU scripts/kvm/kvm_flightrecorder script uses kernel trace events instead of SystemTap. It will trace all the kvm.ko events listed by "perf list | grep kvm:".

For QEMU tracing you can use ./configure --enable-trace-backends=ftrace, which works together with the trace events flight recorder approach above.

Or, if you prefer SystemTap, you can use --enable-trace-backends=dtrace. The qemu-*-simpletrace.stap file can generate a binary trace. Use SystemTap's flight recorder mode (https://www.sourceware.org/systemtap/SystemTap_Beginners_Guide/using-usage.html#flight-recorder) to run and dump the ring buffer. You can pretty-print the trace using QEMU's scripts/simpletrace.py --no-header ./trace-events path/to/ringbuffer.