tianocore / edk2

EDK II
https://github.com/tianocore/tianocore.github.io/wiki/EDK-II

random bsod of windows10 #94

Closed webczat closed 8 years ago

webczat commented 8 years ago

Hello.

When I install Windows 10 on QEMU with KVM enabled and the OVMF firmware on my dedicated server, the installed guest experiences frequent, seemingly random blue screens. Depending on the machine's current device configuration, the blue screen can be provoked by installing the network driver (only if the link is connected) or by connecting the network link after the driver is installed. Sometimes the blue screens happen randomly during Windows usage, sometimes at regular intervals, and sometimes it is enough to wait a minute after the OOBE (out-of-box experience) screen appears. For a given configuration the BSOD always happens at nearly the same point; once the machine is reconfigured, the BSODs may happen in a different way.

There is no good configuration. I tried the virtio-net driver and changing the virtio-net card to e1000, tried virtio-blk and virtio-scsi for the hard drives, added and removed PCIe ports (yes, it made a difference for some reason), and tried changing the number of CPUs and the CPU type, all without success. The only thing I didn't try is changing the machine type from q35 to something else. So a Windows 10 guest just does not work in my environment when using OVMF. None of the things I have experienced seem to happen with SeaBIOS, and I have tested a few configurations. Here is all the information I can give about my setup and the host/guest hardware and software:

host processor: Intel(R) Core(TM) i5 CPU 750 @ 2.67GHz
memory: 16 gigabytes of ram
linux kernel version: 4.5.4
qemu version: 2.5.2
edk2 commit: 3ab7066e8d8ae43d9cdee76600b90918f8bee5d9
ovmf build command: build -DSECURE_BOOT_ENABLE -DNETWORK_IP6_ENABLE -DHTTP_BOOT_ENABLE
guest windows version: 10 professional n, build 1511

This is one of the worst configurations of the VM; it causes the BSOD almost immediately after the OOBE screen appears:

#!/bin/sh

umask 0077
cd ~/vms/vm
/bin/qemu-system-x86_64 \
  -name vm -D vm.log \
  -spice tls-port=25000,x509-cacert-file=../ca/cacert.pem,x509-cert-file=cert.pem,x509-key-file=key.pem,x509-dh-key-file=dh.pem,password=mypass \
  -machine q35,accel=kvm,iommu=on -nodefaults -cpu host \
  -chardev stdio,id=interface,mux=on,signal=off \
  -chardev socket,id=control,server,nowait,path=control.sock \
  -mon chardev=interface,mode=readline -mon chardev=control,mode=control \
  -smp cores=4 -m size=2048M \
  -device ib700 -device pvpanic -device isa-serial,chardev=interface \
  -device ioh3420,port=1,chassis=1,id=pcieport1 \
  -device ioh3420,id=pcieport2,port=2,chassis=2 \
  -device virtio-balloon-pci,disable-modern=false \
  -device virtio-rng-pci,disable-modern=false \
  -device virtio-serial-pci,disable-modern=false \
  -chardev socket,id=agent,server,nowait,path=agent.sock \
  -device virtserialport,chardev=agent,name=org.qemu.guest_agent.0 \
  -chardev spicevmc,name=vdagent,id=vdagent \
  -device virtserialport,chardev=vdagent,name=com.redhat.spice.0 \
  -device qxl-vga \
  -device ich9-intel-hda -device hda-micro \
  -netdev bridge,br=vmnet,id=net \
  -device virtio-net-pci,disable-modern=false,netdev=net \
  -device virtio-scsi-pci,disable-modern=false \
  -device nec-usb-xhci -device nec-usb-xhci \
  -chardev spicevmc,name=usbredir,id=usbredir1 \
  -device usb-redir,chardev=usbredir1 \
  -chardev spicevmc,name=usbredir,id=usbredir2 \
  -device usb-redir,chardev=usbredir2 \
  -device usb-tablet \
  -drive id=cd1,media=cdrom,if=none \
  -device ide-cd,drive=cd1,bus=ide.0 \
  -drive id=cd2,if=none,media=cdrom \
  -device ide-cd,drive=cd2,bus=ide.1 \
  -drive id=disk,if=none,media=disk,format=qcow2,file=disk.img \
  -device scsi-hd,drive=disk \
  -drive if=pflash,media=disk,format=raw,file=firmware.bin \
  -drive if=pflash,media=disk,format=raw,file=vars.bin \
  -watchdog-action reset -rtc base=localtime \
  -boot menu=on \
  $@

lersek commented 8 years ago

The screenshot is alright, I also used google translate to understand it. I googled the error message, and found nothing useful. :(

I retried with QEMU v2.5.0, no BSOD. Sigh.

Can you try grabbing a KVM trace after the second restart? When the TianoCore logo is displayed, press ESC. That will enter the UEFI setup menu, and give you time for starting KVM tracing. Then you can resume the boot process with the Continue option. (I'm grasping at straws, sorry -- it's maddening that it only reproduces on your machine.)
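For reference, the tracing steps could be sketched roughly as follows, assuming trace-cmd is installed on the host and the commands are run as root. The helper names (run, record_kvm_trace, report_kvm_trace), the 20000 KB buffer size, and the file names are illustrative assumptions, not something taken from this report.

```shell
#!/bin/sh
# run: print the command when DRY_RUN=1, otherwise execute it.
# (Hypothetical helper so the commands can be reviewed without tracing.)
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Start recording all kvm trace events into the given file (default trace.dat).
# -b sets the per-CPU ring buffer size in KB; stop the recording with Ctrl-C.
record_kvm_trace() {
    run trace-cmd record -b 20000 -e kvm -o "${1:-trace.dat}"
}

# Print the recorded trace in text form on stdout.
report_kvm_trace() {
    run trace-cmd report -i "${1:-trace.dat}"
}

# Typical sequence, while the guest waits in the UEFI setup menu:
#   record_kvm_trace trace.dat      # then pick Continue in the guest
#   report_kvm_trace trace.dat > report.txt
```

Starting the recording while the guest sits in the setup menu keeps the firmware's earlier boot activity out of the trace, which should help with its size.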

webczat commented 8 years ago

At this point I am not even reinstalling the system between tests! When you recompile the firmware without IPv6, the BSOD does not happen; with IPv6, it does. That is all. I no longer need a fresh guest each time, so KVM tracing is far, far simpler for me now, unless you have a specific reason for me to reinstall again. I would not be fast enough to press ESC after the second restart of a reinstall, as I am not constantly watching this window, and I am blind, so even having it displayed somewhere visible would not help. I will try doing a trace now.

webczat commented 8 years ago

It is large. Should I give a dat file or the report? Here is a dat file: http://webczatnet.pl/webczat/public/trace.dat

There is no report, because I get this error when trying to generate it:

could not load plugin '/usr/lib/trace-cmd/plugins/plugin_kvm.so'
/usr/lib/trace-cmd/plugins/plugin_kvm.so: undefined symbol: ud_translate_att

lersek commented 8 years ago

@stefanha thank you for the info!

lersek commented 8 years ago

@webczat thank you for the dat file. I hope I can translate it to a textual report on my end, although it is really big. Downloading...

(Also it's really impressive you can do all this while being blind; I hope I haven't given you a hard time with all my requests.)

lersek commented 8 years ago

BTW can you reproduce it if you keep IPv6 enabled, but disable the secure boot feature?

lersek commented 8 years ago

(Also, such trace DATs compress really well... for the next time :))

webczat commented 8 years ago

With everything except Secure Boot enabled, there is no BSOD. At least it seems so. I didn't know Secure Boot takes so much space!

lersek commented 8 years ago

yeah the SB feature embeds a nice large subset of OpenSSL in the firmware binary.

So, I looked at the KVM trace. Translated to text form, it is 3.3G in size. It is unmanageable. I tried to employ various "tricks" (grepping, sorting, counting unique lines and looking at low-frequency lines, etc), but I'm not seeing anything bad in it. Of course, even if there is anything bad in it, I would likely miss it, at this size.
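One of those tricks, looking for low-frequency entries, could be sketched like this. It is a simplification that counts event names rather than whole lines, and it assumes report lines shaped like "<task>-<pid> [<cpu>] <secs>.<usecs>: <event>: <details>"; the function name rare_events and the file name report.txt are hypothetical.

```shell
#!/bin/sh
# rare_events FILE: print "<event> <count>" pairs, rarest event first.
# Assumes each line looks like:
#   <task>-<pid> [<cpu>] <secs>.<usecs>: <event>: <details>
# Lines that do not match this shape are skipped.
rare_events() {
    # Keep only the event name that follows the timestamp.
    sed -n 's/^.*\] *[0-9][0-9]*\.[0-9][0-9]*: \([A-Za-z0-9_]*\):.*$/\1/p' "$1" |
        sort |                        # group identical event names
        uniq -c |                     # count each group
        sort -n |                     # lowest count first
        awk '{ print $2, $1 }'        # reorder to "<event> <count>"
}

# Usage (report.txt is an assumed file name):
#   rare_events report.txt | head -n 20
```

On a multi-gigabyte report the sort step dominates the run time, and a rare event is only a hint about where to look, not evidence of a fault.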

Now, the question about secure boot and ipv6 is relevant for the following reason. In the OVMF DSC files, we include the NetworkPkg/IScsiDxe/IScsiDxe.inf driver only if both SB and IPv6 support are built into the firmware. Otherwise (if either flag is missing), we build in MdeModulePkg/Universal/Network/IScsiDxe/IScsiDxe.inf. (See commit 36c6413f76e5f.)

Thus far you have reported that keeping SB enabled, and flipping just IPv6 back and forth controls the BSOD. Similarly, keeping IPv6 enabled, flipping SB back and forth, it also controls the BSOD.
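To make the reasoning concrete, the selection rule can be written out as a small truth table. This is only a sketch of the rule as described in this comment, not the DSC syntax itself; the function name iscsi_driver_for and the TRUE/FALSE argument convention are illustrative assumptions.

```shell
#!/bin/sh
# iscsi_driver_for SB IP6: print which iSCSI driver the firmware would include.
# SB and IP6 are "TRUE" or "FALSE" (illustrative stand-ins for the
# SECURE_BOOT_ENABLE and NETWORK_IP6_ENABLE build flags).
iscsi_driver_for() {
    if [ "$1" = "TRUE" ] && [ "$2" = "TRUE" ]; then
        # Both features built in: the NetworkPkg driver is used.
        echo "NetworkPkg/IScsiDxe/IScsiDxe.inf"
    else
        # Either flag missing: fall back to the MdeModulePkg driver.
        echo "MdeModulePkg/Universal/Network/IScsiDxe/IScsiDxe.inf"
    fi
}

# Print the full truth table.
for sb in TRUE FALSE; do
    for ip6 in TRUE FALSE; do
        printf 'SB=%s IPv6=%s -> %s\n' "$sb" "$ip6" "$(iscsi_driver_for "$sb" "$ip6")"
    done
done
```

Only the first row selects the NetworkPkg driver, so turning off either flag from the both-enabled build swaps the driver, which matches both observations.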

I suspect that NetworkPkg/IScsiDxe/IScsiDxe.inf is somehow related to the crash. For example, it is able to install an ACPI table called IBFT, and the guest could definitely see that.

Or else... what happens if you restore the guaranteed-to-BSOD setup (both IPv6 and SB compiled in), but change the network backend? In your setup you have -netdev bridge,br=vmnet -- do you have some DHCP server on that network for example? What happens with -netdev user?

I'm trying to figure out if some special traffic on your network contributes to this BSOD. I guess that would be consistent with the fact that your laptop (and my laptop too, for that matter) are on different physical subnets than your server. (Given that you employ bridged networking, the VM is affected by the traffic on the server's subnet.)

I've really never seen anything like this before, and debugging it from afar can only lead to semi-random ideas, sorry :(

webczat commented 8 years ago

I may be able to test it in a while. I am actually bridging to an IPv6 tunnel and to an IPv4 network, so I run dhcpv6/dhcpv4/radvd. About Secure Boot/IPv6 being flipped: why, then, does everything work properly with a debug build, even with everything enabled? About testing another net backend: this is your first bad guess. Changing it to -netdev user,id=net does not help; it BSODs too. Ah, and this time even removing the network cable does not stop the BSODs, because of one of the other devices, like the pcieport or virtio-scsi, or both of them. Removing SCSI requires a reinstall, so I am not sure I will even try that for now.

webczat commented 8 years ago

I still think that even if the issue is related to what you describe, it may be something about timing. You cannot reproduce it, and I cannot on a laptop, but I can here, and only if debug is not enabled. Debug always makes everything a little slower. That is my guess, although of course I cannot tell which thing is the problem here.

DemiMarie commented 8 years ago

One thought (not specific to OVMF) is that some C code has undefined behavior, and thus compiler optimizations make assumptions that turn out to be false. The results at runtime are unpredictable.

Does building with -fno-strict-aliasing -fwrapv -fno-delete-null-pointer-checks prevent the crash?

webczat commented 8 years ago

ouh seriously? where do you change that?

webczat commented 8 years ago

Okay. I changed the line in the ovmf dsc file starting with GCC:____*_CC_FLAGS

Yes, cleaned the build dir before building. No change.

webczat commented 8 years ago

@lersek I once tried something like that: I completely wiped out all traces of iSCSI from OVMF, that is, I modified the fdf and dsc files. I did not even use the alternative iSCSI driver, just disabled all iSCSI modules. Then I compiled it, and it started to work! The odd part is that when I restored iSCSI support, it worked too. Well.

webczat commented 8 years ago

Okay, I may be wrong. I re-added the pcieports that I had removed, and the bug returned. I removed iSCSI, but this time left the alternative iSCSI implementation enabled, and the BSOD is still there.

DemiMarie commented 8 years ago

@webczat I didn't. However, undefined behavior is a VERY common cause of "My program works only in debug mode!"-type failures. -fwrapv -fno-strict-aliasing -fno-delete-null-pointer-checks makes signed integer overflow and strict aliasing violations have defined behavior, and stops the compiler from optimizing on the assumption that a dereferenced pointer cannot be NULL. I believe the Linux kernel uses these flags.

lersek commented 8 years ago

@webczat: sorry, I'm out of ideas.

I googled the error that's visible on your BSOD screenshot -- apparently it is a quite frequent error on physical machines as well (even "brand new ones", as I've learned from various customer complaints on the tubes), so it's not like tcpip.sys itself is rock stable.

I think the BSOD saves a "minidump" file somewhere on the Windows partition, in the guest, and that file could be analyzed with WinDbg somehow. (Although, the examples I've found about it on the net simply stated information that had already been known -- what module crashed etc, but not the why.)

I have worked with WinDbg once or twice before, and IME it's unpredictable whether such a debugging session, without having access to the Windows source code, produces results. (Obviously I have no access to the tcpip.sys source code.) On one occasion I had luck with WinDbg and could deduce / prove the bug (which was actually in Windows's acpi.sys); on another occasion (debugging the boot BSOD on Ia32 OVMF) I couldn't make any debugging progress beyond a specific point.

If the crashing guest OS were Linux, I would obviously dig into the Linux source code, but with a Windows guest, it seems futile -- there's nothing to dig into. At this point I can only recommend that you modify either your QEMU command line or your OVMF build flags, so as not to trigger the problem. I'm sorry -- I don't think I can help.

We can leave this report open if you wish, although I'm not sure for how long we should do that. We still lack hard evidence that the bug is in the OVMF binary, and not in Windows (triggered / exposed by something in the OVMF binary).

webczat commented 8 years ago

I believe it is OVMF for the simple reason that Windows would not see a difference between a debug and a release binary, would it? And that is the first thing that triggers the bug, by itself. I am still convinced it may be related to timing or something similar, but well... The BSOD itself is not necessarily triggered by OVMF directly. Maybe something that OVMF does makes the system install some driver in a wrong way, and it is that which triggers the BSOD later. You can slightly change where the BSOD occurs by changing devices. Yes, I understand it is extremely hard. I could probably try to provide a minidump. In any case, I have a working Windows install with debug OVMF, and it is stable.

lersek commented 8 years ago

I guess I can look at your minidump if you provide me with one (*), but I'll also need the debug symbols that exactly match your guest OS. Or does WinDbg automatically recognize the minidump and download matching symbols?

(*) I'm just noticing you can actually attach files to github issue reports -- under the new comment box, there's a link saying "Attach files by dragging & dropping or selecting them." Perhaps if you compress the minidump with xz -9ev, it will be small enough to attach.

webczat commented 8 years ago

Will see. Still, it is possible you won't find anything, especially if the BSOD is not directly triggered by OVMF. I will provide a minidump when/if possible.

lersek commented 8 years ago

@webczat The Bugzilla server for edk2 has gone live, and GitHub issues will soon be migrated: http://thread.gmane.org/gmane.comp.bios.edk2.devel/14844

It's been six weeks since our last activity in this item. Are you still facing the symptoms you reported? If you have found an alternative or a workaround, I'd like to close this issue, rather than carry it over to the new Bugzilla.

Thanks, Laszlo

lersek commented 8 years ago

If necessary, this item should be continued in https://tianocore.acgmultimedia.com/show_bug.cgi?id=58