intel / haxm

Intel® Hardware Accelerated Execution Manager (Intel® HAXM)
BSD 3-Clause "New" or "Revised" License

No support for NetBSD/amd64 guest #136

Closed krytarowski closed 5 years ago

krytarowski commented 5 years ago

NetBSD/amd64 as guest (tested with NetBSD as host) crashes during the kernel booting process.

I've recorded it:

ftp http://netbsd.org/~kamil/haxm/typescript-netbsd-2018-11-25
script -p ./typescript-netbsd-2018-11-25

(A Linux port of NetBSD script(1) is available at http://netbsd.org/~kamil/lldb/nbscript.c )

Log of HAXM kernel messages:

http://netbsd.org/~kamil/haxm/typescript-netbsd-2018-11-25-dmesg.txt

I was told that NetBSD crashes on a Windows host too (tested by @AlexAltea ).

AlexAltea commented 5 years ago

> I was told that NetBSD crashes on a Windows host too (tested by @AlexAltea ).

Right. Here's the callstack:

[...]
IntelHaxm!KzLowerIrql+0x32
IntelHaxm!hax_enable_preemption+0x25
IntelHaxm!put_vmcs+0x114
IntelHaxm!cpu_vmx_execute+0x52e
IntelHaxm!vcpu_execute+0x12c
IntelHaxm!HaxVcpuControl+0x137
IntelHaxm!HaxIoControl+0xa3
[...]

And extra information via WinDBG, extracted from the crash dump:

BUILD_VERSION_STRING:  17134.1.amd64fre.rs4_release.180410-1804

DUMP_TYPE:  2
BUGCHECK_P1: c0000420
BUGCHECK_P2: fffff8004eaf5092
BUGCHECK_P3: fffff18206f26a60
BUGCHECK_P4: 0
EXCEPTION_CODE: (NTSTATUS) 0xc0000420 - An assertion failure has occurred.

FAULTING_IP: 
IntelHaxm+5092
fffff800`4eaf5092 cd2c            int     2Ch

CONTEXT:  fffff18206f26a60 -- (.cxr 0xfffff18206f26a60)
rax=0000000000000000 rbx=ffffac041dc12010 rcx=0000000000000001
rdx=fffff18206f27550 rsi=0000000000000001 rdi=ffffac0422effef0
rip=fffff8004eaf5092 rsp=fffff18206f27450 rbp=ffffac0422effef0
 r8=ffffffffe0000020  r9=000000008005003b r10=7ffffffffffffffc
r11=0000000000000000 r12=ffffac0422effef0 r13=ffffac042d84cd20
r14=0000000000000002 r15=0000000000000000
iopl=0         nv up di pl nz na pe nc
cs=0010  ss=0018  ds=002b  es=002b  fs=0053  gs=002b             efl=00000002
IntelHaxm+0x5092:
fffff800`4eaf5092 cd2c            int     2Ch
Resetting default scope

CPU_COUNT: 8
CPU_MHZ: af8
CPU_VENDOR:  GenuineIntel
CPU_FAMILY: 6
CPU_MODEL: 9e
CPU_STEPPING: 9
CPU_MICROCODE: 6,9e,9,0 (F,M,S,R)  SIG: 8E'00000000 (cache) 8E'00000000 (init)
BLACKBOXBSD: 1 (!blackboxbsd)
BLACKBOXPNP: 1 (!blackboxpnp)
CUSTOMER_CRASH_COUNT:  1
DEFAULT_BUCKET_ID:  WIN8_DRIVER_FAULT
BUGCHECK_STR:  0x3B
PROCESS_NAME:  qemu-system-x86_64.exe
CURRENT_IRQL:  0
ANALYSIS_SESSION_HOST:  DESKTOP-PFTNRJD
ANALYSIS_SESSION_TIME:  11-25-2018 11:26:53.0867
ANALYSIS_VERSION: 10.0.17134.1 amd64fre

Sorry, I can't share the full dump file right now: it comes from my main OS and contains some private information.

krytarowski commented 5 years ago

When does it crash for you? During the kernel boot process? Or earlier, in or just after the bootloader?

AlexAltea commented 5 years ago

After the bootloader, while the NetBSD kernel is already running as the guest.

raphaelning commented 5 years ago

@AlexAltea Have you added the directory that contains IntelHaxm.pdb to the WinDbg symbol search path? If you do that, you should be able to get a more useful stack trace with exact source files and line numbers. Example:

>echo %_NT_SYMBOL_PATH%
srv*c:\Symbols*http://msdl.microsoft.com/download/symbols;X:\path\to\haxm\platforms\windows\build\out\x64\Debug;

krytarowski commented 5 years ago

I see. For me it breaks after registering a floppy driver (is it incompatible with HAXM?).

I'm also getting a similar EPT exception with other BSD and Linux guests, so something in my pending patch is probably lacking compared to the Windows/macOS/Linux hosts.

AlexAltea commented 5 years ago

@raphaelning Yeah, I did. Though I don't have the PDB file with me right now. I'll update my message regarding the crash with NetBSD guests this weekend.

AlexAltea commented 5 years ago

Sorry for the delay.

I've compiled and tested the latest master revision, 98c8126e, and it crashes. The corresponding driver, PDB file and Windows crash dump are available at: netbsd-crash.zip. This happens on a Windows 10 x64 (10.0.17134.407) host running QEMU 3.0.0 (v3.0.0-11723-ge2ddcc5879-dirty).

Surprisingly, with the same setup, but using HAXM v7.3.2, NetBSD guests boot fine (see image below), so this is a regression. I'll bisect the code to figure out why this happens.

[Image: capture]

krytarowski commented 5 years ago

Wow, impressive! I'm looking forward to the bisection results. Meanwhile, I'm building up the expertise to understand what (if anything) is wrong with NetBSD as a host compared to the others (mostly Linux).

AlexAltea commented 5 years ago

While bisecting the issue, every single revision down to tag v7.3.2 triggered the same host kernel panic, which is extremely surprising since the pre-built HAXM v7.3.2 works perfectly.

@raphaelning What revision was https://github.com/intel/haxm/releases/tag/v7.3.2 compiled from? I'm asking because the 89d591f commit mentioned in the tag triggers the issue, yet the installer attached installs a version of HAXM that successfully loads NetBSD guests.

EDIT: Even going down to tag v7.3.0 the same issue appears. The only difference I see is that the drivers are signed by Intel while during bisect I'm loading custom drivers.

raphaelning commented 5 years ago

> @raphaelning What revision was https://github.com/intel/haxm/releases/tag/v7.3.2 compiled from? I'm asking because the 89d591f commit mentioned in the tag triggers the issue [...]

I thought v7.3.2 was built from that exact commit, but according to our records, it was actually from 3bb831a (Merge pull request #99 from intel/release-7.3.2). And the merge commit did make a difference: it ended up including 99bd00f (Add guest debugging support), which we didn't want to include in 7.3.2. That was an unfortunate mistake, sorry!

> Even going down to tag v7.3.0 the same issue appears. The only difference I see is that the drivers are signed by Intel while during bisect I'm loading custom drivers.

We'll first try to reproduce the 7.3.2 release build from the said merge commit and see if it passes the NetBSD guest test. But FWIW, aside from the difference in driver signature, there are also small details in Debug vs Release build configurations (I assume you are using Debug, because Release doesn't sign the driver at all).

krytarowski commented 5 years ago

I wrote the debug register (DB registers) support in the NetBSD kernel for i386/amd64. It landed in version 8.0.

On boot, the NetBSD kernel reads the DR0-3 and DR6-7 registers and stores them internally. Later, users of ptrace(2) with the PT_GETDBREGS option can read these values, which are presented as the initial ones for each process.

Not sure if that has any impact on HAXM.

krytarowski commented 5 years ago

This happens exactly in this function:

void
x86_dbregs_init(void)
{
    /* DR0-DR3 should always be 0 */
    initdbstate.dr[0] = rdr0();
    initdbstate.dr[1] = rdr1();
    initdbstate.dr[2] = rdr2();
    initdbstate.dr[3] = rdr3();
    /* DR4-DR5 are reserved - skip */
    /* DR6 and DR7 contain predefined nonzero bits */
    initdbstate.dr[6] = rdr6();
    initdbstate.dr[7] = rdr7();
    /* DR8-DR15 are reserved - skip */

    /*
     * Explicitly reset some bits just in case they could be
     * set by brave software/hardware before the kernel boot.
     */
    initdbstate.dr[6] &= ~X86_BREAKPOINT_CONDITION_DETECTED;
    initdbstate.dr[7] &= ~X86_DR7_GENERAL_DETECT_ENABLE;

    pool_init(&x86_dbregspl, sizeof(struct dbreg), 16, 0, 0, "dbregs",
        NULL, IPL_NONE);
}

https://nxr.netbsd.org/xref/src/sys/arch/x86/x86/dbregs.c#57

raphaelning commented 5 years ago

99bd00f (#81) turns out to be irrelevant. I just loaded @AlexAltea's BSOD dump in WinDbg, and noticed the following:

EXCEPTION_CODE: (NTSTATUS) 0xc0000420 - An assertion failure has occurred.

(It was already mentioned in https://github.com/intel/haxm/issues/136#issuecomment-441430361, but I missed it...) I think assertions are only enabled in Debug builds, which explains why the crash doesn't happen with the v7.3.2 release. The BSOD analysis also gives us the full stack trace:

STACK_TEXT:  
ffffa908`e05cf4d0 fffff800`06375245 : ffffd004`666a0d01 ffffe26e`2865975b ffffd002`00000000 ffffd004`66e4de10 : IntelHaxm!KzLowerIrql+0x32 [c:\program files (x86)\windows kits\10\include\10.0.17134.0\km\wdm.h @ 18083] 
ffffa908`e05cf510 fffff800`0638b644 : ffffa908`e05cf5d0 00000000`8005003b ffffd004`666d5e00 ffffffff`e0000020 : IntelHaxm!hax_enable_preemption+0x25 [<REDACTED>\haxm\platforms\windows\hax_wrapper.c @ 228] 
ffffa908`e05cf540 fffff800`0638abae : ffffd004`66c5b010 ffffa908`e05cf5d0 00000000`00000000 00000000`00000000 : IntelHaxm!put_vmcs+0x114 [<REDACTED>\haxm\core\cpu.c @ 608] 
ffffa908`e05cf590 fffff800`0637bddc : ffffd004`66c5b010 ffffe581`b31ab000 ffffa908`e05cf5c8 00000000`00000000 : IntelHaxm!cpu_vmx_execute+0x52e [<REDACTED>\haxm\core\cpu.c @ 445] 
ffffa908`e05cf5f0 fffff800`06372e77 : ffffd004`66c5b010 1a000000`00000000 0a000000`00000001 00000000`00458054 : IntelHaxm!vcpu_execute+0x12c [<REDACTED>\haxm\core\vcpu.c @ 1687] 
ffffa908`e05cf640 fffff800`06372ca3 : ffffd004`68412b30 ffffd004`68412c88 ffffd004`66073710 00002000`1aa44860 : IntelHaxm!HaxVcpuControl+0x137 [<REDACTED>\haxm\platforms\windows\hax_entry.c @ 282] 
ffffa908`e05cf740 fffff800`3943eef9 : ffffd004`68412b30 ffffd004`66073710 ffff9781`4b7762c0 fffff800`397d2f00 : IntelHaxm!HaxIoControl+0xa3 [<REDACTED>\haxm\platforms\windows\hax_entry.c @ 733] 
ffffa908`e05cf780 fffff800`398ea1cb : ffffd004`66073710 ffffa908`e05cfb00 00000000`00000001 00000000`00000000 : nt!IofCallDriver+0x59
ffffa908`e05cf7c0 fffff800`398e987a : ffffd004`00000000 ffffd004`69341f40 00000000`00000000 ffffa908`e05cfb00 : nt!IopSynchronousServiceTail+0x1ab
ffffa908`e05cf870 fffff800`398ea006 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!IopXxxControlFile+0x68a
ffffa908`e05cf9a0 fffff800`395bdd43 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!NtDeviceIoControlFile+0x56
ffffa908`e05cfa10 00007ffe`6bfd9fe4 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiSystemServiceCopyEnd+0x13
00000000`047bfc88 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : 0x00007ffe`6bfd9fe4

as well as the failed assertion:

FAULTING_SOURCE_CODE:  
 18079: --*/
 18080: 
 18081: {
 18082: 
>18083:     NT_ASSERT(KeGetCurrentIrql() >= NewIrql);
 18084: 
 18085:     WriteCR8(NewIrql);
 18086:     return;
 18087: }
 18088: 

I'll look into this.
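
The Debug-versus-Release difference described in this comment can be sketched in plain C. This is only an illustration under stated assumptions: `CHECKED_ASSERT`, `FREE_ASSERT`, `lower_irql_checked` and `lower_irql_free` are hypothetical names, the failure counter stands in for the `int 2Ch` bugcheck, and the real `NT_ASSERT` macro is not reproduced here.

```c
#include <assert.h>

/* Hypothetical stand-in for the bugcheck: count failed assertions. */
static int assert_failures = 0;

/* "Debug" flavour: the condition is evaluated and a failure is recorded. */
#define CHECKED_ASSERT(cond) ((cond) ? (void)0 : (void)++assert_failures)

/* "Release" flavour: the check is compiled out entirely. */
#define FREE_ASSERT(cond) ((void)0)

/* Model of the IRQL-lowering routine built with assertions enabled. */
static unsigned lower_irql_checked(unsigned current_irql, unsigned new_irql)
{
    assert(new_irql <= 15);               /* CR8 holds a 4-bit priority */
    CHECKED_ASSERT(current_irql >= new_irql);
    return new_irql;                      /* the write to CR8 happens either way */
}

/* The same routine built with assertions disabled. */
static unsigned lower_irql_free(unsigned current_irql, unsigned new_irql)
{
    assert(new_irql <= 15);
    FREE_ASSERT(current_irql >= new_irql);
    (void)current_irql;                   /* unused once the check is gone */
    return new_irql;
}
```

Calling `lower_irql_free(0, 1)` silently returns 1, while `lower_irql_checked(0, 1)` returns the same value but records a failure. That would be consistent with a Release driver tolerating the condition that stops a Debug driver.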

raphaelning commented 5 years ago

I've implemented a quick and dirty fix at #145. The problem is that NetBSD writes 0 to CR8, but HAXM doesn't virtualize CR8 at all, so the host value gets overwritten. Windows x86_64 uses CR8 to store the current IRQL (which is evident from the fact that KeLowerIrql() simply calls WriteCR8()), and will crash if HAXM (actually the guest) changes the value behind its back:

  1. Before VM entry, hax_disable_preemption() calls KeRaiseIrql() to raise IRQL (CR8) from APC_LEVEL == 1 to DISPATCH_LEVEL == 2.
  2. Guest (NetBSD) uses MOV to overwrite CR8 with 0.
  3. After VM exit, hax_enable_preemption() calls KeLowerIrql() to restore the old CR8 value (1), but this violates the assertion, since the current IRQL (0) is even lower.
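
The three steps above can be replayed in a small user-space simulation. This is a sketch, not HAXM or Windows code: `host_cr8`, `sim_raise_irql`, `sim_lower_irql` and `sim_vm_cycle` are hypothetical names, and the `virtualize_cr8` flag models one possible remedy (keeping guest writes in a shadow value), which is an assumption and not necessarily what #145 does.

```c
#include <assert.h>

enum { PASSIVE_LEVEL = 0, APC_LEVEL = 1, DISPATCH_LEVEL = 2 };

/* Simulated host CR8, which Windows x86_64 uses to hold the current IRQL. */
static unsigned host_cr8 = APC_LEVEL;

/* Model of KeRaiseIrql(): set the new IRQL and return the previous one. */
static unsigned sim_raise_irql(unsigned new_irql)
{
    unsigned old_irql = host_cr8;
    assert(new_irql >= old_irql);         /* raising must not lower the IRQL */
    host_cr8 = new_irql;
    return old_irql;
}

/* Model of KeLowerIrql(): returns -1 where NT_ASSERT would fire, else 0. */
static int sim_lower_irql(unsigned new_irql)
{
    if (host_cr8 < new_irql)
        return -1;                        /* KeGetCurrentIrql() >= NewIrql violated */
    host_cr8 = new_irql;
    return 0;
}

/* One VM entry/exit cycle, starting from APC_LEVEL as in the description. */
static int sim_vm_cycle(int guest_writes_cr8, int virtualize_cr8)
{
    unsigned guest_cr8 = 0;               /* shadow copy, used only when virtualized */
    unsigned old_irql;

    host_cr8 = APC_LEVEL;
    old_irql = sim_raise_irql(DISPATCH_LEVEL);   /* step 1: before VM entry */

    if (guest_writes_cr8) {                      /* step 2: guest MOV to CR8 */
        if (virtualize_cr8)
            guest_cr8 = 0;                /* write lands in the shadow */
        else
            host_cr8 = 0;                 /* write clobbers the host value */
    }
    (void)guest_cr8;

    return sim_lower_irql(old_irql);             /* step 3: after VM exit */
}
```

With these definitions, `sim_vm_cycle(1, 0)` returns -1 and leaves `host_cr8` at 0, mirroring the failed assertion, while `sim_vm_cycle(0, 0)` and `sim_vm_cycle(1, 1)` both return 0 and end at `APC_LEVEL`.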