Closed pbraun9 closed 6 years ago
Take a look at https://github.com/tklengyel/drakvuf/wiki/Debugging-DRAKVUF
LVM is not mandatory, you can use whatever disk device you want. I don't think this issue is related to the disk.
with a new guest called xenial3
being an Ubuntu Xenial, no stdout, just the crash
(XEN) d15v0 vmentry failure (reason 0x80000021): Invalid guest state (0)
(XEN) ************* VMCS Area **************
(XEN) *** Guest State ***
(XEN) CR0: actual=0x000000008005003b, shadow=0x0000000080050033, gh_mask=ffffffffffffffff
(XEN) CR4: actual=0x0000000000362670, shadow=0x0000000000360670, gh_mask=ffffffffffffffff
(XEN) CR3 = 0x8000000017464000
(XEN) PDPTE0 = 0x0000000000000000 PDPTE1 = 0x0000000000000000
(XEN) PDPTE2 = 0x0000000000000000 PDPTE3 = 0x0000000000000000
(XEN) RSP = 0x00007f3d6b8ccc38 (0x00007f3d6b8ccc38) RIP = 0xffffffff8184ef2d (0xffffffff8184ef2d)
(XEN) RFLAGS=0x00000006 (0x00000006) DR7 = 0x0000000000000400
(XEN) Sysenter RSP=0000000000000000 CS:RIP=0010:ffffffff81851f60
(XEN) sel attr limit base
(XEN) CS: 0010 0a09b ffffffff 0000000000000000
(XEN) DS: 0000 1c000 ffffffff 0000000000000000
(XEN) SS: 0018 0c093 ffffffff 0000000000000000
(XEN) ES: 0000 1c000 ffffffff 0000000000000000
(XEN) FS: 0000 1c000 ffffffff 00007f3d6b8cd700
(XEN) GS: 0000 1c000 ffffffff ffff88001f400000
(XEN) GDTR: 0000007f ffff88001f40c000
(XEN) LDTR: 0000 1c000 ffffffff 0000000000000000
(XEN) IDTR: 00000fff ffffffffff574000
(XEN) TR: 0040 0008b 00002087 ffff88001f4048c0
(XEN) EFER = 0x0000000000000000 PAT = 0x0407010600070106
(XEN) PreemptionTimer = 0x00000000 SM Base = 0x00000000
(XEN) DebugCtl = 0x0000000000000000 DebugExceptions = 0x0000000000000000
(XEN) PerfGlobCtl = 0x0000000000000000 BndCfgS = 0x0000000000000000
(XEN) Interruptibility = 00000000 ActivityState = 00000000
(XEN) *** Host State ***
(XEN) RIP = 0xffff82d08030a140 (vmx_asm_vmexit_handler) RSP = 0xffff83050fd47f90
(XEN) CS=e008 SS=0000 DS=0000 ES=0000 FS=0000 GS=0000 TR=e040
(XEN) FSBase=0000000000000000 GSBase=0000000000000000 TRBase=ffff83050fd4ec80
(XEN) GDTBase=ffff83050fd3e000 IDTBase=ffff83050fd4a000
(XEN) CR0=000000008005003b CR3=0000000457286000 CR4=00000000003526e0
(XEN) Sysenter RSP=ffff83050fd47fc0 CS:RIP=e008:ffff82d080348ba0
(XEN) EFER = 0x0000000000000000 PAT = 0x0000050100070406
(XEN) *** Control State ***
(XEN) PinBased=0000003f CPUBased=b6a0e5fa SecondaryExec=001254eb
(XEN) EntryControls=000153ff ExitControls=008fefff
(XEN) ExceptionBitmap=0006008a PFECmask=00000000 PFECmatch=00000000
(XEN) VMEntry: intr_info=000000f3 errcode=00000000 ilen=00000000
(XEN) VMExit: intr_info=00000000 errcode=00000000 ilen=00000003
(XEN) reason=80000021 qualification=0000000000000000
(XEN) IDTVectoring: info=00000000 errcode=00000000
(XEN) TSC Offset = 0xffffdf608612721d TSC Multiplier = 0x0000000000000000
(XEN) TPR Threshold = 0x00 PostedIntrVec = 0x00
(XEN) EPT pointer = 0x000000041774f01e EPTP index = 0x0000
(XEN) PLE Gap=00000080 Window=00001000
(XEN) Virtual processor ID = 0x08d6 VMfunc controls = 0000000000000000
(XEN) **************************************
(XEN) domain_crash called from vmx.c:3337
(XEN) Domain 15 (vcpu#0) crashed on cpu#3:
(XEN) ----[ Xen-4.9.1 x86_64 debug=n Not tainted ]----
(XEN) CPU: 3
(XEN) RIP: 0010:[<ffffffff8184ef2d>]
(XEN) RFLAGS: 0000000000000006 CONTEXT: hvm guest (d15v0)
(XEN) rax: 8000000017464000 rbx: 00000000000012da rcx: 00007ffdd43e9b39
(XEN) rdx: 0000000000000000 rsi: 00007f3d6b8ccc90 rdi: 0000000000000001
(XEN) rbp: 00007f3d6b8ccc60 rsp: 00007f3d6b8ccc38 r8: 0000000000000007
(XEN) r9: 0000000000000001 r10: 00007f3d64001880 r11: 0000000000000246
(XEN) r12: 00007f3d6b8ccc44 r13: 00188de0c5800000 r14: 0000000000000000
(XEN) r15: 0000000000000000 cr0: 0000000080050033 cr4: 0000000000360670
(XEN) cr3: 8000000017464000 cr2: 00007f1368670090
(XEN) fsb: 00007f3d6b8cd700 gsb: ffff88001f400000 gss: 0000000000000000
(XEN) ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: 0018 cs: 0010
This looks like a vmx_failed_vmentry, so this is quite a low-level bug. I would also like to see the DRAKVUF debug logs to see what happens there, but to me this looks like a strange Xen issue. Perhaps try upgrading your Xen installation to Xen 4.10 and see if you still have the problem.
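For readers following along, the reason code in the dump above (0x80000021) can be split into its parts. This is a minimal sketch, assuming the Intel VMX exit-reason layout in which bit 31 marks a VM-entry failure and the low 16 bits carry the basic exit reason; the function name and the small lookup table are illustrative and not part of Xen or DRAKVUF.

```python
# Illustrative helper, not part of Xen or DRAKVUF.
# Assumed layout (Intel VMX): bit 31 = VM-entry failure, bits 15:0 = basic reason.
BASIC_VMENTRY_FAILURES = {
    33: "VM-entry failure due to invalid guest state",
    34: "VM-entry failure due to MSR loading",
    41: "VM-entry failure due to machine-check event",
}


def decode_vmx_exit_reason(reason: int) -> dict:
    """Split a raw VMX exit reason into its failure flag and basic reason."""
    basic = reason & 0xFFFF
    return {
        "vmentry_failure": bool(reason & (1 << 31)),
        "basic_reason": basic,
        "description": BASIC_VMENTRY_FAILURES.get(basic, "not in this small table"),
    }


if __name__ == "__main__":
    # Value taken from the "(XEN) reason=80000021" line in the dump.
    print(decode_vmx_exit_reason(0x80000021))
```

Decoded this way, the value matches the "Invalid guest state" text Xen already printed, which tells you the CPU rejected the guest state at VM entry but not which field it objected to.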
With XEN 4.10.0 and Drakvuf recompiled against the latest libvmi git repository, I got the same issue, this time for the previously used xenial2 with vcpus = 2, maxvcpus = 2, altp2m = 2. Does that last parameter need to reflect the number of vcpus?
DRAKVUF debug: xenial2.debug___.stderr.txt
XEN dmesg xen.xenial2.crash.txt
Update: XEN 4.10.0 dmesg starts with (XEN) parameter "flask_enforcing" unknown!, which does not look good.
I understand now that XSM/Flask is NOT required for Drakvuf to run, so the flask* boot argument is not required either. As for alt2pm=X, I cannot find any hint anywhere on how to define X.
@tklengyel, hi, am I providing the required material, namely the Drakvuf debug output as requested? Here's another one: drakvuf debug / xen 4.9.1 / libvmi 0.13 / rekall 1.7.1
No, on the Xen command line altp2m is just a boolean parameter. Refer to https://xenbits.xen.org/docs/unstable/misc/xen-command-line.html if you want more details. This is not a bug in DRAKVUF but a bug in Xen; however, from the logs posted I can't see what is wrong. Your best bet would be either to debug this yourself or to ask for help on the xen-devel mailing list.
flask_enforcing= and flask= are xen boot arguments (resp. 4.9-? vs 4.10), while alt2pm= is a guest configuration option (and is not mentioned in the xen command line guide at all). Sorry I mentioned both without any transition. The alt2pm=2 is taken from the guest configuration in the tutorial on drakvuf.com. So I guess my question remains.
No, altp2m is both a Xen command line argument as detailed in the document I linked, and a guest configuration option. You have to have both enabled properly for DRAKVUF to work. But neither that nor the flask option being enabled (or not) would cause the failed vmentry you are seeing. Your bug is likely somewhere else.
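The "both places" requirement can be checked mechanically. The following is a minimal sketch, assuming the Xen boolean option accepts the usual spellings (bare name, no- prefix, and =1/yes/on/true/enable or =0/no/off/false/disable) and that the host-side default is off; the function names are hypothetical and the guest-config parsing is deliberately naive.

```python
import re

# Assumed spellings for a Xen boolean command line option.
_TRUE = {"1", "yes", "on", "true", "enable"}
_FALSE = {"0", "no", "off", "false", "disable"}


def host_altp2m_enabled(xen_cmdline: str) -> bool:
    """Return True if the Xen (hypervisor) command line turns altp2m on."""
    enabled = False  # assumption: altp2m is off unless requested
    for token in xen_cmdline.split():
        if token == "altp2m":
            enabled = True
        elif token == "no-altp2m":
            enabled = False
        elif token.startswith("altp2m="):
            value = token.split("=", 1)[1].lower()
            if value in _TRUE:
                enabled = True
            elif value in _FALSE:
                enabled = False
    return enabled


def guest_altp2m_enabled(guest_cfg: str) -> bool:
    """Return True if an xl guest config sets altp2m to anything but disabled."""
    for line in guest_cfg.splitlines():
        line = line.split("#", 1)[0]  # naive comment stripping
        match = re.match(r"\s*altp2m\s*=\s*[\"']?(\w+)[\"']?\s*$", line)
        if match:
            return match.group(1).lower() not in {"0", "disabled"}
    return False


if __name__ == "__main__":
    print(host_altp2m_enabled("dom0_mem=12288M altp2m=1"))
    print(guest_altp2m_enabled('type = "hvm"\naltp2m = 2\n'))
```

Both checks need to pass for the setup described in this thread; as noted above, passing them says nothing about the failed vmentry itself.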
Right, it is altp2m, not alt2pm. But I am still wondering how to define that setting in the guest configuration.
Back to our issue: I played around with the hap setting (both on xen.gz and in the guest config) and observed that whenever it was disabled for the guest, the whole host crashed.
Then, as long as hap is enabled for that guest, either explicitly or by default, I got back to my xen-bug with a Drakvuf debug trace that looks good (why were previous traces so much larger?):
# src/drakvuf -v -r /root/xenial3.json -d xenial3
DRAKVUF v0.6-1c2a1b0
Starting DRAKVUF initialization
Init VMI on domID 1 -> xenial3
Max GPFN: 0xff001
Max mem set? 0
Physmap populated? 0
Altp2m enabled? 0
Altp2m view X created? 0 with ID 1
Altp2m view R created? 0 with ID 2
Switched Altp2m view to X? 0
libdrakvuf initialized
DRAKVUF initializated
Starting plugins
Starting plugin syscalls
Starting plugin syscalls finished
Starting plugin poolmon
Starting plugin filetracer
Starting plugin filedelete
Starting plugin objmon
Starting plugin exmon
Starting plugin ssdtmon
Starting plugin debugmon
Starting plugin debugmon finished
Starting plugin cpuidmon
Starting plugin cpuidmon finished
Starting plugin socketmon
Starting plugin regmon
Starting plugin procmon
Beginning DRAKVUF loop
Started DRAKVUF loop
^CDRAKVUF loop finished
Finished DRAKVUF loop
starting close_vmi_drakvuf
close_vmi_drakvuf finished
So for the record, to me it looks like hap=false is NOT required as a xen.gz boot argument for Drakvuf to be able to run. In the end I am doing my tests and troubleshooting with this reduced set of parameters: (XEN) Command line: dom0_mem=12288M altp2m=1. As for the guest configuration, I suppose that the following reduced set of settings would also be good. I am not sure about maxmem vs memory though.
#DRAKVUF
altp2m=1
#HVM
type = "hvm"
boot = "cd"
sdl = 1
#PV
name = "xenial3"
memory = 512
vcpus = 1
disk = ['tap:aio:/data/guests/xenial3/xenial3.disk,xvda,w',
'file:/data/ISO-IMAGES/ubuntu-16.04.4-server-amd64.iso,hdc:cdrom,r']
vif = [ 'vifname=xenial3.0' ]
There is no such boot param as hap=false. Turning off the 1gb/2mb page sizes is not required but advised, otherwise Xen will have to shatter pages at runtime. These likely have nothing to do with your crash.
There is such a parameter as hap=, I am not making it up by myself:
hap (x86) = <boolean>
Default: true
Flag to globally enable or disable support for Hardware Assisted Paging (HAP)
Ok, I have upgraded my machine's firmware and I am up for another round of testing.
For the record, the setup I am using now. The guest config:
type = "hvm"
sdl = 1
altp2m = 2
maxmem = 512
name = "devuanhvm"
#memory = 512
vcpus = 2
disk = ['tap:qcow2:/data/guests/devuanhvm/devuanhvm.qcow2,xvda,w',
'file:/data/ISO-IMAGES/devuan.iso,hdc:cdrom,r']
vif = [ 'vifname=devuanhvm.0' ]
The XEN 4.10.0 (+XSM, but disabled with flask=disabled) boot options:
dom0_mem=4096M,max:4096M dom0_max_vcpus=1 dom0_vcpus_pin=true hap_1gb=false hap_2mb=false altp2m=1 flask=disabled
If dom0_max_vcpus and/or dom0_vcpus_pin are not set, running Drakvuf crashes the whole machine. I did not investigate this further, although loglvl=all noreboot might help. So I guess I found out why the host was crashing.
About the guest crashes: well, they still happen, and here is a new snippet of debug output:
Switching to altp2m view 0 on vCPU 0
Post mem cb @ 0xb0e4fa0 vCPU 0 altp2m 0
Pre mem cb with vCPU 0 @ 0xb0e4fa0 in view 1: r--
Switching to altp2m view 0 on vCPU 0
Post mem cb @ 0xb0e4fa0 vCPU 0 altp2m 0
Pre mem cb with vCPU 0 @ 0xb0e4f80 in view 1: r--
Switching to altp2m view 0 on vCPU 0
Post mem cb @ 0xb0e4f80 vCPU 0 altp2m 0
Pre mem cb with vCPU 0 @ 0xb0e4fa0 in view 1: r--
Switching to altp2m view 0 on vCPU 0
Post mem cb @ 0xb0e4fa0 vCPU 0 altp2m 0
Pre mem cb with vCPU 0 @ 0xb0e4fa0 in view 1: r--
Switching to altp2m view 0 on vCPU 0
Post mem cb @ 0xb0e4fa0 vCPU 0 altp2m 0
Pre mem cb with vCPU 0 @ 0xb0e4f88 in view 1: rw-
Switching to altp2m view 0 on vCPU 0
Post mem cb @ 0xb0e4f88 vCPU 0 altp2m 0
Re-copying remapped gfn
Pre mem cb with vCPU 0 @ 0xb0e4f88 in view 1: r--
Switching to altp2m view 0 on vCPU 0
Post mem cb @ 0xb0e4f88 vCPU 0 altp2m 0
Pre mem cb with vCPU 0 @ 0xb0e4f78 in view 1: rw-
Switching to altp2m view 0 on vCPU 0
Post mem cb @ 0xb0e4f78 vCPU 0 altp2m 0
Re-copying remapped gfn
Pre mem cb with vCPU 0 @ 0xb0e4f80 in view 1: rw-
Switching to altp2m view 0 on vCPU 0
Post mem cb @ 0xb0e4f80 vCPU 0 altp2m 0
Re-copying remapped gfn
Pre mem cb with vCPU 0 @ 0xb0e4fa0 in view 1: r--
Switching to altp2m view 0 on vCPU 0
Post mem cb @ 0xb0e4fa0 vCPU 0 altp2m 0
Pre mem cb with vCPU 0 @ 0xb0e4fa0 in view 1: rw-
Switching to altp2m view 0 on vCPU 0
Post mem cb @ 0xb0e4fa0 vCPU 0 altp2m 0
Re-copying remapped gfn
Pre mem cb with vCPU 0 @ 0xb0e4fa0 in view 1: r--
Switching to altp2m view 0 on vCPU 0
Post mem cb @ 0xb0e4fa0 vCPU 0 altp2m 0
[SYSCALL] TIME:1528270679.635098 VCPU:0 CR3:0x1eb4a000,"kworker/0:1" UID:0 linux!sys_imageblit
Switching altp2m and to singlestep on vcpu 0
reset trap on vCPU 0, switching altp2m 0->1
Pre mem cb with vCPU 0 @ 0xb127980 in view 1: r--
[...]
Pre mem cb with vCPU 0 @ 0xb0d98a0 in view 1: r--
Switching to altp2m view 0 on vCPU 0
Post mem cb @ 0xb0d98a0 vCPU 0 altp2m 0
[SYSCALL] TIME:1528270679.861790 VCPU:0 CR3:0x1eb4a000,"kworker/0:1" UID:0 linux!sys_imageblit
Switching altp2m and to singlestep on vcpu 0
reset trap on vCPU 0, switching altp2m 0->1
Pre mem cb with vCPU 0 @ 0xb0d98a0 in view 1: r--
Switching to altp2m view 0 on vCPU 0
Post mem cb @ 0xb0d98a0 vCPU 0 altp2m 0
[SYSCALL] TIME:1528270680.065850 VCPU:0 CR3:0x1eb4a000,"kworker/0:1" UID:0 linux!sys_imageblit
Switching altp2m and to singlestep on vcpu 0
reset trap on vCPU 0, switching altp2m 0->1
Pre mem cb with vCPU 0 @ 0xb0d98a0 in view 1: r--
[...]
Switching to altp2m view 0 on vCPU 0
Post mem cb @ 0xb0d98a0 vCPU 0 altp2m 0
[SYSCALL] TIME:1528270680.271010 VCPU:0 CR3:0x1eb4a000,"kworker/0:1" UID:0 linux!sys_imageblit
Switching altp2m and to singlestep on vcpu 0
reset trap on vCPU 0, switching altp2m 0->1
Does this possibly reveal some other root cause or am I still facing a probable xen bug?
@tklengyel I am still wondering how to use the altp2m= guest setting! Should it match the number of vcpus?
Thanks
The hap option is not a boot param, it is a guest configuration option. I didn't say you made it up, I think you are just confusing Xen boot options and guest config options. Those are very different.
The altp2m guest config option doesn't have to match the number of vcpus. Read the documentation https://xenbits.xen.org/docs/unstable/man/xl.cfg.5.html
The logs you posted don't explain why your guest crashes.
@tklengyel, but the title reads Xen Hypervisor Command Line Options, so I guess the hap= setting does exist. Maybe this is a new setting.
https://xenbits.xen.org/docs/4.9-testing/misc/xen-command-line.html
https://xenbits.xen.org/docs/4.10-testing/misc/xen-command-line.html
https://xenbits.xen.org/docs/unstable/misc/xen-command-line.html
Ok, thank you. As for the guest configuration, 2 seems to correspond to "external".
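To make the numeric versus named form of the guest setting concrete, here is a small sketch of the mapping. Only 2 = "external" is stated in this thread; the other three entries are an assumption based on the mode names listed in xl.cfg(5) and the order libxl is believed to enumerate them in, so treat the table as unverified. The helper name is hypothetical.

```python
# Only 2 -> "external" comes from this thread; the rest is an assumption
# about libxl's enumeration order and should be checked against your Xen version.
ALTP2M_MODES = {0: "disabled", 1: "mixed", 2: "external", 3: "limited"}


def altp2m_mode_name(value) -> str:
    """Normalise a guest-config altp2m value (number or name) to a mode name."""
    text = str(value).strip().strip("\"'").lower()
    if text.isdigit():
        return ALTP2M_MODES[int(text)]
    if text in ALTP2M_MODES.values():
        return text
    raise ValueError(f"unrecognised altp2m value: {value!r}")


if __name__ == "__main__":
    print(altp2m_mode_name(2))
    print(altp2m_mode_name('"external"'))
```

Either way, the value selects a mode and is unrelated to the number of vcpus, which answers the question asked earlier in the thread.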
Please try upgrading your Xen installation to Xen 4.11 rc6 and post if it solves your problem
Hi. Now with XEN 4.11 rc6 and DRAKVUF v0.6-3868a26, the behavior has changed. The DRAKVUF process ends by itself with an error message:
DRAKVUF v0.6-3868a26
Starting DRAKVUF initialization
drakvuf_event_fd_add fd=14
size of list=1
regenerating event_fds and fd_info_lookup...
new event_fd i=0 for fd=14
new fd_info_lookup i=0 for fd=14
drakvuf_init: adding event_fd done
Init VMI on domID 4 -> devuanhvm
init_vmi: initializing vmi done
Max GPFN: 0xff001
Max mem set? 0
Physmap populated? 0
Altp2m enabled? 0
Altp2m view X created? 0 with ID 1
Altp2m view R created? 0 with ID 2
Switched Altp2m view to X? 0
libdrakvuf initialized
DRAKVUF initializated
Starting plugins
Starting plugin syscalls
Rekall profile: no $FUNCTIONS section found
Rekall profile defines 75360 symbols
Received 75360 symbols
[...]
Physmap populated? 0
Copied trapped page to new location
Activating remapped gfns in the altp2m views!
Trap added @ PA 0x19904440 RPA 0xff07d440 Page 104708 for sys_acct.
Trap added @ PA 0x19a048e0 RPA 0xff0138e0 Page 104964 for sys_access.
Trap added @ PA 0x19ced9a0 RPA 0xff0229a0 Page 105709 for sys_accept4.
Trap added @ PA 0x19ced9b0 RPA 0xff0229b0 Page 105709 for sys_accept.
Starting plugin syscalls finished
Starting plugin poolmon
Starting plugin filetracer
Starting plugin filedelete
Starting plugin objmon
Starting plugin exmon
Starting plugin ssdtmon
Starting plugin debugmon
Starting plugin debugmon finished
Starting plugin cpuidmon
Starting plugin cpuidmon finished
Starting plugin socketmon
Starting plugin regmon
Starting plugin procmon
Beginning DRAKVUF loop
Started DRAKVUF loop
VMI_ERROR: Error, Xen reports a VM_EVENT_INTERFACE_VERSION that doesn't match what we expected (0x00000002)!
Error waiting for events or timeout, quitting...
DRAKVUF loop finished
Finished DRAKVUF loop
starting close_vmi_drakvuf
Removed memtrap for GFN 0xff002 in altp2m view 1
close_vmi_drakvuf finished
Also, instead of showing up as (null) until Drakvuf ends and then simply disappearing, the guest now remains with State ------ and its consoles, SDL and serial, do not respond anymore. It does not react to xl shutdown and needs to be destroyed instead. I tried this a few times and the hex codes remain. Only once, no error was printed out (the Started DRAKVUF loop message was the last one I saw) while the guest also froze.
You also need to update LibVMI
Ok with latest LibVMI from git and Drakvuf recompiled against it, I tried it right away without restarting the guest:
DRAKVUF v0.6-3868a26
Starting DRAKVUF initialization
drakvuf_event_fd_add fd=14
size of list=1
regenerating event_fds and fd_info_lookup...
new event_fd i=0 for fd=14
new fd_info_lookup i=0 for fd=14
drakvuf_init: adding event_fd done
Init VMI on domID 6 -> devuanhvm
init_vmi: initializing vmi done
Max GPFN: 0xff001
Max mem set? 0
Physmap populated? 0
Altp2m enabled? 0
Altp2m view X created? 0 with ID 1
Altp2m view R created? 0 with ID 2
Switched Altp2m view to X? 0
VMI_ERROR: xc_hvm_set_mem_access failed with code: -1
*** FAILED TO SET MEMORY TRAP @ PAGE 1044482 ***
Failed to create guard trap for the empty page!
starting close_vmi_drakvuf
close_vmi_drakvuf finished
libdrakvuf initialization failed
Failed to initialize DRAKVUF
Then I shut the guest down and tried again, and everything seems fine now, with debug,
Post mem cb @ 0x48d9844 vCPU 0 altp2m 0
Pre mem cb with vCPU 0 @ 0x48d98a0 in view 1: r--
Switching to altp2m view 0 on vCPU 0
Post mem cb @ 0x48d98a0 vCPU 0 altp2m 0
Pre mem cb with vCPU 0 @ 0x48d98a0 in view 1: r--
Switching to altp2m view 0 on vCPU 0
Post mem cb @ 0x48d98a0 vCPU 0 altp2m 0
Switching altp2m and to singlestep on vcpu 0
reset trap on vCPU 0, switching altp2m 0->1
and without debug,
[SYSCALL] TIME:1528817872.199726 VCPU:0 CR3:0x1c0ee000,"kworker/0:1" UID:0 linux!sys_imageblit
[SYSCALL] TIME:1528817872.403731 VCPU:0 CR3:0x1c0ee000,"kworker/0:1" UID:0 linux!sys_imageblit
[SYSCALL] TIME:1528817872.607729 VCPU:0 CR3:0x1c0ee000,"kworker/0:1" UID:0 linux!sys_imageblit
[SYSCALL] TIME:1528817872.817862 VCPU:0 CR3:0x1c0ee000,"kworker/0:1" UID:0 linux!sys_imageblit
[SYSCALL] TIME:1528817873.015801 VCPU:0 CR3:0x1c0ee000,"kworker/0:1" UID:0 linux!sys_imageblit
[SYSCALL] TIME:1528817873.031084 VCPU:0 CR3:0x1c0ee000,"init" UID:0 linux!sys_newstat
[SYSCALL] TIME:1528817873.031159 VCPU:0 CR3:0x1c0ee000,"init" UID:0 linux!sys_newfstat
The guest does not crash anymore.
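For anyone who wants to post-process output like the above, here is a minimal parsing sketch. It assumes the [SYSCALL] line layout is exactly what appears in this thread (DRAKVUF v0.6); other versions or output formats may differ, and the function and field names are illustrative.

```python
import re
from collections import Counter

# Pattern derived only from the sample lines in this thread.
SYSCALL_RE = re.compile(
    r'^\[SYSCALL\] TIME:(?P<time>\d+\.\d+) VCPU:(?P<vcpu>\d+) '
    r'CR3:(?P<cr3>0x[0-9a-fA-F]+),"(?P<process>[^"]*)" '
    r'UID:(?P<uid>\d+) (?P<module>\w+)!(?P<syscall>\w+)$'
)


def parse_syscall_line(line: str):
    """Return a dict of typed fields, or None if the line is not a [SYSCALL] record."""
    match = SYSCALL_RE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["time"] = float(fields["time"])
    fields["vcpu"] = int(fields["vcpu"])
    fields["uid"] = int(fields["uid"])
    fields["cr3"] = int(fields["cr3"], 16)
    return fields


def count_syscalls(lines) -> Counter:
    """Count records per syscall name, skipping lines that do not parse."""
    parsed = (parse_syscall_line(line) for line in lines)
    return Counter(rec["syscall"] for rec in parsed if rec is not None)


if __name__ == "__main__":
    sample = ('[SYSCALL] TIME:1528817873.031084 VCPU:0 CR3:0x1c0ee000,'
              '"init" UID:0 linux!sys_newstat')
    print(parse_syscall_line(sample))
```

Debug-mode lines such as "Pre mem cb ..." simply fail to match and are skipped, so the same helper can be run over mixed output.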
When running Drakvuf against an HVM Linux guest, I can see a few kernel traces for a second or two, and then the HVM guest simply crashes.
The guest then appears in the xl list output as:
(null) 11 0 1 --pscd 15.4
until I interrupt Drakvuf, at which point the (null)-named domain finally gets cleaned up.
When looking into the process with strace I see:
Also in /var/log/xen/xenstored-access.log I get quite a few entries that would be too large to copy/paste. To get the idea,
The HVM guest configuration is as follows.
How do I run Drakvuf in debug mode? Any idea why the guest is crashing? I tried with LVM against a loop device and I got the same result. Is LVM mandatory? If so I would have to try against a real PV and not a looped device.
Thank you
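The State column that xl list printed for the crashed domain (--pscd) can be read flag by flag. This is a minimal sketch, assuming the letter meanings documented in the xl manual page (r running, b blocked, p paused, s shutdown, c crashed, d dying); the helper name is hypothetical.

```python
# Letter meanings as described in the xl manual page; '-' means the flag is unset.
XL_STATE_FLAGS = {
    "r": "running",
    "b": "blocked",
    "p": "paused",
    "s": "shutdown",
    "c": "crashed",
    "d": "dying",
}


def decode_xl_state(state: str) -> list:
    """Translate an xl list State string such as '--pscd' into flag names."""
    return [XL_STATE_FLAGS[ch] for ch in state if ch != "-"]


if __name__ == "__main__":
    print(decode_xl_state("--pscd"))
    print(decode_xl_state("------"))
```

Read this way, the (null) domain was paused, shut down, crashed and being torn down, whereas the later State ------ guest had no flag set at all.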