refutationalist / saur

Sam's AUR -- personal Arch packages

xen: test machine panics xen kernel on reboot #26

Closed refutationalist closed 2 months ago

refutationalist commented 3 months ago

This is an HP Z840 with two E5-2670v3s. The panic has been seen on reboot both with and without domUs running. It may be an upstream issue.

[175067.278634] reboot: Restarting system
(XEN) Hardware Dom0 shutdown: rebooting machine
(XEN) ----[ Xen-4.18.2-pre-arch  x86_64  debug=n  Not tainted ]----
(XEN) CPU:    0
(XEN) RIP:    e008:[<00000000ca5ec780>] 00000000ca5ec780
(XEN) RFLAGS: 0000000000010246   CONTEXT: hypervisor (d0v0)
(XEN) rax: 00000000000000a0   rbx: 0000000000000004   rcx: 0000000000000050
(XEN) rdx: 0000000000000001   rsi: 0000000000000000   rdi: 0000000000000003
(XEN) rbp: ffff83102fff7ce8   rsp: ffff83102fff7cc0   r8:  0000000000000000
(XEN) r9:  0000000000000000   r10: 0000000000000832   r11: 0000000000000835
(XEN) r12: 0000000000000000   r13: ffff830000000472   r14: 000000000000000a
(XEN) r15: 0000000000000000   cr0: 0000000080050033   cr4: 0000000000172660
(XEN) cr3: 000000102ffd1000   cr2: 00000000ff610000
(XEN) fsb: 00007abc8d6020c0   gsb: ffff888135a00000   gss: 0000000000000000
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008
(XEN) Xen code around <00000000ca5ec780> (00000000ca5ec780):
(XEN)  05 00 00 45 3b c4 75 17 <a1> 00 00 61 ff 00 00 00 00 44 8b c0 41 83 e0 f0
(XEN) Xen stack trace from rsp=ffff83102fff7cc0:
(XEN)    0000000000000000 0000000000000000 0000000000000002 ffff831023900000
(XEN)    0000000000000001 0100005000011c00 0000030000a00001 00a08300038000a0
(XEN)    038000a000000280 ff041c0000a00b02 00a0000002000050 0100005000031c00
(XEN)    0000000000a00000 ffff83102fff7d80 0000000000000065 ffff83102fff7dd8
(XEN)    ffff830000000472 ffff83102fff7d80 000000102ffd1000 ffff82d04029bbbc
(XEN)    ffff82d040332400 0000000000000000 ffff83102fff7dc0 0000000000000000
(XEN)    00000010129d4000 0000000000000000 000000000000000a 0000000000000046
(XEN)    ffff82d0403524d6 ffff82d0403525d5 0000000000000000 0000000000000000
(XEN)    ffff831023266000 ffff82d040351c01 000000002fff7e20 000083102fff7de0
(XEN)    0000000000000000 0000000000000001 ffff831023266000 0000000000000001
(XEN)    ffff8310232661f8 ffffc9004003bc50 0000000000000000 ffff82d04022be53
(XEN)    ffff82d040208027 fffffffffffffff2 000000000000001d ffff831023254000
(XEN)    ffffc9004003bda4 ffff82d04024d66f 0000000140201242 ffff82d040201248
(XEN)    ffff82d040201242 ffff82d040201248 ffff83102fff7ef8 000000000000001d
(XEN)    ffff82d0403176ca ffff82d040201248 ffff82d040201242 ffff82d040201248
(XEN)    ffff82d040201242 ffff82d040201248 ffff82d040201242 ffff82d040201248
(XEN)    ffff831023254000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 ffff82d0402012b7 00000000fee1dead
(XEN)    ffffffff82e50ee0 0000000000000000 ffffc9004003bdc0 0000000028121969
(XEN)    0000000000000004 0000000000000246 ffffc9004003bc50 ffffc9004003bc58
(XEN) Xen call trace:
(XEN)    [<00000000ca5ec780>] R 00000000ca5ec780
(XEN)    [<0000000000000000>] S 0000000000000000
(XEN)    [<ffff82d04029bbbc>] S efi_reset_system+0x4c/0x90
(XEN)    [<ffff82d040332400>] S io_apic.c#clear_IO_APIC_pin+0x10/0x110
(XEN)    [<ffff82d0403524d6>] S __stop_this_cpu+0x16/0x30
(XEN)    [<ffff82d0403525d5>] S smp_send_stop+0xc5/0xe0
(XEN)    [<ffff82d040351c01>] S machine_restart+0x161/0x290
(XEN)    [<ffff82d04022be53>] S hwdom_shutdown+0x53/0xc0
(XEN)    [<ffff82d040208027>] S domain.c#domain_shutdown.part.0+0x47/0x110
(XEN)    [<ffff82d04024d66f>] S do_sched_op+0x38f/0x520
(XEN)    [<ffff82d040201248>] S lstar_enter+0xc8/0x140
(XEN)    [<ffff82d040201242>] S lstar_enter+0xc2/0x140
(XEN)    [<ffff82d040201248>] S lstar_enter+0xc8/0x140
(XEN)    [<ffff82d0403176ca>] S pv_hypercall+0x4ea/0x580
(XEN)    [<ffff82d040201248>] S lstar_enter+0xc8/0x140
(XEN)    [<ffff82d040201242>] S lstar_enter+0xc2/0x140
(XEN)    [<ffff82d040201248>] S lstar_enter+0xc8/0x140
(XEN)    [<ffff82d040201242>] S lstar_enter+0xc2/0x140
(XEN)    [<ffff82d040201248>] S lstar_enter+0xc8/0x140
(XEN)    [<ffff82d040201242>] S lstar_enter+0xc2/0x140
(XEN)    [<ffff82d040201248>] S lstar_enter+0xc8/0x140
(XEN)    [<ffff82d0402012b7>] S lstar_enter+0x137/0x140
(XEN) 
(XEN) Pagetable walk from 00000000ff610000:
(XEN)  L4[0x000] = 000000102ffd0063 ffffffffffffffff
(XEN)  L3[0x003] = 00000000c1a0f063 ffffffffffffffff
(XEN)  L2[0x1fb] = 0000000000000000 ffffffffffffffff
(XEN) 
(XEN) ****************************************
(XEN) Panic on CPU 0:
(XEN) FATAL PAGE FAULT
(XEN) [error_code=0000]
(XEN) Faulting linear address: 00000000ff610000
(XEN) ****************************************
(XEN) 
(XEN) Reboot in five seconds...

refutationalist commented 3 months ago

I'm going to create a really minimal xen to test it with and see if it's something we're doing or not.

refutationalist commented 3 months ago

It's also worth noting that these machines, when booted into a PVH dom0, only see the PCIe lanes of the first CPU. This may be completely unrelated.

refutationalist commented 3 months ago

The situation as of https://github.com/refutationalist/saur/pull/28:

[  140.772805] reboot: Restarting system
(XEN) Hardware Dom0 shutdown: rebooting machine
(XEN) ----[ Xen-4.18.2-arch  x86_64  debug=n  Not tainted ]----
(XEN) CPU:    0
(XEN) RIP:    e008:[<00000000ca5ec780>] 00000000ca5ec780
(XEN) RFLAGS: 0000000000010246   CONTEXT: hypervisor (d0v0)
(XEN) rax: 00000000000000a0   rbx: 0000000000000004   rcx: 0000000000000050
(XEN) rdx: 0000000000000001   rsi: 0000000000000000   rdi: 0000000000000003
(XEN) rbp: ffff83102fff7ce8   rsp: ffff83102fff7cc0   r8:  0000000000000000
(XEN) r9:  0000000000000000   r10: 0000000000000832   r11: 0000000000000835
(XEN) r12: 0000000000000000   r13: ffff830000000472   r14: 000000000000000a
(XEN) r15: 0000000000000000   cr0: 0000000080050033   cr4: 0000000000172660
(XEN) cr3: 000000102ffd1000   cr2: 00000000ff610000
(XEN) fsb: 00007bad0a06d0c0   gsb: ffff888135a00000   gss: 0000000000000000
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008
(XEN) Xen code around <00000000ca5ec780> (00000000ca5ec780):
(XEN)  05 00 00 45 3b c4 75 17 <a1> 00 00 61 ff 00 00 00 00 44 8b c0 41 83 e0 f0
(XEN) Xen stack trace from rsp=ffff83102fff7cc0:
(XEN)    0000000000000000 0000000000000000 0000000000000002 ffff831023900000
(XEN)    0000000000000001 0100005000011c00 0000030000a00001 00a08300038000a0
(XEN)    038000a000000280 ff041c0000a00f02 00a0000002000050 0100005000031c00
(XEN)    0000000000a00000 ffff83102fff7d80 0000000000000065 ffff83102fff7dd8
(XEN)    ffff830000000472 ffff83102fff7d80 000000102ffd1000 ffff82d04029b9fc
(XEN)    ffff82d040332800 0000000000000000 ffff83102fff7dc0 0000000000000000
(XEN)    000000100a494000 0000000000000000 000000000000000a 0000000000000046
(XEN)    ffff82d040352a56 ffff82d040352b55 0000000000000000 0000000000000000
(XEN)    ffff831023266000 ffff82d040352181 000000002fff7e20 000083102fff7de0
(XEN)    0000000000000000 0000000000000001 ffff831023266000 0000000000000001
(XEN)    ffff8310232661f8 ffffc9004003bb78 0000000000000000 ffff82d04022be53
(XEN)    ffff82d040208027 fffffffffffffff2 000000000000001d ffff831023254000
(XEN)    ffffc9004003bccc ffff82d04024d66f 0000000140201247 ffff82d04020124d
(XEN)    ffff82d040201247 ffff82d04020124d ffff83102fff7ef8 000000000000001d
(XEN)    ffff82d040317aba ffff82d04020124d ffff82d040201247 ffff82d04020124d
(XEN)    ffff82d040201247 ffff82d04020124d ffff82d040201247 ffff82d04020124d
(XEN)    ffff831023254000 0000000000000000 0000000000000000 0000000000000000
(XEN)    ffff83102fff7fff 0000000000000000 ffff82d0402012c1 00000000fee1dead
(XEN)    ffffffff82e50ee0 0000000000000000 ffffc9004003bce8 0000000028121969
(XEN)    0000000000000004 0000000000000246 ffffc9004003bb78 ffffc9004003bb80
(XEN) Xen call trace:
(XEN)    [<00000000ca5ec780>] R 00000000ca5ec780
(XEN)    [<0000000000000000>] S 0000000000000000
(XEN)    [<ffff82d04029b9fc>] S efi_reset_system+0x4c/0x90
(XEN)    [<ffff82d040332800>] S io_apic.c#clear_IO_APIC_pin+0/0x110
(XEN)    [<ffff82d040352a56>] S __stop_this_cpu+0x16/0x30
(XEN)    [<ffff82d040352b55>] S smp_send_stop+0xc5/0xe0
(XEN)    [<ffff82d040352181>] S machine_restart+0x161/0x290
(XEN)    [<ffff82d04022be53>] S hwdom_shutdown+0x53/0xc0
(XEN)    [<ffff82d040208027>] S domain.c#domain_shutdown.part.0+0x47/0x110
(XEN)    [<ffff82d04024d66f>] S do_sched_op+0x38f/0x520
(XEN)    [<ffff82d04020124d>] S lstar_enter+0xcd/0x150
(XEN)    [<ffff82d040201247>] S lstar_enter+0xc7/0x150
(XEN)    [<ffff82d04020124d>] S lstar_enter+0xcd/0x150
(XEN)    [<ffff82d040317aba>] S pv_hypercall+0x4ea/0x580
(XEN)    [<ffff82d04020124d>] S lstar_enter+0xcd/0x150
(XEN)    [<ffff82d040201247>] S lstar_enter+0xc7/0x150
(XEN)    [<ffff82d04020124d>] S lstar_enter+0xcd/0x150
(XEN)    [<ffff82d040201247>] S lstar_enter+0xc7/0x150
(XEN)    [<ffff82d04020124d>] S lstar_enter+0xcd/0x150
(XEN)    [<ffff82d040201247>] S lstar_enter+0xc7/0x150
(XEN)    [<ffff82d04020124d>] S lstar_enter+0xcd/0x150
(XEN)    [<ffff82d0402012c1>] S lstar_enter+0x141/0x150
(XEN) 
(XEN) Pagetable walk from 00000000ff610000:
(XEN)  L4[0x000] = 000000102ffd0063 ffffffffffffffff
(XEN)  L3[0x003] = 00000000c1a0f063 ffffffffffffffff
(XEN)  L2[0x1fb] = 0000000000000000 ffffffffffffffff
(XEN) 
(XEN) ****************************************
(XEN) Panic on CPU 0:
(XEN) FATAL PAGE FAULT
(XEN) [error_code=0000]
(XEN) Faulting linear address: 00000000ff610000
(XEN) ****************************************
(XEN) 
(XEN) Reboot in five seconds...
(XEN) Resetting with ACPI MEMORY or I/O RESET_REG.

refutationalist commented 2 months ago

Booting Xen with reboot=acpi fixes it. I should spend some time with the documentation some day.
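
For anyone hitting the same panic: reboot=acpi is an option on the Xen hypervisor command line, not the dom0 kernel command line. The thread does not say which bootloader this machine uses, so the fragment below is only a sketch that assumes GRUB with its stock Xen support; the file path and the regeneration command are the usual Arch defaults, not details taken from this issue.

```sh
# /etc/default/grub (assumption: Xen is booted through GRUB)
# Options for the Xen hypervisor itself. reboot=acpi asks Xen to reset
# the machine via ACPI rather than the EFI ResetSystem runtime call,
# which is where the trace above faults (efi_reset_system).
GRUB_CMDLINE_XEN_DEFAULT="reboot=acpi"

# Afterwards, regenerate the GRUB configuration, for example:
#   grub-mkconfig -o /boot/grub/grub.cfg
```

If Xen is instead started directly as xen.efi, the same option would go on the options= line of its config file. After the next boot, `xl info` should list reboot=acpi under xen_commandline.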