Open pridhiviraj opened 6 years ago
16.23569|================================================
16.23993|Error reported by prdf (0xE500) PLID 0x9000003C
16.23993| PRD Signature : 0x7000E 0xDD3F000E
16.24269| Signature Description : pu.core:k0:n0:s0:p00:c14 (COREFIR[14]) Machine check and ME = 0 Err
16.24429| UserData1 : 0x0007000e00000101
16.24430| UserData2 : 0xdd3f000e00000000
16.24430|------------------------------------------------
This means we were hit by two consecutive MCEs within the very small window in which the MCE handler was still handling the first one. A system checkstop (xstop) is expected when this happens.
We should try the following upstream commit in the petitboot kernel:
https://git.kernel.org/powerpc/c/75ecfb49516c53da00c57b9efe48fa
@maheshsal We have observed that the above patch is already in the petitboot kernel.
This is not the right place for petitboot kernel bugs.
I think you should file them here: https://github.com/open-power/op-build/issues
@mpe As @mikey suggested, this looks like an upstream kernel issue with nr_cpus (which has been less tested so far), even though it was reproduced with the petitboot kernel. That's why I opened the issue here.
OK that's not very clear from the bug :)
yeah I figured this was likely an upstream bug rather than petitboot specific.
With non-SMT nr_cpus I see this with petitboot kernel:
[ 2.826369] Faulting instruction address: 0xc000000000022148
cpu 0x0: Vector: 300 (Data Access) at [c00000000111f7d0]
    pc: c000000000022148: lwarx_loop_stop+0x0/0x24
    lr: c00000000004e1d4: power9_idle_type+0x4c/0x70
    sp: c00000000111fa50
   msr: 9000000000001033
   dar: 0
 dsisr: 80000
  current = 0xc0000000010e9e80
  paca    = 0xc00000000fff0000   softe: 3   irq_happened: 0x01
    pid   = 0, comm = swapper/0
Linux version 4.16.7-openpower1 (jenkins@jenkins-vm) (gcc version 6.4.0 (Buildroot 2018.02.1-00006-ga8d1126)) #1 SMP Tue May 8 17:43:36 UTC 2018
enter ? for help
[c00000000111fd40] c00000000004e1d4 power9_idle_type+0x4c/0x70
[c00000000111fd80] c000000000524ee8 stop_loop+0x38/0x48
[c00000000111fdb0] c000000000523058 cpuidle_enter_state+0x14c/0x20c
[c00000000111fe00] c0000000000acf40 call_cpuidle+0x6c/0x74
[c00000000111fe20] c0000000000ad1e4 do_idle+0x1ec/0x200
[c00000000111fea0] c0000000000ad384 cpu_startup_entry+0x30/0x34
[c00000000111fed0] c00000000000cf8c rest_init+0xd8/0xe4
[c00000000111ff00] c000000001003d00 start_kernel+0x4fc/0x504
[c00000000111ff90] c00000000000ac70 start_here_common+0x1c/0x4ac
0:mon>
also on upstream kernel:
[ 2.101374] Unable to handle kernel paging request for data at address 0x00000000
[ 2.101413] Unable to handle kernel paging request for data at address 0x00000000
[ 2.101436] Faulting instruction address: 0xc00000000003442c
[ 2.101487] Faulting instruction address: 0xc00000000003442c
[ 2.101489] Oops: Kernel access of bad area, sig: 7 [#1]
[ 2.101605] LE SMP NR_CPUS=2048 NUMA PowerNV
[ 2.101652] Modules linked in:
[ 2.101690] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.17.0-rc5mahesh #3
[ 2.101779] NIP: c00000000003442c LR: c0000000000a41f0 CTR: c0000000000343e8
[ 2.101858] REGS: c0000000016b77a0 TRAP: 0300 Not tainted (4.17.0-rc5mahesh)
[ 2.101909] MSR: 9000000000001033 <SF,HV,ME,IR,DR,RI,LE> CR: 44000284 XER: 20040000
[ 2.101966] CFAR: c00000000003441c DAR: 0000000000000000 DSISR: 00080000 SOFTE: 1
[ 2.101966] GPR00: 0000000000000000 c0000000016b7a20 c0000000016b8b00 0000000000000005
[ 2.101966] GPR04: 0000000000000004 c0000000015356d0 0000000000000000 0000000000000000
[ 2.101966] GPR08: 0000000000000000 0000000000000000 c0000000016b4000 c000007fdee175a0
[ 2.101966] GPR12: c000000000ad2930 c000000001a30000 0000000000000000 0000000000000000
[ 2.101966] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 2.101966] GPR20: 0000000000000000 0000000000000001 0000000010004d9c 00000000100053ed
[ 2.101966] GPR24: 0000000000000008 0000000000000008 0000000000000000 0000000000000008
[ 2.101966] GPR28: c0000000015ece30 00000000003003ff 0000000000300375 0000000000300375
[ 2.102464] NIP [c00000000003442c] lwarx_loop_stop+0x0/0x24
[ 2.102500] LR [c0000000000a41f0] __power9_idle_type+0x80/0xb0
[ 2.102543] Call Trace:
[ 2.102563] [c0000000016b7a20] [c0000000016b7a70] init_stack+0x3a70/0x4000 (unreliable)
[ 2.102616] [c0000000016b7d10] [c0000000000a41f0] __power9_idle_type+0x80/0xb0
[ 2.102669] [c0000000016b7d60] [c0000000000a4800] power9_idle_type+0x20/0x40
[ 2.102777] [c0000000016b7d80] [c000000000ad2970] stop_loop+0x40/0x5c
[ 2.102839] [c0000000016b7db0] [c000000000aced34] cpuidle_enter_state+0xa4/0x400
[ 2.102947] [c0000000016b7e10] [c00000000014961c] call_cpuidle+0x4c/0x90
[ 2.103046] [c0000000016b7e30] [c000000000149c4c] do_idle+0x32c/0x3d0
[ 2.103136] [c0000000016b7ea0] [c000000000149f2c] cpu_startup_entry+0x3c/0x50
[ 2.103243] [c0000000016b7ed0] [c00000000000df90] rest_init+0xe0/0x100
[ 2.103324] [c0000000016b7f00] [c000000001084330] start_kernel+0x614/0x634
[ 2.103414] [c0000000016b7f90] [c00000000000ac7c] start_here_common+0x1c/0x4a0
[ 2.103519] Instruction dump:
[ 2.103574] f86d09b8 39800000 4800038c 60000000 60000000 e8a28080 e8850000 7c232000
[ 2.103666] 40800008 4c0002e4 88ed09a9 e9cd09a0 <7de07028> 75e91000 40c2fe35 7def3878
Looks like we are hitting a bug in the idle code when nr_cpus is a non-SMT value, i.e. not a multiple of the number of threads per core. In this case lwarx_loop_stop dereferences a NULL paca->core_idle_state_ptr. This is because, with such an nr_cpus value, pnv_alloc_idle_core_states() does not allocate core_idle_state for the last (partially populated) core.
.Lhandle_deep_stop:
/*
 * Entering deep idle state.
 * Clear thread bit in PACA_CORE_IDLE_STATE, save SPRs to
 * stack and enter stop
 */
	lbz	r7,PACA_THREAD_MASK(r13)
	ld	r14,PACA_CORE_IDLE_STATE_PTR(r13)
lwarx_loop_stop:
	lwarx	r15,0,r14
	andis.	r9,r15,PNV_CORE_IDLE_LOCK_BIT@h
	bnel-	core_idle_lock_held
	andc	r15,r15,r7	/* Clear thread bit */
	stwcx.	r15,0,r14
	bne-	lwarx_loop_stop
	isync
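For readers less familiar with the lwarx/stwcx. pattern, here is a rough C11 sketch of what the loop above does to the shared per-core word. It is a userspace illustration only, not the kernel's code: `clear_thread_bit` is a hypothetical name, the value of `CORE_IDLE_LOCK_BIT` is an assumption, and the NULL check is exactly what the assembly does not have, which is why a NULL core_idle_state_ptr faults on the first lwarx.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed value, for illustration only; the kernel uses PNV_CORE_IDLE_LOCK_BIT. */
#define CORE_IDLE_LOCK_BIT 0x10000000u

/*
 * Hypothetical helper: atomically clear this thread's bit in the shared
 * per-core idle state word. Returns 0 on success, -1 if the pointer was
 * never set up (the case that crashes in the assembly above).
 */
static int clear_thread_bit(_Atomic uint32_t *core_idle_state,
                            uint32_t thread_mask)
{
    uint32_t old;

    if (core_idle_state == NULL)
        return -1;

    old = atomic_load(core_idle_state);
    for (;;) {
        if (old & CORE_IDLE_LOCK_BIT) {
            /* Another thread holds the core lock: re-read and retry. */
            old = atomic_load(core_idle_state);
            continue;
        }
        /* On failure 'old' is refreshed with the current value. */
        if (atomic_compare_exchange_weak(core_idle_state, &old,
                                         old & ~thread_mask))
            return 0;
    }
}
```

With an SMT4 core whose state word starts at 0xF, clearing the bit for thread 2 (mask 1 << 2) leaves 0xB.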
E.g. if nr_cpus=3, then nr_cores is set to 0 and the code never enters the for loop. As a result, when a CPU enters a deep idle state, the code at lwarx_loop_stop dereferences the unset pointer paca->core_idle_state_ptr and crashes.
static inline int cpu_nr_cores(void)
{
	return nr_cpu_ids >> threads_shift;
}

static void pnv_alloc_idle_core_states(void)
{
	int i, j;
	int nr_cores = cpu_nr_cores();
	u32 *core_idle_state;
	[...]
	for (i = 0; i < nr_cores; i++) {
		int first_cpu = i * threads_per_core;
		int node = cpu_to_node(first_cpu);
		size_t paca_ptr_array_size;

		core_idle_state = kmalloc_node(sizeof(u32), GFP_KERNEL, node);
		*core_idle_state = (1 << threads_per_core) - 1;
		paca_ptr_array_size = (threads_per_core *
				       sizeof(struct paca_struct *));

		for (j = 0; j < threads_per_core; j++) {
			int cpu = first_cpu + j;

			paca_ptrs[cpu]->core_idle_state_ptr = core_idle_state;
			paca_ptrs[cpu]->thread_idle_state = PNV_THREAD_RUNNING;
			paca_ptrs[cpu]->thread_mask = 1 << j;
			if (!cpu_has_feature(CPU_FTR_POWER9_DD1))
				continue;
			paca_ptrs[cpu]->thread_sibling_pacas =
				kmalloc_node(paca_ptr_array_size,
					     GFP_KERNEL, node);
		}
	}
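To make the arithmetic concrete, here is a small userspace sketch of the truncating shift in cpu_nr_cores() and of how many CPUs the allocation loop leaves without a core_idle_state pointer. The macros and function names are hypothetical stand-ins for the kernel globals, and SMT4 (threads_shift = 2) is assumed.

```c
/* Hypothetical stand-ins for threads_shift / threads_per_core; SMT4 assumed. */
#define THREADS_SHIFT    2
#define THREADS_PER_CORE (1 << THREADS_SHIFT)

/* Same arithmetic as cpu_nr_cores(): a right shift that truncates. */
static int nr_cores_truncated(int nr_cpu_ids)
{
    return nr_cpu_ids >> THREADS_SHIFT;
}

/*
 * CPUs that the allocation loop never visits, so their
 * core_idle_state_ptr keeps its initial (NULL) value.
 */
static int cpus_left_uncovered(int nr_cpu_ids)
{
    return nr_cpu_ids - nr_cores_truncated(nr_cpu_ids) * THREADS_PER_CORE;
}

/* Initial per-core word: one bit set per thread, as in the loop above. */
static unsigned int initial_core_idle_state(void)
{
    return (1u << THREADS_PER_CORE) - 1;
}
```

Under these assumptions nr_cpus=3 gives 0 cores with 3 CPUs uncovered, nr_cpus=6 gives 1 core with 2 CPUs uncovered, and nr_cpus=4 or 8 leaves none uncovered.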
The PowerNV idle code was rewritten since then and it doesn't seem to have this problem, although the smt4 catch/release workaround might access sibling pacas without checking they were allocated. Generally this is a bad configuration for PowerNV because the other SMT threads will be left spinning in skiboot and they can't go to 'stop' state because skiboot doesn't own the 0x100 vector, so the core will be forced into SMT modes even when idle.
Things like the smt4 catch/release workaround wouldn't work either, for the same reason.
We probably should just disallow nr_cpus % nr_threads != 0 for PowerNV for this reason. Actually nr_cpus= in any case is a bit risky because we could get MCEs or HMIs or SRESETs on those CPUs, and they'll be spinning burning more power than they need to. If the PowerNV platform implemented nr_cpus= by starting all secondaries but parking the missing ones in an offline state, it would be more realistic to allow the option.
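A minimal sketch of what such a check could look like, written as plain C rather than as a kernel patch. Both function names are hypothetical; the thread only proposes disallowing the configuration, so the round-up helper is just one possible way to handle a rejected value, not something anyone here has proposed.

```c
#include <stdbool.h>

/*
 * Hypothetical check: true when nr_cpus describes only whole cores,
 * i.e. nr_cpus % threads_per_core == 0.
 */
static bool nr_cpus_is_whole_cores(unsigned int nr_cpus,
                                   unsigned int threads_per_core)
{
    if (nr_cpus == 0 || threads_per_core == 0)
        return false;
    return (nr_cpus % threads_per_core) == 0;
}

/*
 * Hypothetical alternative to rejecting the value outright: round
 * nr_cpus up to the next multiple of threads_per_core.
 * threads_per_core must be non-zero.
 */
static unsigned int round_up_to_whole_cores(unsigned int nr_cpus,
                                            unsigned int threads_per_core)
{
    return ((nr_cpus + threads_per_core - 1) / threads_per_core) *
           threads_per_core;
}
```

With SMT4, nr_cpus=3 fails the check and would round up to 4, while nr_cpus=4 and nr_cpus=8 pass unchanged.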
cpu_nr_cores() is probably not a very well implemented API either. That at least should be taken out of cputhreads.h and moved into kvm internals.
As per @maheshsal's suggestion, I tried the same boot test with nr_cpus=4. We are hitting a different checkstop, due to a core IF Logic Recovery ERR.