open-power / skiboot

OPAL boot and runtime firmware for POWER
Apache License 2.0
100 stars 134 forks source link

BT communication err cause kernel crash #262

Closed lili-lilili closed 3 years ago

lili-lilili commented 3 years ago

Openpower 9 machine abnormal restart with some error.

We can see some error with BT/IPMI communication. The customer runs some monitoring software under the OS, which will send a large number of ipmi commands.

We have encountered the problem 2 times,but have no idea why the error occurred。

Here is opal log:

[29006078.471575200,3] BT: seq 0xed netfn 0x0a cmd 0x23: Timeout sending message [29006079.426676437,3] BT: seq 0xfb netfn 0x0a cmd 0x23: Invalid resp_len 108 [29006079.448373690,3] BT: seq 0x00 netfn 0x0a cmd 0x23: Invalid resp_len 105 [29006079.453777603,3] IPMI: Incorrect netfn 0x1c in response [29006093.246448202,3] BT: seq 0x59 netfn 0x0a cmd 0x23: Timeout sending message [29006099.153962282,3] BT: seq 0x64 netfn 0x0a cmd 0x23: Invalid resp_len 108 [29006099.158819798,3] BT: seq 0x65 netfn 0x0a cmd 0x23: Invalid resp_len 108 [29006114.163785853,3] BT: seq 0x81 netfn 0x0a cmd 0x23: Timeout sending message [29006114.288029290,3] BT: seq 0x81 netfn 0x0b cmd 0x23: Timeout sending message [29006114.288917798,3] IPMI: Incorrect netfn 0x0b in response [29006114.288921996,7] Spent 244 msecs in OPAL call 9! [29005889.871885] Watchdog CPU:113 Hard LOCKUP [29005889.874221] Watchdog CPU:113 became unstuck [29006129.487324022,3] IPMI: Incorrect netfn 0x0b in response [29006129.506356988,5] RESET: Fast reboot disabled: Kernel re-entered OPAL [29006129.506418037,5] CPU ATTEMPT TO RE-ENTER FIRMWARE! PIR=0805 cpu @0x35c28000 -> pir=0805 token=8 [29006129.509458540,5] CPU ATTEMPT TO RE-ENTER FIRMWARE! PIR=0805 cpu @0x35c28000 -> pir=0805 token=8 [29005896.606025] Disabling lock debugging due to kernel taint [29005896.606030] Severe Machine check interrupt [Not recovered] [2900[29006130.120095831,0] OPAL: Reboot requested due to Platform error.

[29005896.606039] Initiator: CPU [29005896.606041] Error type: Real address [Load/Store (foreign)] [29005896.606044] opal: Hardware platform error: Unrecoverable Machine Check exception [29005896.606050] CPU: 69 PID: 0 Comm: swapper/69 Kdump: loaded Tainted: G M ------------ 4.14.0-49.13.1.el7a.ppc64le #1 [29005896.606052] task: c00020397b4b6a00 task.stack: c000203981168000 [29005896.606053] NIP: 000000003002ed90 LR: 00000000300330e0 CTR: 0000000000000000 [29005896.606055] REGS: c000000007c5fd80 TRAP: 0200 Tainted: G M ------------ (4.14.0-49.13.1.el7a.ppc64le) [29005896.606056] MSR: 9000000000201002 <SF,HV,ME,RI> CR: 44002884 XER: 20040000 [29005896.606061] CFAR: 000000003002ed5c DAR: 0034c30822a161ac DSISR: 00000008 SOFTE: 0 [29005896.606061] GPR00: 00000000300330e0 0000000035c2bc10 000000003010ef00 00000000307f38a0 [29005896.606061] GPR04: 00000000300cbc57 0000000035c2bcc0 000000000000000b 0034c30825a45f65 [29005896.606061] GPR08: 000000003030ef00 0034c30822a161ac 0000000000000027 0000000000000001 [29005896.606061] GPR12: 0000000024002882 0000000035c28000 0000000000000800 0000000000200042 [29005896.606061] GPR16: 00000001ace2079f c000203fff484000 0000000000000000 c000000001066e00 [29005896.606061] GPR20: c0000000014f3b00 0000000000000001 0000000000000002 000000000000003f [29005896.606061] GPR24: c000203974ea59b8 0000000000000001 c000203fff4877e0 00000000303149e0 [29005896.606061] GPR28: 000000003016e7f8 c000003fcd17d218 0000000000000000 00000000307f38a0 [29005896.606084] NIP [000000003002ed90] 0x3002ed90 [29005896.606086] LR [00000000300330e0] 0x300330e0 [29005896.606086] Call Trace: [29005896.606087] Instruction dump: [29005896.606091] XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX [29005896.606094] XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX

lili-lilili commented 3 years ago

I am very surprised why the following two logs will appear:

[29006114.163785853,3] BT: seq 0x81 netfn 0x0a cmd 0x23: Timeout sending message [29006114.288029290,3] BT: seq 0x81 netfn 0x0b cmd 0x23: Timeout sending message [29006114.288917798,3] IPMI: Incorrect netfn 0x0b in response

  1. This 2 msg with a same seq number。
  2. I think the second cmd “netfn 0x0b” does not appear to be sent by the host 。
lili-lilili commented 3 years ago

The problem is recursed easily: During ipmi communication, BMC responds to host at the timeout point

I think this is the root cause: When BT IRQ preempt BT timer, the infight_bt_msg did not protected by bt.lock very well.

fix: https://github.com/lili-lilili/skiboot/commit/6064a5351dd647880d556e5e2e756508a699abb4

hegdevasant commented 3 years ago

Is there a way to reproduce this issue?

Can you attach full msglog and syslog output?

-Vasant

hegdevasant commented 3 years ago

Looking into the code and IPMI spec

Looks like 2 CPUS tried to enter bt_expire_old_msg function. This function is protected by bt.lock.. but in bt_msg_del we do unlock -> ipmi_cmd_done -> lock

[29006114.163785853,3] BT: seq 0x81 netfn 0x0a cmd 0x23: Timeout sending message

First CPU1 printed this and called ipmi_cmd_done .. which updated netfn = 0x0b. At this stage inflight_bt_msg is still valid.

[Note that we sent netfn = 0x0a to BMC and got response as 0x0b and updated message accordingly -- see ipmi_cmd_done).

[29006114.288029290,3] BT: seq 0x81 netfn 0x0b cmd 0x23: Timeout sending message

CPU2 entered bt_expire_old_msg function while CPU1 is executing ipmi_cmd_done.. That means inflight_bt_msg is still valid. It logged above message. Hence we saw two entry with same seq number.

[29006114.288917798,3] IPMI: Incorrect netfn 0x0b in response

This is printed by CPU2.

Your patch looks good to me. I have few minor comments (see my response in mailing list).

-Vasant

lili-lilili commented 3 years ago

fixed: https://github.com/open-power/skiboot/commit/46d7eafbda4006b9b858b49f9df9c63575582a92