oxidecomputer / hubris

A lightweight, memory-protected, message-passing kernel for deeply embedded systems.

Certain host images can cause thermal shutdown #1213

Open mkeeter opened 1 year ago

mkeeter commented 1 year ago

When @andrewjstone was testing a new host image on Rack 2, we noticed that fans spun up and the system shut down.

Looking at the ringbuf, it looks like the usual sensor reading failure, followed by the thermal loop sending the system to A2:

BRM42220051-switch # humility -dhubris.core.3 ringbuf thermal
humility: attached to dump
humility: ring buffer task_thermal::__RINGBUF in thermal:
 NDX LINE      GEN    COUNT PAYLOAD
  30  586       53        1 MiscReadFailed(SensorId(0x1), I2cError(BusResetMux))
  31  611       53        1 SensorReadFailed(SensorId(0x48), I2cError(NoDevice))
   0  611       54        1 SensorReadFailed(SensorId(0x7), I2cError(BusResetMux))
   1  611       54        1 SensorReadFailed(SensorId(0xc), I2cError(BusResetMux))
   2  611       54        1 SensorReadFailed(SensorId(0x11), I2cError(BusResetMux))
   3  611       54        1 SensorReadFailed(SensorId(0x16), I2cError(BusResetMux))
   4  611       54        1 SensorReadFailed(SensorId(0x1b), I2cError(BusResetMux))
   5  611       54        1 SensorReadFailed(SensorId(0x20), I2cError(BusReset))
   6  611       54        1 SensorReadFailed(SensorId(0x25), I2cError(BusResetMux))
   7  611       54        1 SensorReadFailed(SensorId(0x2a), I2cError(BusResetMux))
   8  611       54        1 SensorReadFailed(SensorId(0x2f), I2cError(BusResetMux))
   9  611       54        1 SensorReadFailed(SensorId(0x34), I2cError(BusResetMux))
  10  884       54        1 ControlPwm(0x64)
  11  586       54        1 MiscReadFailed(SensorId(0x0), I2cError(BusResetMux))
  12  586       54        1 MiscReadFailed(SensorId(0x2), I2cError(BusResetMux))
  13  586       54        1 MiscReadFailed(SensorId(0x1), I2cError(BusResetMux))
  14  611       54        1 SensorReadFailed(SensorId(0x48), I2cError(NoDevice))
  15  611       54        1 SensorReadFailed(SensorId(0x7), I2cError(BusResetMux))
  16  611       54        1 SensorReadFailed(SensorId(0xc), I2cError(BusResetMux))
  17  611       54        1 SensorReadFailed(SensorId(0x11), I2cError(BusResetMux))
  18  611       54        1 SensorReadFailed(SensorId(0x16), I2cError(BusResetMux))
  19  611       54        1 SensorReadFailed(SensorId(0x1b), I2cError(BusResetMux))
  20  611       54        1 SensorReadFailed(SensorId(0x20), I2cError(BusReset))
  21  611       54        1 SensorReadFailed(SensorId(0x25), I2cError(BusResetMux))
  22  611       54        1 SensorReadFailed(SensorId(0x2a), I2cError(BusResetMux))
  23  611       54        1 SensorReadFailed(SensorId(0x2f), I2cError(BusResetMux))
  24  611       54        1 SensorReadFailed(SensorId(0x34), I2cError(BusResetMux))
  25  848       54        1 AutoState(Uncontrollable)
  26  678       54        1 PowerModeChanged(PowerBitmask { bits: 0x1 })
  27  555       54        1 AutoState(Boot)
  28  770       54        1 AutoState(Running)
  29  884       54      189 ControlPwm(0x0)
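
To see at a glance which I2C failure modes dominate a dump like the one above, the entries can be tallied by error kind. This is only a rough sketch: it assumes the `humility ringbuf` output has been saved as text with the column layout shown (NDX LINE GEN COUNT PAYLOAD), and the function name `tally_i2c_errors` is made up for illustration.

```shell
# Tally I2cError kinds in saved `humility ringbuf` output read from stdin.
# Assumption: column 4 is the COUNT column, as in the dump above.
tally_i2c_errors() {
    awk '
        match($0, /I2cError\([A-Za-z]+\)/) {
            # Drop the 9-character "I2cError(" prefix and the trailing ")".
            kind = substr($0, RSTART + 9, RLENGTH - 10)
            counts[kind] += $4
        }
        END {
            for (kind in counts) {
                printf "%s %d\n", kind, counts[kind]
            }
        }
    ' | sort
}
```

Feeding it the saved dump on stdin prints one line per error kind with the summed COUNT, sorted by name.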

However, a few things are odd here. The shutdown happens within ~60 seconds of booting one particular host image, while the sled works fine with stock images. We also see communication issues with SB-TSI (on the 3H bus) in addition to the usual failures on 2F. Finally, one of the RAM power regulators (VDD_MEM_EFGH) reports that it's at 247°C and drawing 115A.

BRM42220051-switch # humility --ip fe80::aa40:25ff:fe04:205%gimlet16 -a build-gimlet-c.zip validate
humility: connecting to fe80::aa40:25ff:fe04:205%gimlet16
ID VALIDATION   C P  MUX ADDR DEVICE        DESCRIPTION
 0 error        2 F  -   0x48 tmp117        Southwest temperature sensor
 1 error        2 F  -   0x49 tmp117        South temperature sensor
 2 error        2 F  -   0x4a tmp117        Southeast temperature sensor
 3 error        2 F  -   0x70 pca9545       U.2 ABCD mux
 4 error        2 F  -   0x71 pca9545       U.2 EFGH mux
 5 error        2 F  -   0x72 pca9545       U.2 IJ/FRUID mux
 6 error        2 F  1:1 0x50 at24csw080    U.2 Sharkfin A VPD
 7 error        2 F  1:1 0x38 max5970       U.2 Sharkfin A hot swap controller
 8 error        2 F  1:1 0x6a nvme_bmc      U.2 A NVMe Basic Management Command
 9 error        2 F  1:2 0x50 at24csw080    U.2 Sharkfin B VPD
10 error        2 F  1:2 0x38 max5970       U.2 Sharkfin B hot swap controller
11 error        2 F  1:2 0x6a nvme_bmc      U.2 B NVMe Basic Management Control
12 error        2 F  1:3 0x50 at24csw080    U.2 Sharkfin C VPD
13 error        2 F  1:3 0x38 max5970       U.2 Sharkfin C hot swap controller
14 error        2 F  1:3 0x6a nvme_bmc      U.2 C NVMe Basic Management Control
15 error        2 F  1:4 0x50 at24csw080    U.2 Sharkfin D VPD
16 error        2 F  1:4 0x38 max5970       U.2 Sharkfin D hot swap controller
17 error        2 F  1:4 0x6a nvme_bmc      U.2 D NVMe Basic Management Control
18 error        2 F  2:1 0x50 at24csw080    U.2 Sharkfin E VPD
19 error        2 F  2:1 0x38 max5970       U.2 Sharkfin E hot swap controller
20 error        2 F  2:1 0x6a nvme_bmc      U.2 E NVMe Basic Management Control
21 error        2 F  2:2 0x50 at24csw080    U.2 Sharkfin F VPD
22 error        2 F  2:2 0x38 max5970       U.2 Sharkfin F hot swap controller
23 error        2 F  2:2 0x6a nvme_bmc      U.2 F NVMe Basic Management Control
24 error        2 F  2:3 0x50 at24csw080    U.2 Sharkfin G VPD
25 error        2 F  2:3 0x38 max5970       U.2 Sharkfin G hot swap controller
26 error        2 F  2:3 0x6a nvme_bmc      U.2 G NVMe Basic Management Control
27 error        2 F  2:4 0x50 at24csw080    U.2 Sharkfin H VPD
28 error        2 F  2:4 0x38 max5970       U.2 Sharkfin H hot swap controller
29 error        2 F  2:4 0x6a nvme_bmc      U.2 H NVMe Basic Management Control
30 error        2 F  3:1 0x50 at24csw080    U.2 Sharkfin I VPD
31 error        2 F  3:1 0x38 max5970       U.2 Sharkfin I hot swap controller
32 error        2 F  3:1 0x6a nvme_bmc      U.2 I NVMe Basic Management Control
33 error        2 F  3:2 0x50 at24csw080    U.2 Sharkfin J VPD
34 error        2 F  3:2 0x38 max5970       U.2 Sharkfin J hot swap controller
35 error        2 F  3:2 0x6a nvme_bmc      U.2 J NVMe Basic Management Control
36 error        2 F  3:4 0x50 at24csw080    Gimlet VPD
37 present      2 B  -   0x73 pca9545       M.2 mux
38 unavailable  2 B  1:1 0x6a m2_hp_only    M.2 A NVMe Basic Management Command
39 unavailable  2 B  1:2 0x6a m2_hp_only    M.2 B NVMe Basic Management Command
40 validated    2 B  1:3 0x50 at24csw080    Fan VPD
41 validated    2 B  1:4 0x4c tmp451        T6 temperature sensor
42 validated    3 H  -   0x24 tps546b24a    A2 3.3V rail
43 validated    3 H  -   0x26 tps546b24a    A0 3.3V rail
44 validated    3 H  -   0x27 tps546b24a    A2 5V rail
45 validated    3 H  -   0x29 tps546b24a    A2 1.8V rail
46 validated    3 H  -   0x3a max5970       M.2 hot plug controller
47 absent       3 H  -   0x3c sbrmi         CPU via SB-RMI
48 absent       3 H  -   0x4c sbtsi         CPU temperature sensor
49 present      3 H  -   0x58 idt8a34003    Clock generator
50 validated    3 H  -   0x5a raa229618     CPU power controller
51 validated    3 H  -   0x5b raa229618     SoC power controller
52 validated    3 H  -   0x5c isl68224      DIMM/SP3 1.8V A0 power controller
53 validated    4 F  -   0x10 adm1272       Fan hot swap controller
54 validated    4 F  -   0x14 adm1272       Sled hot swap controller
55 validated    4 F  -   0x20 max31790      Fan controller
56 validated    4 F  -   0x25 tps546b24a    T6 power controller
57 validated    4 F  -   0x48 tmp117        Northeast temperature sensor
58 validated    4 F  -   0x49 tmp117        North temperature sensor
59 validated    4 F  -   0x4a tmp117        Northwest temperature sensor
60 validated    4 F  -   0x67 bmr491        Intermediate bus converter
61 validated    3 H  -   0x18 tse2004av     DIMM A0
62 validated    3 H  -   0x19 tse2004av     DIMM A1
63 validated    3 H  -   0x1a tse2004av     DIMM B0
64 validated    3 H  -   0x1b tse2004av     DIMM B1
65 validated    3 H  -   0x1c tse2004av     DIMM C0
66 validated    3 H  -   0x1d tse2004av     DIMM C1
67 validated    3 H  -   0x1e tse2004av     DIMM D0
68 validated    3 H  -   0x1f tse2004av     DIMM D1
69 validated    4 F  -   0x18 tse2004av     DIMM E0
70 validated    4 F  -   0x19 tse2004av     DIMM E1
71 validated    4 F  -   0x1a tse2004av     DIMM F0
72 validated    4 F  -   0x1b tse2004av     DIMM F1
73 validated    4 F  -   0x1c tse2004av     DIMM G0
74 validated    4 F  -   0x1d tse2004av     DIMM G1
75 validated    4 F  -   0x1e tse2004av     DIMM H0
76 validated    4 F  -   0x1f tse2004av     DIMM H1
BRM42220051-switch # humility --ip fe80::aa40:25ff:fe04:205%gimlet16 -a build-gimlet-c.zip sensors
humility: connecting to fe80::aa40:25ff:fe04:205%gimlet16
NAME                 KIND                  VALUE UNPWR   ERR MSSNG UNAVL TMOUT
Southwest            temp                  27.92     0     0     0     0     0
South                temp                  28.60     0     0     0     0     0
Southeast            temp                  27.01     0     0     0     0     0
V12_U2A_A0           current                0.55     0     0     0     0     0
V3P3_U2A_A0          current                0.08     0     0     0     0     0
V12_U2A_A0           voltage               12.09     0     0     0     0     0
V3P3_U2A_A0          voltage                3.33     0     0     0     0     0
U2_N0                temp                  32.00     0     0   63+     0     0
V12_U2B_A0           current                0.52     0     0     0     0     0
V3P3_U2B_A0          current                0.00     0     0     0     0     0
V12_U2B_A0           voltage               12.12     0     0     0     0     0
V3P3_U2B_A0          voltage                3.34     0     0     0     0     0
U2_N1                temp                  32.00     0     0   63+     0     0
V12_U2C_A0           current                0.59     0     0     0     0     0
V3P3_U2C_A0          current                0.00     0     0     0     0     0
V12_U2C_A0           voltage               12.08     0     0     0     0     0
V3P3_U2C_A0          voltage                3.33     0     0     0     0     0
U2_N2                temp                  32.00     0     0   63+     0     0
V12_U2D_A0           current                0.46     0     0     0     0     0
V3P3_U2D_A0          current                0.02     0     0     0     0     0
V12_U2D_A0           voltage               12.08     0     0     0     0     0
V3P3_U2D_A0          voltage                3.34     0     0     0     0     0
U2_N3                temp                  32.00     0     0   63+     0     0
V12_U2E_A0           current                0.45     0     0     0     0     0
V3P3_U2E_A0          current                0.01     0     0     0     0     0
V12_U2E_A0           voltage               12.06     0     0     0     0     0
V3P3_U2E_A0          voltage                3.33     0     0     0     0     0
U2_N4                temp                  32.00     0     0   63+     0     0
V12_U2F_A0           current                0.47     0     0     0     0     0
V3P3_U2F_A0          current                0.00     0     0     0     0     0
V12_U2F_A0           voltage               12.06     0     0     0     0     0
V3P3_U2F_A0          voltage                3.34     0     0     0     0     0
U2_N5                temp                  32.00     0     0   63+     0     0
V12_U2G_A0           current                0.49     0     0     0     0     0
V3P3_U2G_A0          current                0.04     0     0     0     0     0
V12_U2G_A0           voltage               12.11     0     0     0     0     0
V3P3_U2G_A0          voltage                3.34     0     0     0     0     0
U2_N6                temp                  32.00     0     0   63+     0     0
V12_U2H_A0           current                0.50     0     0     0     0     0
V3P3_U2H_A0          current                0.01     0     0     0     0     0
V12_U2H_A0           voltage               12.05     0     0     0     0     0
V3P3_U2H_A0          voltage                3.33     0     0     0     0     0
U2_N7                temp                  32.00     0     0   63+     0     0
V12_U2I_A0           current                0.47     0     0     0     0     0
V3P3_U2I_A0          current                0.00     0     0     0     0     0
V12_U2I_A0           voltage               12.08     0     0     0     0     0
V3P3_U2I_A0          voltage                3.34     0     0     0     0     0
U2_N8                temp                  32.00     0     0   63+     0     0
V12_U2J_A0           current                0.51     0     0     0     0     0
V3P3_U2J_A0          current                0.03     0     0     0     0     0
V12_U2J_A0           voltage               12.08     0     0     0     0     0
V3P3_U2J_A0          voltage                3.34     0     0     0     0     0
U2_N9                temp                  32.00     0     0   63+     0     0
M2_A                 temp                  37.00   63+     1     1     0     0
M2_B                 temp                  34.00   63+     1     1     0     0
t6                   temp                  54.31     0     0     0     0     0
V3P3_SP_A2           temp                  37.00     0     0     0     0     0
V3P3_SP_A2           current                0.27     0     0     0     0     0
V3P3_SP_A2           voltage                3.31     0     0     0     0     0
V3P3_SYS_A0          temp                  38.75     0     0     0     0     0
V3P3_SYS_A0          current                1.52     0     0     0     0     0
V3P3_SYS_A0          voltage                3.31     0     0     0     0     0
V5_SYS_A2            temp                  38.50     0     0     0     0     0
V5_SYS_A2            current                1.13     0     0     0     0     0
V5_SYS_A2            voltage                4.99     0     0     0     0     0
V1P8_SYS_A2          temp                  42.25     0     0     0     0     0
V1P8_SYS_A2          current                5.48     0     0     0     0     0
V1P8_SYS_A2          voltage                1.79     0     0     0     0     0
V3P3_M2A_A0HP        current                0.70     0     0     0     0     0
V3P3_M2B_A0HP        current                0.78     0     0     0     0     0
V3P3_M2A_A0HP        voltage                3.33     0     0     0     0     0
V3P3_M2B_A0HP        voltage                3.33     0     0     0     0     0
CPU                  temp                  60.00     0     0     0     0     0
VDD_VCORE            temp                  41.00     0     0     0     0     0
VDD_MEM_ABCD         temp                  42.00     0     0     0     0     0
VDD_VCORE            power                     -     0     0     0     0     0
VDD_MEM_ABCD         power                     -     0     0     0     0     0
VDD_VCORE            current               29.40     0     0     0     0     0
VDD_MEM_ABCD         current               29.10     0     0     0     0     0
VDD_VCORE            voltage                1.18     0     0     0     0     0
VDD_MEM_ABCD         voltage                1.22     0     0     0     0     0
VDDCR_SOC            temp                  50.00     0     0     0     0     0
VDD_MEM_EFGH         temp                 247.00     0     0     0     0     0
VDDCR_SOC            power                     -     0     0     0     0     0
VDD_MEM_EFGH         power                     -     0     0     0     0     0
VDDCR_SOC            current               20.80     0     0     0     0     0
VDD_MEM_EFGH         current              115.70     0     0     0     0     0
VDDCR_SOC            voltage                0.89     0     0     0     0     0
VDD_MEM_EFGH         voltage                1.21     0     0     0     0     0
VPP_ABCD             current                0.40     0     0     0     0     0
VPP_EFGH             current                0.40     0     0     0     0     0
V1P8_SP3             current                1.60     0     0     0     0     0
VPP_ABCD             voltage                2.50     0     0     0     0     0
VPP_EFGH             voltage                2.50     0     0     0     0     0
V1P8_SP3             voltage                1.80     0     0     0     0     0
V54_FAN              temp                  36.88     0     0     0     0     0
V54_FAN              current                0.10     0     0     0     0     0
V54_FAN              voltage               54.48     0     0     0     0     0
V54_HS_OUTPUT        temp                  33.55     0     0     0     0     0
V54_HS_OUTPUT        current                4.72     0     0     0     0     0
V54_HS_OUTPUT        voltage               54.46     0     0     0     0     0
Southeast            speed               1938.00     0     0     0     0     0
Northeast            speed               1840.00     0     0     0     0     0
South                speed               1946.00     0     0     0     0     0
North                speed               1830.00     0     0     0     0     0
Southwest            speed               1958.00     0     0     0     0     0
Northwest            speed               1837.00     0     0     0     0     0
V0P96_NIC_VDD_A0HP   temp                  44.00     0     0     0     0     0
V0P96_NIC_VDD_A0HP   current                6.63     0     0     0     0     0
V0P96_NIC_VDD_A0HP   voltage                0.96     0     0     0     0     0
Northeast            temp                  31.65     0     0     0     0     0
North                temp                  36.28     0     0     0     0     0
Northwest            temp                  30.82     0     0     0     0     0
V12_SYS_A2           temp                  44.00     0     0     0     0     0
V12_SYS_A2           power                     -     0     0     0     0     0
V12_SYS_A2           current               20.25     0     0     0     0     0
V12_SYS_A2           voltage               11.99     0     0     0     0     0
DIMM_A0              temp                  36.50     0     0     0     0     0
DIMM_A1              temp                  36.75     0     0     0     0     0
DIMM_B0              temp                  36.25     0     0     0     0     0
DIMM_B1              temp                  35.75     0     0     0     0     0
DIMM_C0              temp                  35.50     0     0     0     0     0
DIMM_C1              temp                  35.25     0     0     0     0     0
DIMM_D0              temp                  34.75     0     0     0     0     0
DIMM_D1              temp                  35.50     0     0     0     0     0
DIMM_E0              temp                  35.75     0     0     0     0     0
DIMM_E1              temp                  36.25     0     0     0     0     0
DIMM_F0              temp                  35.75     0     0     0     0     0
DIMM_F1              temp                  35.00     0     0     0     0     0
DIMM_G0              temp                  34.75     0     0     0     0     0
DIMM_G1              temp                  34.50     0     0     0     0     0
DIMM_H0              temp                  34.50     0     0     0     0     0
DIMM_H1              temp                  35.25     0     0     0     0     0
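
A reading like the 247°C on VDD_MEM_EFGH is easy to miss in a listing this long. A minimal sketch for flagging implausible temperatures in saved `humility sensors` output follows. The function name `flag_hot_sensors` and the default limit of 125 are assumptions picked for illustration, not values taken from Hubris or from any datasheet.

```shell
# Print NAME and VALUE for every "temp" row above a limit.
# Usage: flag_hot_sensors [LIMIT] < sensors.txt
# Assumption: the default LIMIT of 125 is an arbitrary illustrative value.
flag_hot_sensors() {
    limit="${1:-125}"
    awk -v limit="$limit" '
        # Column 2 is KIND and column 3 is VALUE; "-" means no reading.
        $2 == "temp" && $3 != "-" && ($3 + 0) > (limit + 0) {
            printf "%s %s\n", $1, $3
        }
    '
}
```

Against the listing above, only the VDD_MEM_EFGH temperature row exceeds 125. The current reading would need a separate, per-rail limit.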

None of the failing buses think they have a mux segment selected, which is unusual:

BRM42220051-switch # humility --ip fe80::aa40:25ff:fe04:205%gimlet16 -a build-gimlet-c.zip hiffy -c Validate.selected_mux_segment -aindex=15
humility: connecting to fe80::aa40:25ff:fe04:205%gimlet16
Validate.selected_mux_segment() => None
BRM42220051-switch # humility --ip fe80::aa40:25ff:fe04:205%gimlet16 -a build-gimlet-c.zip hiffy -c Validate.selected_mux_segment -aindex=48
humility: connecting to fe80::aa40:25ff:fe04:205%gimlet16
Validate.selected_mux_segment() => None
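
The two queries above use indices 15 and 48, which match the ID column of the `validate` listing. To repeat the check for every failing device, the commands can be generated from saved `validate` output. This is a sketch under assumptions: the helper name `failing_mux_queries` is hypothetical, treating `error` and `absent` as "failing" is my own choice, and the printed commands leave out the `--ip` and `-a` arguments, which have to be added for the target in question.

```shell
# Print one hiffy query per device whose VALIDATION column is "error" or
# "absent" in saved `humility validate` output read from stdin.
# The commands are only printed, not executed.
failing_mux_queries() {
    awk '
        # Column 1 is the numeric device ID; column 2 is the status.
        $1 ~ /^[0-9]+$/ && ($2 == "error" || $2 == "absent") {
            printf "humility hiffy -c Validate.selected_mux_segment -aindex=%s\n", $1
        }
    '
}
```

Printing rather than running the commands keeps the sketch side-effect free, and the list can be reviewed before anything touches the SP.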


mkeeter commented 1 year ago

There's nothing in LAST_HOST_PANIC or LAST_HOST_BOOT_FAIL, and the host thinks that it made it all the way to console login:

BRM42220051-switch # pilot sp console 16
attaching to console; to detach, press: Ctrl-A, Ctrl-X...
Mar 16 19:14:35.259 INFO creating SP handle on interface gimlet16, component: faux-mgs
Mar 16 19:14:35.260 INFO initial discovery complete, addr: [fe80::aa40:25ff:fe04:205%38]:11111, interface: gimlet16, component: faux-mgs

Oxide Pico Host Boot Loader
Config {
    cons:   Uart(fedc9000),
    loader: 0x7f509000..0x7fff0000
    pageroot: P4KA(0x7ff6f000),
}
Decompressing cpio archive to 0x77509000..0x7f509000...Done.
jumping to kernel entry at 0xfffffffffbc27730
Configured AGPIO139: 50300 (input is high)

-----------> Sending IPCC command 0x8, attempt 1/10
Received empty frame
Additional data length: 0x10

-----------> Sending IPCC command 0x3, attempt 1/10
Received empty frame
Additional data length: 0x1

-----------> Sending IPCC command 0x4, attempt 1/10
Received empty frame
Additional data length: 0x6a
Loading kmdb...
NOTICE: Socket 0 SMU Version: 45.63.0
NOTICE: Socket 0 DXIO Version: 45.679
NOTICE: Socket 0 SMU features 0x0690fbfd enabled
cpu0: microcode has been updated from version 0x0 to 0xa0011ce
Oxide Helios Version stlouis-0-g65c574a774 64-bit (onu)
WARNING: Socket 0 SM 0x0->0xf
WARNING: XXX skipping a ton of mapped stuff
NOTICE: Finished writing PCIe straps.
WARNING: Socket 0 SM 0xf->0x5
WARNING: XXX skipping a ton of configured stuff
WARNING: Socket 0 SM 0x5->0x8
WARNING: let's go deasserting: 1, 1
WARNING: Socket 0 SM 0x8->0xd
WARNING: we're out of here
NOTICE: DXIO devices successfully trained?
NOTICE: mapped entry 0 to port fffffffffbe26c60
NOTICE: mapped entry 1 to port fffffffffbe27c20
NOTICE: mapped entry 2 to port fffffffffbe27618
NOTICE: mapped entry 3 to port fffffffffbe264b0
NOTICE: mapped entry 4 to port fffffffffbe264e0
NOTICE: mapped entry 5 to port fffffffffbe26510
NOTICE: mapped entry 6 to port fffffffffbe26540
NOTICE: mapped entry 7 to port fffffffffbe27df8
NOTICE: mapped entry 8 to port fffffffffbe27dc8
NOTICE: mapped entry 9 to port fffffffffbe27d98
NOTICE: mapped entry 10 to port fffffffffbe27470
NOTICE: mapped entry 11 to port fffffffffbe27440
NOTICE: mapped entry 12 to port fffffffffbe27410
NOTICE: mapped entry 13 to port fffffffffbe26688
in oxide_boot! oxb=fffffcf93071e380
    cpio wants: bfd6a11989a1142944c4191b52b64ebe988a183a981ff8abae182f6c2e96a600
attaching stuff...
FCH peripheral: dwu@0, dwu0
FCH peripheral: dwu@1, dwu1
FCH peripheral: dwu@2, dwu2
FCH peripheral: dwu@3, dwu3
TRYING: boot disk (slot 18, slice 0)
NVMe boot devices:
    blkdev0 (slot 17)
    blkdev8 (slot 6)
    blkdev7 (slot 5)
    blkdev3 (slot 4)
    blkdev9 (slot 9)
    blkdev10 (slot 8)
    blkdev2 (slot 7)
    /pci@38,0/pci1022,1483@3,3/pci1344,3100@0/blkdev@w00A075013280BCB0,0:a (slot 18!)

found M.2 device (slot 18, slice 0), @ /pci@38,0/pci1022,1483@3,3/pci1344,3100@0/blkdev@w00A075013280BCB0,0:a
opening M.2 device
    in image: bfd6a11989a1142944c4191b52b64ebe988a183a981ff8abae182f6c2e96a600
opening ramdisk control device
creating ramdisk of size 4294967296
opening ramdisk device: /devices/pseudo/ramdisk@1024:rpool
closing M.2
ramdisk data size = 838860800
checksum ok!
strplumb: failed to initialize drv/ip
Configuring devices.
WARNING: ext_ip_hack disabled: traffic will be encapsulated
Hostname: BRM42220067
Dec 28 00:00:07 BRM42220067 zpool[639]: SMF initialization problem: entity not found

BRM42220067 console login: Dec 28 00:00:07 BRM42220067 last message repeated 27 times
Dec 28 00:00:32 BRM42220067 fch: FCH peripheral: dwu@1, dwu1
Dec 28 00:00:32 BRM42220067 fch: FCH peripheral: dwu@2, dwu2
Dec 28 00:00:32 BRM42220067 fch: FCH peripheral: dwu@3, dwu3
Dec 28 00:00:32 BRM42220067 xde: WARNING: ext_ip_hack disabled: traffic will be encapsulated
Dec 28 00:00:34 BRM42220067 genunix: WARNING: (pcieb16): failed to attach driver for a device (pci1de,fff9-1) under the Connection pcie16
Dec 28 00:00:34 BRM42220067 last message repeated 3 times

andrewjstone commented 1 year ago

I built the cursed host image using the build-host-image.sh script from https://github.com/oxidecomputer/omicron/pull/2557 following the instructions here, except I built using helios instead of helios-engvm.

I cloned helios and checked it out on master, which matched the helios version shown in omicron/tools/helios_version:

commit 49d501d2f37060e29a84a50e9026860315975794 (HEAD -> master, origin/master, origin/HEAD)
Author: Sean Klein <sean@oxide.computer>
Date:   Wed Mar 8 13:47:39 2023 -0500

    image: increase size of default image for omicron

Following the instructions, I generated zone images for omicron:

$ ./.github/buildomat/jobs/package.sh

I then built the standard host images to install on sled 16 (scrimlet) in rack 2 according to the instructions:

./tools/build-host-image.sh -B $HELIOS_PATH /work/global-zone-packages.tar.gz

I copied the rom and zfs.img files to jeeves and installed them on sled 16 using a slightly modified version of the script at /data/local/rack2/install_os.sh. The modifications remove the check that prevented installing on sled 16, and comment out a check for a bootloop service that didn't exist.

The script is pasted below:

#!/bin/bash

set -o errexit
set -o pipefail

function usage {
    printf 'Usage: %s CUBBY IMAGE\n' "$0"
    printf '\n'
    printf '\t\tCUBBY\t\tcubby number (0-31)\n'
    printf '\t\tIMAGE\t\tpath to directory with zfs.img and rom file\n'
    printf '\n'
}

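while getopts 'h' c; do
    case "$c" in
    h)
        # getopts stores the option letter without the leading dash.
        usage
        exit 0
        ;;
    \?)
        # getopts sets c to a literal '?' for an unrecognised option.
        usage >&2
        exit 2
        ;;
    esac
done
# Drop parsed options so that $1 and $2 are CUBBY and IMAGE.
shift $(( OPTIND - 1 ))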

if (( $# != 2 )); then
    usage >&2
    printf 'ERROR: provide cubby number and image directory\n' >&2
    exit 2
fi

cubby=$(( $1 + 0 ))
if [[ $cubby != $1 ]] || (( cubby < 0 || cubby > 31 )); then
    usage >&2
    printf 'ERROR: not a valid cubby?\n' >&2
    exit 2
fi
if (( cubby == 14 )); then
    usage >&2
    printf 'ERROR: that would be a scrimlet cubby.\n' >&2
    exit 2
fi

#loopfmri="$(printf 'svc:/site/oxide/bootloop:c%02d' "$cubby")"
#if ! sta=$(svcs -Ho sta "$loopfmri") || [[ "$sta" != DIS ]]; then
#   svcs -xv "$loopfmri"
#   printf '\nERROR: is %s disabled?\n' "$loopfmri" >&2
#   exit 2
#fi

image=$2
if [[ -z $image ]] ||  [[ ! -f $image/zfs.img ]] || [[ ! -f $image/rom ]]; then
    usage >&2
    printf 'ERROR: image directory "%s" invalid?\n' "$image" >&2
    exit 2
fi

set -o xtrace

function find_a_switch {
    while :; do
        #
        # XXX we have just added two additional environments to
        # jeeves and pilot currently does not have a facility
        # for discriminating, so to avoid mishaps we are hard-coding
        # the switch from rack2 we currently want to use:
        #
        fas_list=( $( (pilot tp ls -Ho nodename |
            grep BRM42220051-switch) || true) )
        if (( ${#fas_list[@]} < 1 )); then
            sleep 1
            continue
        fi

        #
        # Use the first one we see:
        #
        fas_sw="${fas_list[0]}"

        #
        # Deploy the pilot binary we are using in the switch zone, to
        # make sure it supports everything we need:
        #
        if ! pilot techport copy to \
            -i '/usr/bin/pilot' -o /tmp/pilot "$fas_sw"; then
            sleep 1
            continue
        fi

        printf '%s\n' "$fas_sw"
        return 0
    done
}

function cubby_to_host {
    host=$( (pilot techport exec -c \
        '/tmp/pilot sp ls -o cubby,serial |
        awk "\$1 == '$1' { print \$NF }"' \
        "$sw" || true) | awk '{ print $NF }' || true )

    if [[ -n $host ]]; then
        if [[ $host == '-' ]]; then
            printf 'ERROR: no host in cubby %s?\n' "$1" >&2
            return 1
        fi

        printf '%s\n' "$host"
        return 0
    else
        printf 'ERROR: could not look for hosts in cubby %s\n' "$1" >&2
        return 1
    fi
}

#
# Wait for at least one switch to come up in case it has not yet:
#
sw=$(find_a_switch)

#
# Map the cubby number to a specific serial:
#
if h=$(cubby_to_host "$cubby"); then
    printf 'cubby %s -> host %s\n' "$cubby" "$h"
else
    exit 1
fi

reset=no
while :; do
    #
    # First, make sure we can see the Gimlet.
    #
    if ! pilot techport exec -c '/tmp/pilot host ls -Ho serial' "$sw" |
        grep -q "$h"; then
        #
        # Gimlet does not appear visible.
        #
        if [[ $reset == yes ]]; then
            #
            # But we have already rebooted it, so just wait.
            #
            printf 'waiting for gimlet %s...\n' "$h"
            sleep 5
            continue
        fi

        printf 'rebooting gimlet %s using BSU 0...\n' "$h"
        pilot sp off "$h"
        pilot sp rom slot -s 0 "$h"
        pilot sp startup -s -k "$h"
        pilot sp on "$h"
        reset=yes
        sleep 5
        continue
    fi

    #
    # Now that the Gimlet is available, copy the image over and write it
    # to BSU 1.
    #
    # Use our pid in the path to try and avoid conflicts with concurrent
    # updates.
    #
    rempath="/tmp/zfs.$$.img"
    pilot techport copy to -i "$image/zfs.img" -o "$rempath" "$sw"
    pilot techport exec -c \
        "/tmp/pilot host copy to -i $rempath -o /tmp/zfs.img $h" \
        "$sw"
    #
    # Try not to accumulate too much detritus:
    #
    pilot techport exec -c "rm -f $rempath" "$sw"

    #
    # Update BSU 1 on the target Gimlet:
    #
    pilot techport exec -c \
        "/tmp/pilot host exec -c 'pilot bsu update 1 /tmp/zfs.img' $h" \
        "$sw"

    #
    # Update the ROM and reboot:
    #
    pilot sp off "$h"
    pilot sp rom update -s 1 -f "$image/rom" "$h"
    pilot sp startup -s -k "$h"
    pilot sp on "$h"
    break
done

andrewjstone commented 1 year ago

The omicron commit I used was

commit 65bc4f7bcd97ae55d6abf987041d997c348dfbd1 (HEAD -> main, origin/main, origin/HEAD)
Author: John Gallagher <john@oxidecomputer.com>
Date:   Thu Mar 16 00:00:28 2023 -0400

    Refactor host OS CI scripts to allow running them locally (#2557)

    This creates a new `./tools/build-host-image.sh` script which is
    extracted from the existing CI jobs to build host and trampoline images;
    those CI jobs now call this script (after doing some buildomat-specific
    setup).

andrewjstone commented 1 year ago

The cursed host image that was built and installed in boot slot 1 on sled 16 resides here: /net/catacomb/data/staff/core/hubris-1213/cursed-host-image.tar.gz

citrus-it commented 1 year ago

Since it was to hand, I first put the cursed image bits onto B/06 in the lab, which is running a hubris image from Feb 21st. There were no apparent problems at all: the server booted up, the sensor readings that humility shows are all within range, and there is nothing in the thermal task ring buffer apart from failures to talk to a few devices that are not present in this sled.

I then updated to hubris master, in case that was a factor, and there was no difference:

 NDX LINE      GEN    COUNT PAYLOAD
  24  586       64        1 MiscReadFailed(SensorId(0x6e), I2cError(NoDevice))
  25  586       64        1 MiscReadFailed(SensorId(0x6f), I2cError(NoDevice))
  26  586       64        1 MiscReadFailed(SensorId(0x1), I2cError(NoDevice))
  27  884       64        1 ControlPwm(0x0)
  28  586       64        1 MiscReadFailed(SensorId(0x0), I2cError(NoDevice))
  29  586       64        1 MiscReadFailed(SensorId(0x2), I2cError(NoDevice))
  30  586       64        1 MiscReadFailed(SensorId(0x70), I2cError(NoDevice))
  31  586       64        1 MiscReadFailed(SensorId(0x6e), I2cError(NoDevice))
   0  586       65        1 MiscReadFailed(SensorId(0x6f), I2cError(NoDevice))
   1  586       65        1 MiscReadFailed(SensorId(0x1), I2cError(NoDevice))
   2  884       65        1 ControlPwm(0x0)
   3  586       65        1 MiscReadFailed(SensorId(0x0), I2cError(NoDevice))
   4  586       65        1 MiscReadFailed(SensorId(0x2), I2cError(NoDevice))
   5  586       65        1 MiscReadFailed(SensorId(0x70), I2cError(NoDevice))
   6  586       65        1 MiscReadFailed(SensorId(0x6e), I2cError(NoDevice))
   7  586       65        1 MiscReadFailed(SensorId(0x6f), I2cError(NoDevice))
   8  586       65        1 MiscReadFailed(SensorId(0x1), I2cError(NoDevice))
   9  884       65        1 ControlPwm(0x0)
  10  586       65        1 MiscReadFailed(SensorId(0x0), I2cError(NoDevice))
  11  586       65        1 MiscReadFailed(SensorId(0x2), I2cError(NoDevice))
  12  586       65        1 MiscReadFailed(SensorId(0x70), I2cError(NoDevice))
  13  586       65        1 MiscReadFailed(SensorId(0x6e), I2cError(NoDevice))
  14  586       65        1 MiscReadFailed(SensorId(0x6f), I2cError(NoDevice))
  15  586       65        1 MiscReadFailed(SensorId(0x1), I2cError(NoDevice))
  16  884       65        1 ControlPwm(0x0)
  17  586       65        1 MiscReadFailed(SensorId(0x0), I2cError(NoDevice))
  18  586       65        1 MiscReadFailed(SensorId(0x2), I2cError(NoDevice))
  19  586       65        1 MiscReadFailed(SensorId(0x70), I2cError(NoDevice))
  20  586       65        1 MiscReadFailed(SensorId(0x6e), I2cError(NoDevice))
  21  586       65        1 MiscReadFailed(SensorId(0x6f), I2cError(NoDevice))
  22  586       65        1 MiscReadFailed(SensorId(0x1), I2cError(NoDevice))
  23  884       65        1 ControlPwm(0x0)

If there is a problem with this image (and it certainly behaved differently from another image on BRM42220067), it may not manifest on a Rev.B Gimlet. It's more likely that whatever is wrong with BRM42220067 (see https://github.com/oxidecomputer/hardware-gimlet/issues/1895) is triggering the thermal shutdown. I do not yet know why it happens with this OS image and not with another; there should be no difference there.
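
When comparing thermal ring buffers from two sleds like this, it can help to reduce each dump to the set of event kinds it contains, so that entries such as `AutoState` or `PowerModeChanged` stand out when they appear on one machine only. A minimal sketch, assuming each dump has been saved as text with the same column layout as above. The function name `payload_kinds` and the file names in the usage note are hypothetical.

```shell
# Reduce saved `humility ringbuf` output (stdin) to its distinct event names.
payload_kinds() {
    awk '
        # Data rows start with a numeric NDX; the payload begins at column 5.
        $1 ~ /^[0-9]+$/ {
            # Keep only the event name that precedes the first "(".
            sub(/\(.*/, "", $5)
            print $5
        }
    ' | sort -u
}
```

With bash process substitution, `comm -13 <(payload_kinds < lab-sled.txt) <(payload_kinds < rack2-sled.txt)` would then list the event kinds seen only in the second dump.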