Open mkeeter opened 1 year ago
There's nothing in `LAST_HOST_PANIC` or `LAST_HOST_BOOT_FAIL`, and the host thinks that it made it all the way to console login:
BRM42220051-switch # pilot sp console 16
attaching to console; to detach, press: Ctrl-A, Ctrl-X...
Mar 16 19:14:35.259 INFO creating SP handle on interface gimlet16, component: faux-mgs
Mar 16 19:14:35.260 INFO initial discovery complete, addr: [fe80::aa40:25ff:fe04:205%38]:11111, interface: gimlet16, component: faux-mgs
Oxide Pico Host Boot Loader
Config {
cons: Uart(fedc9000),
loader: 0x7f509000..0x7fff0000
pageroot: P4KA(0x7ff6f000),
}
Decompressing cpio archive to 0x77509000..0x7f509000...Done.
jumping to kernel entry at 0xfffffffffbc27730
Configured AGPIO139: 50300 (input is high)
-----------> Sending IPCC command 0x8, attempt 1/10
Received empty frame
Additional data length: 0x10
-----------> Sending IPCC command 0x3, attempt 1/10
Received empty frame
Additional data length: 0x1
-----------> Sending IPCC command 0x4, attempt 1/10
Received empty frame
Additional data length: 0x6a
Loading kmdb...
NOTICE: Socket 0 SMU Version: 45.63.0
NOTICE: Socket 0 DXIO Version: 45.679
NOTICE: Socket 0 SMU features 0x0690fbfd enabled
cpu0: microcode has been updated from version 0x0 to 0xa0011ce
Oxide Helios Version stlouis-0-g65c574a774 64-bit (onu)
WARNING: Socket 0 SM 0x0->0xf
WARNING: XXX skipping a ton of mapped stuff
NOTICE: Finished writing PCIe straps.
WARNING: Socket 0 SM 0xf->0x5
WARNING: XXX skipping a ton of configured stuff
WARNING: Socket 0 SM 0x5->0x8
WARNING: let's go deasserting: 1, 1
WARNING: Socket 0 SM 0x8->0xd
WARNING: we're out of here
NOTICE: DXIO devices successfully trained?
NOTICE: mapped entry 0 to port fffffffffbe26c60
NOTICE: mapped entry 1 to port fffffffffbe27c20
NOTICE: mapped entry 2 to port fffffffffbe27618
NOTICE: mapped entry 3 to port fffffffffbe264b0
NOTICE: mapped entry 4 to port fffffffffbe264e0
NOTICE: mapped entry 5 to port fffffffffbe26510
NOTICE: mapped entry 6 to port fffffffffbe26540
NOTICE: mapped entry 7 to port fffffffffbe27df8
NOTICE: mapped entry 8 to port fffffffffbe27dc8
NOTICE: mapped entry 9 to port fffffffffbe27d98
NOTICE: mapped entry 10 to port fffffffffbe27470
NOTICE: mapped entry 11 to port fffffffffbe27440
NOTICE: mapped entry 12 to port fffffffffbe27410
NOTICE: mapped entry 13 to port fffffffffbe26688
in oxide_boot! oxb=fffffcf93071e380
cpio wants: bfd6a11989a1142944c4191b52b64ebe988a183a981ff8abae182f6c2e96a600
attaching stuff...
FCH peripheral: dwu@0, dwu0
FCH peripheral: dwu@1, dwu1
FCH peripheral: dwu@2, dwu2
FCH peripheral: dwu@3, dwu3
TRYING: boot disk (slot 18, slice 0)
NVMe boot devices:
blkdev0 (slot 17)
blkdev8 (slot 6)
blkdev7 (slot 5)
blkdev3 (slot 4)
blkdev9 (slot 9)
blkdev10 (slot 8)
blkdev2 (slot 7)
/pci@38,0/pci1022,1483@3,3/pci1344,3100@0/blkdev@w00A075013280BCB0,0:a (slot 18!)
found M.2 device (slot 18, slice 0), @ /pci@38,0/pci1022,1483@3,3/pci1344,3100@0/blkdev@w00A075013280BCB0,0:a
opening M.2 device
in image: bfd6a11989a1142944c4191b52b64ebe988a183a981ff8abae182f6c2e96a600
opening ramdisk control device
creating ramdisk of size 4294967296
opening ramdisk device: /devices/pseudo/ramdisk@1024:rpool
closing M.2
ramdisk data size = 838860800
checksum ok!
strplumb: failed to initialize drv/ip
Configuring devices.
WARNING: ext_ip_hack disabled: traffic will be encapsulated
Hostname: BRM42220067
Dec 28 00:00:07 BRM42220067 zpool[639]: SMF initialization problem: entity not found
BRM42220067 console login: Dec 28 00:00:07 BRM42220067 last message repeated 27 times
Dec 28 00:00:32 BRM42220067 fch: FCH peripheral: dwu@1, dwu1
Dec 28 00:00:32 BRM42220067 fch: FCH peripheral: dwu@2, dwu2
Dec 28 00:00:32 BRM42220067 fch: FCH peripheral: dwu@3, dwu3
Dec 28 00:00:32 BRM42220067 xde: WARNING: ext_ip_hack disabled: traffic will be encapsulated
Dec 28 00:00:34 BRM42220067 genunix: WARNING: (pcieb16): failed to attach driver for a device (pci1de,fff9-1) under the Connection pcie16
Dec 28 00:00:34 BRM42220067 last message repeated 3 times
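The console log above prints the checksum the loader expects (`cpio wants:`) and the one found on the M.2 (`in image:`). As a minimal sketch, assuming the console output has been saved to a file and piped in on stdin, the two values can be compared mechanically. The function name `check_image_hash` is hypothetical and not part of any Oxide tooling:

```bash
# Compare the "cpio wants:" and "in image:" checksums from a saved
# console log read on stdin. Hypothetical helper, for illustration only.
check_image_hash() {
    awk '
        /^cpio wants: / { want = $3 }
        /^in image: /   { have = $3 }
        END {
            # Either line missing: the log is incomplete.
            if (want == "" || have == "") { print "missing"; exit 2 }
            if (want == have) { print "match " want; exit 0 }
            print "mismatch want=" want " have=" have
            exit 1
        }
    '
}
```

For the log in this issue, both lines carry the same value, which agrees with the `checksum ok!` message.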
I built the cursed host image using the `build-host-image.sh` script from https://github.com/oxidecomputer/omicron/pull/2557, following the instructions here, except that I built using `helios` instead of `helios-engvm`.
I cloned helios and checked it out on master, which matched the helios version shown in `omicron/tools/helios_version`:
commit 49d501d2f37060e29a84a50e9026860315975794 (HEAD -> master, origin/master, origin/HEAD)
Author: Sean Klein <sean@oxide.computer>
Date: Wed Mar 8 13:47:39 2023 -0500
image: increase size of default image for omicron
Following the instructions I generated zone images for omicron:
$ ./.github/buildomat/jobs/package.sh
I then built the standard host images to install on sled 16 (scrimlet) in rack 2 according to the instructions:
./tools/build-host-image.sh -B $HELIOS_PATH /work/global-zone-packages.tar.gz
I copied the `rom` and `zfs.img` files to jeeves and installed them on sled 16 using a slightly modified version of the script in `/data/local/rack2/install_os.sh`. The modifications remove the check that blocked installing on sled 16 and comment out the check for a bootloop service that didn't exist.
The script is pasted below:
#!/bin/bash
set -o errexit
set -o pipefail
function usage {
printf 'Usage: %s CUBBY IMAGE\n' "$0"
printf '\n'
printf '\t\tCUBBY\t\tcubby number (0-31)\n'
printf '\t\tIMAGE\t\tpath to directory with zfs.img and rom file\n'
printf '\n'
}
while getopts 'h' c; do
case "$c" in
h)
usage
exit 0
;;
?)
usage >&2
exit 2
;;
esac
done
if (( $# != 2 )); then
usage >&2
printf 'ERROR: provide cubby number and image directory\n' >&2
exit 2
fi
cubby=$(( $1 + 0 ))
if [[ $cubby != $1 ]] || (( cubby < 0 || cubby > 31 )); then
usage >&2
printf 'ERROR: not a valid cubby?\n' >&2
exit 2
fi
if (( cubby == 14 )); then
usage >&2
printf 'ERROR: that would be a scrimlet cubby.\n' >&2
exit 2
fi
#loopfmri="$(printf 'svc:/site/oxide/bootloop:c%02d' "$cubby")"
#if ! sta=$(svcs -Ho sta "$loopfmri") || [[ "$sta" != DIS ]]; then
# svcs -xv "$loopfmri"
# printf '\nERROR: is %s disabled?\n' "$loopfmri" >&2
# exit 2
#fi
image=$2
if [[ -z $image ]] || [[ ! -f $image/zfs.img ]] || [[ ! -f $image/rom ]]; then
usage >&2
printf 'ERROR: image directory "%s" invalid?\n' "$image" >&2
exit 2
fi
set -o xtrace
function find_a_switch {
while :; do
#
# XXX we have just added two additional environments to
# jeeves and pilot currently does not have a facility
# for discriminating, so to avoid mishaps we are hard-coding
# the switch from rack2 we currently want to use:
#
fas_list=( $( (pilot tp ls -Ho nodename |
grep BRM42220051-switch) || true) )
if (( ${#fas_list[@]} < 1 )); then
sleep 1
continue
fi
#
# Use the first one we see:
#
fas_sw="${fas_list[0]}"
#
# Deploy the pilot binary we are using in the switch zone, to
# make sure it supports everything we need:
#
if ! pilot techport copy to \
-i '/usr/bin/pilot' -o /tmp/pilot "$fas_sw"; then
sleep 1
continue
fi
printf '%s\n' "$fas_sw"
return 0
done
}
function cubby_to_host {
host=$( (pilot techport exec -c \
'/tmp/pilot sp ls -o cubby,serial |
awk "\$1 == '$1' { print \$NF }"' \
"$sw" || true) | awk '{ print $NF }' || true )
if [[ -n $host ]]; then
if [[ $host == '-' ]]; then
printf 'ERROR: no host in cubby %s?\n' "$1" >&2
return 1
fi
printf '%s\n' "$host"
return 0
else
printf 'ERROR: could not look for hosts in cubby %s\n' "$1" >&2
return 1
fi
}
#
# Wait for at least one switch to come up in case it has not yet:
#
sw=$(find_a_switch)
#
# Map the cubby number to a specific serial:
#
if h=$(cubby_to_host "$cubby"); then
printf 'cubby %s -> host %s\n' "$cubby" "$h"
else
exit 1
fi
reset=no
while :; do
#
# First, make sure we can see the Gimlet.
#
if ! pilot techport exec -c '/tmp/pilot host ls -Ho serial' "$sw" |
grep -q "$h"; then
#
# Gimlet does not appear visible.
#
if [[ $reset == yes ]]; then
#
# But we have already rebooted it, so just wait.
#
printf 'waiting for gimlet %s...\n' "$h"
sleep 5
continue
fi
printf 'rebooting gimlet %s using BSU 0...\n' "$h"
pilot sp off "$h"
pilot sp rom slot -s 0 "$h"
pilot sp startup -s -k "$h"
pilot sp on "$h"
reset=yes
sleep 5
continue
fi
#
# Now that the Gimlet is available, copy the image over and write it
# to BSU 1.
#
# Use our pid in the path to try and avoid conflicts with concurrent
# updates.
#
rempath="/tmp/zfs.$$.img"
pilot techport copy to -i "$image/zfs.img" -o "$rempath" "$sw"
pilot techport exec -c \
"/tmp/pilot host copy to -i $rempath -o /tmp/zfs.img $h" \
"$sw"
#
# Try not to accumulate too much detritus:
#
pilot techport exec -c "rm -f $rempath" "$sw"
#
# Update BSU 1 on the target Gimlet:
#
pilot techport exec -c \
"/tmp/pilot host exec -c 'pilot bsu update 1 /tmp/zfs.img' $h" \
"$sw"
#
# Update the ROM and reboot:
#
pilot sp off "$h"
pilot sp rom update -s 1 -f "$image/rom" "$h"
pilot sp startup -s -k "$h"
pilot sp on "$h"
break
done
The omicron commit I used was
commit 65bc4f7bcd97ae55d6abf987041d997c348dfbd1 (HEAD -> main, origin/main, origin/HEAD)
Author: John Gallagher <john@oxidecomputer.com>
Date: Thu Mar 16 00:00:28 2023 -0400
Refactor host OS CI scripts to allow running them locally (#2557)
This creates a new `./tools/build-host-image.sh` script which is
extracted from the existing CI jobs to build host and trampoline images;
those CI jobs now call this script (after doing some buildomat-specific
setup).
The cursed host image that was built and installed in boot slot 1 on sled 16 resides here: /net/catacomb/data/staff/core/hubris-1213/cursed-host-image.tar.gz
Since it was to hand, I first put the cursed image bits onto B/06 in the lab, which is using a hubris image from Feb 21st. There were no apparent problems at all: the server booted up, the sensor readings that humility shows are all within range, and there is nothing in the thermal task ringbuffer apart from failures to talk to a few devices that are not present in this sled.
I then updated to hubris master, in case that was a factor, and there was no difference:
NDX LINE GEN COUNT PAYLOAD
24 586 64 1 MiscReadFailed(SensorId(0x6e), I2cError(NoDevice))
25 586 64 1 MiscReadFailed(SensorId(0x6f), I2cError(NoDevice))
26 586 64 1 MiscReadFailed(SensorId(0x1), I2cError(NoDevice))
27 884 64 1 ControlPwm(0x0)
28 586 64 1 MiscReadFailed(SensorId(0x0), I2cError(NoDevice))
29 586 64 1 MiscReadFailed(SensorId(0x2), I2cError(NoDevice))
30 586 64 1 MiscReadFailed(SensorId(0x70), I2cError(NoDevice))
31 586 64 1 MiscReadFailed(SensorId(0x6e), I2cError(NoDevice))
0 586 65 1 MiscReadFailed(SensorId(0x6f), I2cError(NoDevice))
1 586 65 1 MiscReadFailed(SensorId(0x1), I2cError(NoDevice))
2 884 65 1 ControlPwm(0x0)
3 586 65 1 MiscReadFailed(SensorId(0x0), I2cError(NoDevice))
4 586 65 1 MiscReadFailed(SensorId(0x2), I2cError(NoDevice))
5 586 65 1 MiscReadFailed(SensorId(0x70), I2cError(NoDevice))
6 586 65 1 MiscReadFailed(SensorId(0x6e), I2cError(NoDevice))
7 586 65 1 MiscReadFailed(SensorId(0x6f), I2cError(NoDevice))
8 586 65 1 MiscReadFailed(SensorId(0x1), I2cError(NoDevice))
9 884 65 1 ControlPwm(0x0)
10 586 65 1 MiscReadFailed(SensorId(0x0), I2cError(NoDevice))
11 586 65 1 MiscReadFailed(SensorId(0x2), I2cError(NoDevice))
12 586 65 1 MiscReadFailed(SensorId(0x70), I2cError(NoDevice))
13 586 65 1 MiscReadFailed(SensorId(0x6e), I2cError(NoDevice))
14 586 65 1 MiscReadFailed(SensorId(0x6f), I2cError(NoDevice))
15 586 65 1 MiscReadFailed(SensorId(0x1), I2cError(NoDevice))
16 884 65 1 ControlPwm(0x0)
17 586 65 1 MiscReadFailed(SensorId(0x0), I2cError(NoDevice))
18 586 65 1 MiscReadFailed(SensorId(0x2), I2cError(NoDevice))
19 586 65 1 MiscReadFailed(SensorId(0x70), I2cError(NoDevice))
20 586 65 1 MiscReadFailed(SensorId(0x6e), I2cError(NoDevice))
21 586 65 1 MiscReadFailed(SensorId(0x6f), I2cError(NoDevice))
22 586 65 1 MiscReadFailed(SensorId(0x1), I2cError(NoDevice))
23 884 65 1 ControlPwm(0x0)
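To check at a glance that a dump like the one above contains only the expected `NoDevice` failures and PWM updates, the entries can be tallied by payload. This is a minimal sketch, assuming the dump is saved in the `NDX LINE GEN COUNT PAYLOAD` layout shown and fed in on stdin. The function name `summarize_ringbuf` is hypothetical:

```bash
# Tally ring buffer entries by payload, most frequent first.
# Hypothetical helper; expects the "NDX LINE GEN COUNT PAYLOAD" layout.
summarize_ringbuf() {
    awk '
        # Skip the header row and any line too short to hold a payload.
        $1 == "NDX" || NF < 5 { next }
        {
            # Everything from field 5 onward is the payload text.
            payload = $5
            for (i = 6; i <= NF; i++) payload = payload " " $i
            # Weight by the COUNT column so collapsed repeats are included.
            seen[payload] += $4
        }
        END {
            for (p in seen) printf "%d %s\n", seen[p], p
        }
    ' | sort -k1,1nr -k2
}
```

Any payload other than `MiscReadFailed(..., I2cError(NoDevice))` or `ControlPwm(...)` in the summary would be worth a closer look.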
If there is a problem with this image, and it certainly behaved differently to another on `BRM42220067`, it may not manifest on a Rev.B Gimlet. It's more likely that whatever is wrong with BRM42220067 (see https://github.com/oxidecomputer/hardware-gimlet/issues/1895) is triggering the thermal shutdown, but I do not yet know why it happens with this OS image and not with another; there should be no difference there.
When @andrewjstone was testing a new host image on Rack 2, we noticed that fans spun up and the system shut down.
Looking at the ringbuf, it looks like the usual sensor reading failure, followed by the thermal loop sending the system to A2:
However, there's some extra weirdness in there. The shutdown happens within ~60 seconds of booting a particular host image, while stock images work fine. We also see communication issues with SB-TSI (3H bus) in addition to the usual 2F, which is unusual. Finally, one of the RAM power regulators (`VDD_MEM_EFGH`) thinks that it's at 247°C and drawing 115A.
None of the failing buses think they have a mux segment selected, which is unusual:
Relevant files: