oxidecomputer / omicron

Omicron: Oxide control plane
Mozilla Public License 2.0

Wicket is stuck at "Downloading installinator" after reaching 100% #7127

Open askfongjojo opened 23 hours ago

askfongjojo commented 23 hours ago

This issue has occurred several times recently. The observations below are from one recent occurrence:

The update got stuck on 1 of the 3 sleds in a racklet:

[sled 14 00:37:38]  Progress (13/19) Downloading installinator, waiting for it to start:
 100.00% (135092408/135092408 bytes) after 2175.14s
[        00:37:39]    Status 2/3: 1 running, 2 completed
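
A stall like the one above (all bytes transferred, yet the step is still waiting) can be spotted mechanically in captured update output. The following is only a rough sketch: the regular expression is inferred from the single sample shown here, not from any documented wicket output format, and both `find_stalled_downloads` and its `min_seconds` threshold are hypothetical names chosen for illustration.

```python
import re

# Assumption: pattern derived only from the sample progress entry in this
# report; real wicket output may vary between versions.
PROGRESS_RE = re.compile(
    r"\[sled (?P<sled>\d+) [\d:]+\]\s+"
    r"Progress \((?P<step>\d+)/(?P<total>\d+)\) (?P<desc>.+?):\s+"
    r"(?P<pct>[\d.]+)% \((?P<done>\d+)/(?P<size>\d+) bytes\) "
    r"after (?P<secs>[\d.]+)s",
    re.DOTALL,  # the percentage may wrap onto the following line
)


def find_stalled_downloads(log_text, min_seconds=600.0):
    """Return progress entries whose byte count is complete but which are
    still being reported as in progress after at least `min_seconds`.

    `min_seconds` is an arbitrary illustrative threshold, not a wicket setting.
    """
    stalled = []
    for m in PROGRESS_RE.finditer(log_text):
        done, size = int(m["done"]), int(m["size"])
        elapsed = float(m["secs"])
        if done == size and elapsed >= min_seconds:
            stalled.append(
                {
                    "sled": int(m["sled"]),
                    "step": int(m["step"]),
                    "total_steps": int(m["total"]),
                    "description": m["desc"],
                    "bytes": size,
                    "elapsed_s": elapsed,
                }
            )
    return stalled
```

Run against the output above, this would flag sled 14 at step 13 of 19. A hit only says the download finished without installinator reporting in; the host console is still needed to find out why.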

Sled 14 is stuck in a crash/reboot loop:

Oxide Pico Host Boot Loader
Config {
    cons:   Uart(fedc9000),
    loader: 0x7f510000..0x7fff0000
    pageroot: P4KA(0x7f511000),
}
Decompressing cpio archive to 0x77510000..0x7f510000...Done.
jumping into kernel...
Oxide board Gimlet -- GN-B1
Socket 0 SMU Version: 45.93.0
Socket 0 DXIO Version: 45.682
Socket 0 SMU features 0x0690fbfd enabled
cpu0: microcode has been updated from version 0x0 to 0xa0011d5
Oxide Helios Version helios-2.0.23031 64-bit

-----------> Sending IPCC command 0xe, attempt 1/10
Additional data length: 0x3
Received empty frame
Additional data length: 0x1
NOTICE: Starting Oxide boot
NOTICE:     Phase 1 wants: 4169531839785691f02ef8a52ad08c17d557f9bf5f133216e14df73afc98c14b
NOTICE: TRYING: boot sp
WARNING: first block too small for disk header, got 0x0

-----------> Sending IPCC command 0x6, attempt 1/10
Additional data length: 0x30
Received empty frame
Additional data length: 0x0
WARNING: Could not find a valid phase2 image on sp

-----------> Sending IPCC command 0x6, attempt 1/10
Additional data length: 0x2b
Received empty frame
Additional data length: 0x0

panic[cpu0]/thread=fffffffffc0aaaa0: Could not find a valid phase2 image on sp

Warning - stack not written to the dump buffer
fffffffffc0b0920 boot_image:oxide_boot_locate+18a ()
fffffffffc0b0950 genunix:boot_image_locate+a1 ()
fffffffffc0b0990 genunix:main+137 ()
fffffffffc0b09a0 unix:_locore_start+88 ()

skipping system dump - no dump device configured
rebooting...
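
To confirm that a sled is looping on the same failure rather than hitting different ones, the panic lines in a captured console log can be tallied. This is a minimal sketch that assumes the `panic[cpuN]/thread=...: message` line shape seen in the trace above; `summarize_panics` is a hypothetical helper, not part of any Oxide tooling.

```python
import re
from collections import Counter

# Assumption: matches the panic line format shown in the console output above.
PANIC_RE = re.compile(
    r"^panic\[cpu(?P<cpu>\d+)\]/thread=(?P<thread>[0-9a-f]+): (?P<msg>.*)$",
    re.MULTILINE,
)


def summarize_panics(console_text):
    """Count how many times each distinct panic message appears in the log."""
    return Counter(m["msg"].strip() for m in PANIC_RE.finditer(console_text))
```

If the capture spans several boot cycles, a single message with a count above one (here, "Could not find a valid phase2 image on sp") is consistent with a crash/reboot loop on the same cause.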