oxidecomputer / omicron

Omicron: Oxide control plane
Mozilla Public License 2.0

MUPdate failure: Downloading installinator appears stuck #3412

Open smklein opened 1 year ago

smklein commented 1 year ago

On 6/23, I attempted to MUPdate the dogfood rack.

Sled 21 showed "Downloading installinator", with the following expanded message:

image

It's unclear to me why it is stuck here - it looks like it should be complete?

jgallagher commented 1 year ago

When I hopped on the console, instead of a login prompt I was greeted with

Enter user name for system maintenance (control-d to bypass):

I left it in this state for someone else to poke at.

rcgoodfellow commented 1 year ago

Found services in this state:

root@:/var/adm# svcs -xv
svc:/network/physical:default (physical network interfaces)
 State: offline since Sun Dec 28 00:00:04 1986
Reason: Start method is running.
   See: http://illumos.org/msg/SMF-8000-C4
   See: man -M /usr/share/man -s 8 ifconfig
   See: /var/svc/log/network-physical:default.log
Impact: 26 dependent services are not running:
        svc:/milestone/network:default
        svc:/network/initial:default
        svc:/network/service:default
        svc:/network/dns/client:default
        svc:/milestone/name-services:default
        svc:/milestone/multi-user:default
        svc:/system/boot-config:default
        svc:/milestone/multi-user-server:default
        svc:/system/system-log:default
        svc:/system/cron:default
        svc:/network/netmask:default
        svc:/milestone/single-user:default
        svc:/milestone/sysconfig:default
        svc:/system/utmp:default
        svc:/system/console-login:default
        svc:/system/sac:default
        svc:/system/filesystem/local:default
        svc:/system/dumpadm:default
        svc:/system/hotplug:default
        svc:/system/t6init:default
        svc:/oxide/installinator:default
        svc:/system/boot-archive-update:default
        svc:/network/routing-setup:default
        svc:/system/identity:node
        svc:/system/picl:default
        svc:/system/identity:domain

svc:/network/physical:nwam (physical network interface autoconfiguration)
 State: disabled since Sun Dec 28 00:00:03 1986
Reason: Disabled by an administrator.
   See: http://illumos.org/msg/SMF-8000-05
   See: man -M /usr/share/man -s 8 nwamd
   See: http://hub.opensolaris.org/bin/view/Project+nwam/
Impact: 23 dependent services are not running:
        svc:/milestone/network:default
        svc:/network/initial:default
        svc:/network/service:default
        svc:/network/dns/client:default
        svc:/milestone/name-services:default
        svc:/milestone/multi-user:default
        svc:/system/boot-config:default
        svc:/milestone/multi-user-server:default
        svc:/system/system-log:default
        svc:/system/cron:default
        svc:/network/netmask:default
        svc:/milestone/single-user:default
        svc:/milestone/sysconfig:default
        svc:/system/utmp:default
        svc:/system/console-login:default
        svc:/system/sac:default
        svc:/system/filesystem/local:default
        svc:/system/dumpadm:default
        svc:/system/hotplug:default
        svc:/system/t6init:default
        svc:/oxide/installinator:default
        svc:/system/boot-archive-update:default
        svc:/network/routing-setup:default

svc:/system/rbac:default (Assemble the RBAC *attr files.)
 State: offline since Sun Dec 28 00:00:05 1986
Reason: Start method is running.
   See: http://illumos.org/msg/SMF-8000-C4
   See: /var/svc/log/system-rbac:default.log
Impact: 20 dependent services are not running:
        svc:/system/manifest-import:default
        svc:/system/boot-config:default
        svc:/milestone/single-user:default
        svc:/milestone/multi-user:default
        svc:/milestone/multi-user-server:default
        svc:/milestone/sysconfig:default
        svc:/system/system-log:default
        svc:/system/utmp:default
        svc:/system/console-login:default
        svc:/system/sac:default
        svc:/system/filesystem/local:default
        svc:/system/cron:default
        svc:/system/dumpadm:default
        svc:/system/hotplug:default
        svc:/system/t6init:default
        svc:/oxide/installinator:default
        svc:/system/boot-archive-update:default
        svc:/network/routing-setup:default
        svc:/system/coreadm:default
        svc:/system/name-service-cache:default

svc:/site/recovery/hostname:default (recovery hostname)
 State: offline since Sun Dec 28 00:00:05 1986
Reason: Start method is running.
   See: http://illumos.org/msg/SMF-8000-C4
   See: /var/svc/log/site-recovery-hostname:default.log
Impact: 19 dependent services are not running:
        svc:/system/identity:node
        svc:/milestone/single-user:default
        svc:/milestone/multi-user:default
        svc:/system/boot-config:default
        svc:/milestone/multi-user-server:default
        svc:/milestone/sysconfig:default
        svc:/system/system-log:default
        svc:/system/utmp:default
        svc:/system/console-login:default
        svc:/system/sac:default
        svc:/system/filesystem/local:default
        svc:/system/cron:default
        svc:/system/dumpadm:default
        svc:/system/hotplug:default
        svc:/system/t6init:default
        svc:/oxide/installinator:default
        svc:/system/boot-archive-update:default
        svc:/system/picl:default
        svc:/system/identity:domain

svc:/system/sysevent:default (system event notification)
 State: offline since Sun Dec 28 00:00:05 1986
Reason: Start method is running.
   See: http://illumos.org/msg/SMF-8000-C4
   See: man -M /usr/share/man -s 8 syseventd
   See: /var/svc/log/system-sysevent:default.log
Impact: 17 dependent services are not running:
        svc:/milestone/single-user:default
        svc:/milestone/multi-user:default
        svc:/system/boot-config:default
        svc:/milestone/multi-user-server:default
        svc:/milestone/sysconfig:default
        svc:/system/system-log:default
        svc:/system/utmp:default
        svc:/system/console-login:default
        svc:/system/sac:default
        svc:/system/filesystem/local:default
        svc:/system/cron:default
        svc:/system/dumpadm:default
        svc:/system/hotplug:default
        svc:/system/t6init:default
        svc:/oxide/installinator:default
        svc:/system/boot-archive-update:default
        svc:/system/picl:default

svc:/system/logadm-upgrade:default (logadm upgrade)
 State: offline since Sun Dec 28 00:00:05 1986
Reason: Start method is running.
   See: http://illumos.org/msg/SMF-8000-C4
   See: /var/svc/log/system-logadm-upgrade:default.log
Impact: This service is not running.
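
With this many stuck services, it can help to rank them by how many dependents each one blocks. The following is a minimal sketch, not part of the original report: `summarize_svcs` is a hypothetical helper that reads saved `svcs -xv` output on stdin and assumes the layout shown above (an unindented `svc:/...` header line, a `State:` line, and an `Impact:` line per service).

```shell
#!/bin/sh
# Hypothetical helper (not from the original thread): summarize
# `svcs -xv`-style output as "<dependents><TAB><state><TAB><FMRI>",
# sorted so the service blocking the most dependents comes first.
# Assumes the layout shown in the output above.
summarize_svcs() {
    awk '
        # Unindented header line: the FMRI is the first field.
        /^svc:\// { fmri = $1 }

        # " State: offline since ..." -> second field is the state.
        /^ *State:/ { state = $2 }

        # "Impact: 26 dependent services ..." carries a count in field 2;
        # "Impact: This service is not running." does not, so report 0.
        /^Impact:/ {
            n = ($2 ~ /^[0-9]+$/) ? $2 : 0
            printf "%d\t%s\t%s\n", n, state, fmri
        }
    ' | sort -rn
}

# Example (on the affected sled):
#   svcs -xv | summarize_svcs
```

Run against the output above, this would list `svc:/network/physical:default` first with 26 dependents, which is why that service's log is the natural place to look next.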

Physical networking is angry about the missing egrep:

root@:/var/adm# cat /var/svc/log/network-physical:default.log
[ Dec 28 00:00:03 Enabled. ]
[ Dec 28 00:00:04 Executing start method ("/lib/svc/method/net-physical"). ]
[ Dec 28 00:00:04 Timeout override by svc.startd.  Using infinite timeout. ]
/lib/svc/method/net-physical[469]: egrep: not found [No such file or directory]
dladm: warning: kstat open operation failed

jclulow commented 1 year ago

How did we decide to remove egrep?

jgallagher commented 1 year ago

It looks like this is another casualty of https://github.com/oxidecomputer/helios/pull/88, but I'm a little confused about how it worked on the remaining sleds. Is the use of egrep conditional (and not particularly common)?

rcgoodfellow commented 1 year ago

It does not appear to be conditional.

However, this was in maintenance/single-user mode. So it's not clear if this is an artifact of being in maintenance mode or the reason we are in maintenance mode.
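
Either way, a quick check for this class of failure is to confirm that the commands a start method shells out to actually resolve in the image. The following is a rough sketch under stated assumptions: `missing_cmds` is a hypothetical helper, and the command names in the example are illustrative. Only `egrep` and `dladm` appear in the log above.

```shell
#!/bin/sh
# Hypothetical helper (not from the original thread): print each named
# command that cannot be resolved on the current PATH, one per line.
# Returns non-zero if at least one command is missing.
missing_cmds() {
    rc=0
    for cmd in "$@"; do
        # `command -v` is the POSIX way to ask the shell to resolve a name.
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "$cmd"
            rc=1
        fi
    done
    return "$rc"
}

# Example (names are illustrative; adjust to what the method script calls):
#   missing_cmds egrep dladm || echo "image is missing commands"
```

This only tells you whether a name resolves in the shell you run it from. It does not show whether a given call site in `/lib/svc/method/net-physical` is actually reached, so it cannot by itself settle whether the missing egrep put the sled into maintenance mode.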

jgallagher commented 1 year ago

egrep was restored in https://github.com/oxidecomputer/helios/pull/96, so if its absence was the root cause, this is fixed. Leaving this open but changing the milestone for now, until (or unless) we see it again.