oxidecomputer / omicron

Omicron: Oxide control plane
Mozilla Public License 2.0

sled-agent leaks contracts when executing commands in an NGZ #3753

Closed citrus-it closed 1 year ago

citrus-it commented 1 year ago

I came across a gimlet in a state this morning where I was unable to log in because the SSH server could not fork.

We're seeing a lot of fork failures from sled-agent too.

BRM42220014 # dtrace -n 'forksys:return{@[execname,arg1==-1,errno]=count()}'
dtrace: description 'forksys:return' matched 2 probes
^C

  sled-agent                                                1       11               32
  devfsadm                                                  0        0               36
  ksh93                                                     0        0               36
  tfportd                                                   0        0
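In the aggregation above the columns are the execname, whether forksys returned -1, the errno, and the count: sled-agent's forks are failing with errno 11, which matches the "os error 11" seen later in its logs. A quick way to decode that errno (assuming python3 is on the PATH):

```shell
# Decode errno 11 from the dtrace aggregation above
python3 -c 'import errno, os; print(errno.errorcode[11], os.strerror(11))'
# → EAGAIN Resource temporarily unavailable
```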

and, in mdb -k, the zone's miscellaneous fork-failure counter (zone_ffmisc) is increasing:

> zone0::print ! grep zone_ff
    zone_ffcap = 0
    zone_ffnoproc = 0
    zone_ffnomem = 0
    zone_ffmisc = 0x4976
> zone0::print ! grep zone_ff
    zone_ffcap = 0
    zone_ffnoproc = 0
    zone_ffnomem = 0
    zone_ffmisc = 0x4979
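The mdb values are hex; converting them shows the two samples are three failures apart:

```shell
printf '%d\n' 0x4976    # 18806 misc fork failures at the first sample
printf '%d\n' 0x4979    # 18809 at the second, i.e. three more in between
```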

After a bit of tracing, we find that the failing function is contract_process_fork():

BRM42220014 # dtrace -n 'contract_process_fork:return/execname=="sled-agent"/{trace(arg1)}'
dtrace: description 'contract_process_fork:return' matched 1 probe
CPU     ID                    FUNCTION:NAME
  0  38886     contract_process_fork:return                 0
 52  38886     contract_process_fork:return                 0
 52  38886     contract_process_fork:return                 0
 52  38886     contract_process_fork:return                 0
 50  38886     contract_process_fork:return                 0
111  38886     contract_process_fork:return                 0

How many contracts does sled agent have?

BRM42220014 # ptree `pgrep sled-agent`
652    ctrun -l child -o noorphan,regent /opt/oxide/sled-agent/sled-agent run /
  654    /opt/oxide/sled-agent/sled-agent run /opt/oxide/sled-agent/pkg/config.

BRM42220014 # ctstat | awk '$5 == 652 || $5 == 654 { print $5 }' | sort | uniq -c
   1 652
9964 654

That 9964 is suspiciously close to 10,000. What's the contract limit for sled-agent?

BRM42220014 # prctl -i process -n project.max-contracts `pgrep sled-agent`
process: 654: /opt/oxide/sled-agent/sled-agent run /opt/oxide/sled-agent/pkg/config.
NAME    PRIVILEGE       VALUE    FLAG   ACTION                       RECIPIENT
project.max-contracts
        privileged      10.0K       -   deny                                 -
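That accounts for the EAGAIN failures: with the privileged limit at 10.0K (10,000) contracts and 9964 already held, plus whatever the live processes are using, sled-agent has almost no headroom left:

```shell
# Contract slots remaining before the project.max-contracts limit denies fork
echo $((10000 - 9964))    # → 36
```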

Picking one of them:

BRM42220014 # ctstat -av -i 10275
CTID    ZONEID  TYPE    STATE   HOLDER  EVENTS  QTIME   NTIME
10275   0       process owned   654     0       -       -
    cookie:                0
    informative event set: none
    critical event set:    none
    fatal event set:       hwerr
    parameter set:         pgrponly regent
    member processes:      none
    inherited contracts:   none
    service fmri:          svc:/oxide/sled-agent:default
    service fmri ctid:     60
    creator:               sled-agent
    aux:

The problem here seems to be that sled-agent creates a new contract each time it runs a command inside a zone, but leaves the contract behind once the child process has completed, so contracts accumulate until the project.max-contracts limit is reached and further forks fail.

jordanhendricks commented 1 year ago

The contract issue aside, I was curious why sled agent ended up running so many commands that it hit this resource limit. It seems that it is repeatedly running chronyc -c tracking inside the ntp zone:

BRM42220014 # grep "command failed" $(svcs -L sled-agent) | looker | head
03:15:08.889Z INFO SledAgent (BootstrapAgent): chronyc command failed: Error running command in zone 'oxz_ntp_99670b05-463d-464a-9a3d-24a91c53b20b': Failed to start execution of [/usr/bin/chronyc -c tracking]: Resource temporarily unavailable (os error 11)
03:15:09.733Z INFO SledAgent (BootstrapAgent): chronyc command failed: Error running command in zone 'oxz_ntp_99670b05-463d-464a-9a3d-24a91c53b20b': Failed to start execution of [/usr/bin/chronyc -c tracking]: Resource temporarily unavailable (os error 11)
03:15:10.519Z INFO SledAgent (BootstrapAgent): chronyc command failed: Error running command in zone 'oxz_ntp_99670b05-463d-464a-9a3d-24a91c53b20b': Failed to start execution of [/usr/bin/chronyc -c tracking]: Resource temporarily unavailable (os error 11)
03:15:11.289Z INFO SledAgent (BootstrapAgent): chronyc command failed: Error running command in zone 'oxz_ntp_99670b05-463d-464a-9a3d-24a91c53b20b': Failed to start execution of [/usr/bin/chronyc -c tracking]: Resource temporarily unavailable (os error 11)
03:15:12.550Z INFO SledAgent (BootstrapAgent): chronyc command failed: Error running command in zone 'oxz_ntp_99670b05-463d-464a-9a3d-24a91c53b20b': Failed to start execution of [/usr/bin/chronyc -c tracking]: Resource temporarily unavailable (os error 11)
03:15:13.580Z INFO SledAgent (BootstrapAgent): chronyc command failed: Error running command in zone 'oxz_ntp_99670b05-463d-464a-9a3d-24a91c53b20b': Failed to start execution of [/usr/bin/chronyc -c tracking]: Resource temporarily unavailable (os error 11)
03:15:14.385Z INFO SledAgent (BootstrapAgent): chronyc command failed: Error running command in zone 'oxz_ntp_99670b05-463d-464a-9a3d-24a91c53b20b': Failed to start execution of [/usr/bin/chronyc -c tracking]: Resource temporarily unavailable (os error 11)
03:15:15.026Z INFO SledAgent (BootstrapAgent): chronyc command failed: Error running command in zone 'oxz_ntp_99670b05-463d-464a-9a3d-24a91c53b20b': Failed to start execution of [/usr/bin/chronyc -c tracking]: Resource temporarily unavailable (os error 11)
03:15:15.971Z INFO SledAgent (BootstrapAgent): chronyc command failed: Error running command in zone 'oxz_ntp_99670b05-463d-464a-9a3d-24a91c53b20b': Failed to start execution of [/usr/bin/chronyc -c tracking]: Resource temporarily unavailable (os error 11)
03:15:17.351Z INFO SledAgent (BootstrapAgent): chronyc command failed: Error running command in zone 'oxz_ntp_99670b05-463d-464a-9a3d-24a91c53b20b': Failed to start execution of [/usr/bin/chronyc -c tracking]: Resource temporarily unavailable (os error 11)
thread 'main' panicked at 'failed printing to stdout: Broken pipe (os error 32)', library/std/src/io/stdio.rs:1019:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
BRM42220014 # grep "command failed" $(svcs -L sled-agent) | wc -l
   25483
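The retry cadence is visible in the timestamps of the head output above: ten attempts between 03:15:08.889 and 03:15:17.351, i.e. a new chronyc invocation roughly every second, which is how the log accumulated tens of thousands of these entries:

```shell
# Average gap between the ten consecutive log entries shown above
awk 'BEGIN { printf "%.2f\n", (17.351 - 8.889) / 9 }'    # → 0.94 seconds
```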

Presumably, those are all failing due to the contract resource exhaustion already identified. But what was the first failure we saw? To see that, we have to go back to an already rotated log:

BRM42220014 # grep "command failed" /var/svc/log/oxide-sled-agent\:default.log.0  | looker
00:04:29.877Z INFO SledAgent (BootstrapAgent): chronyc command failed: Error running command in zone 'oxz_ntp_99670b05-463d-464a-9a3d-24a91c53b20b': Command [/usr/bin/chronyc -c tracking] executed and failed with status: exit status: 1  stdout: 506 Cannot talk to daemon
      stderr: 
00:04:30.217Z INFO SledAgent (BootstrapAgent): chronyc command failed: Error running command in zone 'oxz_ntp_99670b05-463d-464a-9a3d-24a91c53b20b': Command [/usr/bin/chronyc -c tracking] executed and failed with status: exit status: 1  stdout: 506 Cannot talk to daemon
      stderr: 

The ntp zone exists, but I didn't find much in the ntp service logs:

root@oxz_ntp_99670b05-463d-464a-9a3d-24a91c53b20b:~# cat $(svcs -L ntp) 
[ Dec 28 00:04:25 Disabled. ]
[ Dec 28 00:04:25 Rereading configuration. ]
[ Dec 28 00:04:29 Rereading configuration. ]
[ Dec 28 00:04:29 Rereading configuration. ]
[ Dec 28 00:04:29 Enabled. ]
[ Dec 28 00:04:29 Executing start method ("/var/svc/method/svc-site-ntp start"). ]
NTP Service Configuration
-------------------------
 Servers: 8d3d0f3f-a108-4bb1-93eb-4350bc966644.host.control-plane.oxide.internal acfdc98a-4c5a-4fa9-a778-dfa63c063f2c.host.control-plane.oxide.internal
   Allow: fd00:1122:3344:100::/56
Boundary: false
Template: /etc/inet/chrony.conf.internal
  Config: /etc/inet/chrony.conf
* Updating logadm
* Starting daemon
[ Dec 28 00:04:30 Method "start" exited with status 0. ]

I've put the first sled agent log on catacomb:

jordan@catacomb ~ $ ls /data/staff/dogfood/jul-24/omicron-3753/
oxide-sled-agent:default.log.0
askfongjojo commented 1 year ago

@citrus-it: Can we close this with https://github.com/oxidecomputer/omicron/pull/3761, or is there some remaining work?

citrus-it commented 1 year ago

There was a follow-up part in #3765, but that's integrated too. Closing as fixed.