The contract issue aside, I was curious why sled agent ended up running so many commands over and over again, thus hitting this resource limit. It seems that it is repeatedly trying to run chronyc -c tracking inside the ntp zone:
BRM42220014 # grep "command failed" $(svcs -L sled-agent) | looker | head
03:15:08.889Z INFO SledAgent (BootstrapAgent): chronyc command failed: Error running command in zone 'oxz_ntp_99670b05-463d-464a-9a3d-24a91c53b20b': Failed to start execution of [/usr/bin/chronyc -c tracking]: Resource temporarily unavailable (os error 11)
03:15:09.733Z INFO SledAgent (BootstrapAgent): chronyc command failed: Error running command in zone 'oxz_ntp_99670b05-463d-464a-9a3d-24a91c53b20b': Failed to start execution of [/usr/bin/chronyc -c tracking]: Resource temporarily unavailable (os error 11)
03:15:10.519Z INFO SledAgent (BootstrapAgent): chronyc command failed: Error running command in zone 'oxz_ntp_99670b05-463d-464a-9a3d-24a91c53b20b': Failed to start execution of [/usr/bin/chronyc -c tracking]: Resource temporarily unavailable (os error 11)
03:15:11.289Z INFO SledAgent (BootstrapAgent): chronyc command failed: Error running command in zone 'oxz_ntp_99670b05-463d-464a-9a3d-24a91c53b20b': Failed to start execution of [/usr/bin/chronyc -c tracking]: Resource temporarily unavailable (os error 11)
03:15:12.550Z INFO SledAgent (BootstrapAgent): chronyc command failed: Error running command in zone 'oxz_ntp_99670b05-463d-464a-9a3d-24a91c53b20b': Failed to start execution of [/usr/bin/chronyc -c tracking]: Resource temporarily unavailable (os error 11)
03:15:13.580Z INFO SledAgent (BootstrapAgent): chronyc command failed: Error running command in zone 'oxz_ntp_99670b05-463d-464a-9a3d-24a91c53b20b': Failed to start execution of [/usr/bin/chronyc -c tracking]: Resource temporarily unavailable (os error 11)
03:15:14.385Z INFO SledAgent (BootstrapAgent): chronyc command failed: Error running command in zone 'oxz_ntp_99670b05-463d-464a-9a3d-24a91c53b20b': Failed to start execution of [/usr/bin/chronyc -c tracking]: Resource temporarily unavailable (os error 11)
03:15:15.026Z INFO SledAgent (BootstrapAgent): chronyc command failed: Error running command in zone 'oxz_ntp_99670b05-463d-464a-9a3d-24a91c53b20b': Failed to start execution of [/usr/bin/chronyc -c tracking]: Resource temporarily unavailable (os error 11)
03:15:15.971Z INFO SledAgent (BootstrapAgent): chronyc command failed: Error running command in zone 'oxz_ntp_99670b05-463d-464a-9a3d-24a91c53b20b': Failed to start execution of [/usr/bin/chronyc -c tracking]: Resource temporarily unavailable (os error 11)
03:15:17.351Z INFO SledAgent (BootstrapAgent): chronyc command failed: Error running command in zone 'oxz_ntp_99670b05-463d-464a-9a3d-24a91c53b20b': Failed to start execution of [/usr/bin/chronyc -c tracking]: Resource temporarily unavailable (os error 11)
thread 'main' panicked at 'failed printing to stdout: Broken pipe (os error 32)', library/std/src/io/stdio.rs:1019:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
BRM42220014 # grep "command failed" $(svcs -L sled-agent) | wc -l
25483
Presumably, those are all failing due to the contract resource exhaustion already identified. But what was the first failure we saw? To see that, we have to go back to an already rotated log:
BRM42220014 # grep "command failed" /var/svc/log/oxide-sled-agent\:default.log.0 | looker
00:04:29.877Z INFO SledAgent (BootstrapAgent): chronyc command failed: Error running command in zone 'oxz_ntp_99670b05-463d-464a-9a3d-24a91c53b20b': Command [/usr/bin/chronyc -c tracking] executed and failed with status: exit status: 1 stdout: 506 Cannot talk to daemon
stderr:
00:04:30.217Z INFO SledAgent (BootstrapAgent): chronyc command failed: Error running command in zone 'oxz_ntp_99670b05-463d-464a-9a3d-24a91c53b20b': Command [/usr/bin/chronyc -c tracking] executed and failed with status: exit status: 1 stdout: 506 Cannot talk to daemon
stderr:
The ntp zone exists, but I didn't find much in its ntp service logs:
root@oxz_ntp_99670b05-463d-464a-9a3d-24a91c53b20b:~# cat $(svcs -L ntp)
[ Dec 28 00:04:25 Disabled. ]
[ Dec 28 00:04:25 Rereading configuration. ]
[ Dec 28 00:04:29 Rereading configuration. ]
[ Dec 28 00:04:29 Rereading configuration. ]
[ Dec 28 00:04:29 Enabled. ]
[ Dec 28 00:04:29 Executing start method ("/var/svc/method/svc-site-ntp start"). ]
NTP Service Configuration
-------------------------
Servers: 8d3d0f3f-a108-4bb1-93eb-4350bc966644.host.control-plane.oxide.internal acfdc98a-4c5a-4fa9-a778-dfa63c063f2c.host.control-plane.oxide.internal
Allow: fd00:1122:3344:100::/56
Boundary: false
Template: /etc/inet/chrony.conf.internal
Config: /etc/inet/chrony.conf
* Updating logadm
* Starting daemon
[ Dec 28 00:04:30 Method "start" exited with status 0. ]
I've put the first sled agent log on catacomb:
jordan@catacomb ~ $ ls /data/staff/dogfood/jul-24/omicron-3753/
oxide-sled-agent:default.log.0
@citrus-it: Can we close this with https://github.com/oxidecomputer/omicron/pull/3761 or is there some remaining work?
There was a followup part in #3765 but that's integrated too. Closing as fixed.
I came across a gimlet in a state this morning where I was unable to log in because the SSH server could not fork.
We're seeing a lot of fork failures from sled-agent too, and we're seeing the misc fork failure counter increasing:
After a bit of tracing, we find that the failing function is contract_process_fork():
How many contracts does sled agent have?
That 9964 is suspiciously close to 10,000. What's the contract limit for sled-agent?
Picking one of them:
The problem here seems to be that sled-agent is creating a new contract for running a command inside a zone, but it is allowing the contract to remain around once the child process has completed.
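To make the failure mode concrete, here is a toy model of it in Rust. It does not use the real illumos contract API. `ContractTable`, `fork_in_new_contract`, `abandon`, and `run_commands` are all hypothetical names, and the limit of 10,000 is only the guess from above, not a confirmed value. The sketch just shows that if every command gets a new contract and nothing releases it after the child exits, forks start failing with the equivalent of EAGAIN once the table is full, whereas releasing each contract keeps the count flat.

```rust
use std::collections::HashSet;

/// Toy error type; `Again` stands in for
/// "Resource temporarily unavailable (os error 11)".
#[derive(Debug, PartialEq)]
enum ForkError {
    Again,
}

/// Hypothetical model of the contracts held by one process.
/// This is NOT the illumos contract subsystem, only a bounded set of IDs.
struct ContractTable {
    limit: usize,
    live: HashSet<u64>,
    next_id: u64,
}

impl ContractTable {
    fn new(limit: usize) -> Self {
        ContractTable { limit, live: HashSet::new(), next_id: 0 }
    }

    /// Models forking a child into a fresh contract.
    /// Fails once the number of live contracts reaches the limit.
    fn fork_in_new_contract(&mut self) -> Result<u64, ForkError> {
        if self.live.len() >= self.limit {
            return Err(ForkError::Again);
        }
        let id = self.next_id;
        self.next_id += 1;
        self.live.insert(id);
        Ok(id)
    }

    /// Models releasing a contract after its child has exited.
    fn abandon(&mut self, id: u64) {
        self.live.remove(&id);
    }
}

/// Runs `n` short-lived commands and returns how many forks failed.
/// With `abandon_after_exit == false` every contract is leaked.
fn run_commands(table: &mut ContractTable, n: usize, abandon_after_exit: bool) -> usize {
    let mut failures = 0;
    for _ in 0..n {
        match table.fork_in_new_contract() {
            Ok(id) => {
                // The child runs and exits here; only the contract remains.
                if abandon_after_exit {
                    table.abandon(id);
                }
            }
            Err(ForkError::Again) => failures += 1,
        }
    }
    failures
}

fn main() {
    // Assumed limit; the real value for sled-agent is not confirmed above.
    let limit = 10_000;

    let mut leaky = ContractTable::new(limit);
    let leaky_failures = run_commands(&mut leaky, 12_000, false);
    println!("leaky: {} live contracts, {} failed forks", leaky.live.len(), leaky_failures);

    let mut tidy = ContractTable::new(limit);
    let tidy_failures = run_commands(&mut tidy, 12_000, true);
    println!("tidy:  {} live contracts, {} failed forks", tidy.live.len(), tidy_failures);
}
```

In this model the leaky variant fills the table and every later fork fails, which matches the shape of the repeated chronyc failures in the log. The variant that releases the contract after each command never accumulates any.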