Closed: leftwo closed this issue 7 months ago
The bench gimlet is using 7 of the real SSDs in the box, and has a crucible zone for each of these 7.
Some crucible zones do report that services have failed:
root@EVT22200005:/alan# svcs -xZ
svc:/oxide/crucible/downstairs:downstairs-9d65c793-105b-49e3-bbeb-440afe05bc24 (Oxide Crucible Downstairs)
Zone: oxz_crucible_18e85760-0da4-4369-943d-9927e7569d54
State: maintenance since Wed Feb 21 02:36:34 2024
Reason: Method failed repeatedly.
See: http://illumos.org/msg/SMF-8000-8Q
See: /pool/ext/31bd71cd-4736-4a12-a387-9b74b050396f/crypt/zone/oxz_crucible_18e85760-0da4-4369-943d-9927e7569d54/root/var/svc/log/oxide-crucible-downstairs:downstairs-9d65c793-105b-49e3-bbeb-440afe05bc24.log
Impact: This service is not running.
svc:/network/physical:default (physical network interfaces)
Zone: oxz_crucible_a2f16190-d0cf-43a7-87bb-49bbddfa5c9a
State: offline since Wed Feb 21 02:43:24 2024
Reason: Start method is running.
See: http://illumos.org/msg/SMF-8000-C4
See: ifconfig(8)
See: /pool/ext/f4b4dc87-ab46-49fb-a4b4-d361ae214c03/crypt/zone/oxz_crucible_a2f16190-d0cf-43a7-87bb-49bbddfa5c9a/root/var/svc/log/network-physical:default.log
Impact: This service is not running.
svc:/oxide/crucible/agent:default (Oxide Crucible Downstairs)
Zone: oxz_crucible_a2f16190-d0cf-43a7-87bb-49bbddfa5c9a
State: offline since Wed Feb 21 02:43:24 2024
Reason: Start method is running.
See: http://illumos.org/msg/SMF-8000-C4
See: /pool/ext/f4b4dc87-ab46-49fb-a4b4-d361ae214c03/crypt/zone/oxz_crucible_a2f16190-d0cf-43a7-87bb-49bbddfa5c9a/root/var/svc/log/oxide-crucible-agent:default.log
Impact: This service is not running.
svc:/oxide/crucible/downstairs:downstairs-94ac7ea8-f434-41b9-8389-771a2f9754c4 (Oxide Crucible Downstairs)
Zone: oxz_crucible_a2f16190-d0cf-43a7-87bb-49bbddfa5c9a
State: maintenance since Wed Feb 21 02:36:34 2024
Reason: Method failed repeatedly.
See: http://illumos.org/msg/SMF-8000-8Q
See: /pool/ext/f4b4dc87-ab46-49fb-a4b4-d361ae214c03/crypt/zone/oxz_crucible_a2f16190-d0cf-43a7-87bb-49bbddfa5c9a/root/var/svc/log/oxide-crucible-downstairs:downstairs-94ac7ea8-f434-41b9-8389-771a2f9754c4.log
Impact: This service is not running.
I would expect problems trying to delete a disk if the agent had died in the zone containing one of that disk's downstairs.
From the above, I see one zone with a failed agent: oxz_crucible_a2f16190-d0cf-43a7-87bb-49bbddfa5c9a
I see another zone with a failed downstairs: svc:/oxide/crucible/downstairs:downstairs-9d65c793-105b-49e3-bbeb-440afe05bc24
This second one is in Zone: oxz_crucible_18e85760-0da4-4369-943d-9927e7569d54
Agent logs at: /staff/core/issues/crucible-1177
Looking at the agent log, I can see disk creations flowing through it.
However, at this spot, we have a problem:
02:36:50.006Z INFO crucible-agent (datafile): region 1df00af0-43d8-4f3e-bfb9-5a687b1247d7 state: Requested
02:36:50.010Z INFO crucible-agent (dropshot): request completed
latency_us = 5085
local_addr = [fd00:1122:3344:101::18]:32345
method = POST
remote_addr = [fd00:1122:3344:101::a]:63016
req_id = 3b94b433-aaf2-497d-a041-28a1c50982f6
response_code = 200
uri = /crucible/0/regions
02:36:50.010Z INFO crucible-agent (worker): Region size:1073741824 reservation:1342177280 quota:3221225472
02:36:50.030Z INFO crucible-agent (worker): zfs set reservation of 1342177280 for oxp_f4b4dc87-ab46-49fb-a4b4-d361ae214c03/crucible/regions/1df00af0-43d8-4f3e-bfb9-5a687b1247d7
02:36:50.030Z INFO crucible-agent (worker): zfs set quota of 3221225472 for oxp_f4b4dc87-ab46-49fb-a4b4-d361ae214c03/crucible/regions/1df00af0-43d8-4f3e-bfb9-5a687b1247d7
02:36:50.122Z INFO crucible-agent (worker): creating region Region { id: RegionId("1df00af0-43d8-4f3e-bfb9-5a687b1247d7"), state: Requested, block_size: 4096, extent_size: 16384, extent_count: 16, encrypted: true, port_number: 19075, cert_pem: None, key_pem: None, root_pem: None } at "/data/regions/1df00af0-43d8-4f3e-bfb9-5a687b1247d7"
region = 1df00af0-43d8-4f3e-bfb9-5a687b1247d7
02:36:50.215Z INFO crucible-agent (dropshot): request completed
latency_us = 176
local_addr = [fd00:1122:3344:101::18]:32345
method = POST
remote_addr = [fd00:1122:3344:101::a]:63016
req_id = d71f80f6-e867-4037-93b9-fc192119143a
response_code = 200
uri = /crucible/0/regions
[ Feb 21 02:36:50 Stopping because all processes in service exited. ]
[ Feb 21 02:36:50 Executing start method ("/opt/oxide/lib/svc/manifest/crucible/agent.sh"). ]
We were creating a region, then something happened to the agent and the service restarted. I did not find any core files anywhere in the zone (or on the system at large), so there might be a bug in the coreadm setup somewhere. Back to the agent: it starts right back up, accepts another request to create a region, then dies:
02:36:51.020Z INFO crucible-agent: listen IP: [fd00:1122:3344:101::18]:32345
02:36:51.020Z INFO crucible-agent: SMF instance name downstairs_prefix: "downstairs"
02:36:51.178Z INFO crucible-agent (dropshot): listening
local_addr = [fd00:1122:3344:101::18]:32345
02:36:51.187Z INFO crucible-agent (worker): Region size:1073741824 reservation:1342177280 quota:3221225472
02:36:51.223Z INFO crucible-agent (worker): creating region Region { id: RegionId("1df00af0-43d8-4f3e-bfb9-5a687b1247d7"), state: Requested, block_size: 4096, extent_size: 16384, extent_count: 16, encrypted: true, port_number: 19075, cert_pem: None, key_pem: None, root_pem: None } at "/data/regions/1df00af0-43d8-4f3e-bfb9-5a687b1247d7"
region = 1df00af0-43d8-4f3e-bfb9-5a687b1247d7
[ Feb 21 02:36:51 Stopping because all processes in service exited. ]
[ Feb 21 02:36:51 Executing start method ("/opt/oxide/lib/svc/manifest/crucible/agent.sh"). ]
The agent restarts again, and this time is able to create and destroy a few regions before again running into trouble:
02:36:52.085Z INFO crucible-agent: SMF instance name downstairs_prefix: "downstairs"
02:36:52.239Z INFO crucible-agent (dropshot): listening
local_addr = [fd00:1122:3344:101::18]:32345
02:36:52.249Z INFO crucible-agent (worker): Region size:1073741824 reservation:1342177280 quota:3221225472
02:36:52.285Z INFO crucible-agent (worker): creating region Region { id: RegionId("1df00af0-43d8-4f3e-bfb9-5a687b1247d7"), state: Requested, block_size: 4096, extent_size: 16384, extent_count: 16, encrypted: true, port_number: 19075, cert_pem: None, key_pem: None, root_pem: None } at "/data/regions/1df00af0-43d8-4f3e-bfb9-5a687b1247d7"
region = 1df00af0-43d8-4f3e-bfb9-5a687b1247d7
02:36:52.367Z INFO crucible-agent (worker): region files created ok
region = 1df00af0-43d8-4f3e-bfb9-5a687b1247d7
02:36:52.367Z INFO crucible-agent (datafile): region 1df00af0-43d8-4f3e-bfb9-5a687b1247d7 state: Requested -> Created
...
02:37:22.017Z INFO crucible-agent (datafile): region e327420e-dd0c-4508-a12f-22b894e15289 state: Requested
02:37:22.020Z INFO crucible-agent (dropshot): request completed
latency_us = 4907
local_addr = [fd00:1122:3344:101::18]:32345
method = POST
remote_addr = [fd00:1122:3344:101::a]:50626
req_id = 0e137e78-90f9-4c23-8eef-64211d65ad70
response_code = 200
uri = /crucible/0/regions
02:37:22.021Z INFO crucible-agent (worker): Region size:1073741824 reservation:1342177280 quota:3221225472
02:37:22.040Z INFO crucible-agent (worker): zfs set reservation of 1342177280 for oxp_f4b4dc87-ab46-49fb-a4b4-d361ae214c03/crucible/regions/e327420e-dd0c-4508-a12f-22b894e15289
02:37:22.040Z INFO crucible-agent (worker): zfs set quota of 3221225472 for oxp_f4b4dc87-ab46-49fb-a4b4-d361ae214c03/crucible/regions/e327420e-dd0c-4508-a12f-22b894e15289
02:37:22.128Z INFO crucible-agent (worker): creating region Region { id: RegionId("e327420e-dd0c-4508-a12f-22b894e15289"), state: Requested, block_size: 4096, extent_size: 16384, extent_count: 16, encrypted: true, port_number: 19075, cert_pem: None, key_pem: None, root_pem: None } at "/data/regions/e327420e-dd0c-4508-a12f-22b894e15289"
region = e327420e-dd0c-4508-a12f-22b894e15289
02:37:22.187Z INFO crucible-agent (dropshot): request completed
latency_us = 106
local_addr = [fd00:1122:3344:101::18]:32345
method = POST
remote_addr = [fd00:1122:3344:101::a]:50626
req_id = eedae8c8-2716-4a18-82e2-06907f3c1a33
response_code = 200
uri = /crucible/0/regions
[ Feb 21 02:37:22 Stopping because all processes in service exited. ]
[ Feb 21 02:37:22 Executing start method ("/opt/oxide/lib/svc/manifest/crucible/agent.sh"). ]
Looking at the downstairs that failed, it looks like it was out of resources:
root@oxz_crucible_a2f16190:~# cat /var/svc/log/oxide-crucible-downstairs:downstairs-94ac7ea8-f434-41b9-8389-771a2f9754c4.log.5
[ Feb 21 02:36:34 Disabled. ]
[ Feb 21 02:36:34 Enabled. ]
[ Feb 21 02:36:34 Executing start method ("/opt/oxide/lib/svc/manifest/crucible/downstairs.sh"). ]
thread 'main' panicked at /home/alan/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/scheduler/multi_thread/worker.rs:447:13:
OS can't spawn worker thread: Resource temporarily unavailable (os error 11)
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
[ Feb 21 02:36:34 Stopping because all processes in service exited. ]
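The panic above comes from tokio's worker-spawn path, which unwraps the result of creating an OS thread. At the std level, thread creation is fallible and an EAGAIN from the OS surfaces as an io::Error rather than a panic; a minimal sketch (not crucible code) of what that looks like:

```rust
use std::thread;

fn main() {
    // std::thread::Builder::spawn returns a Result, so an EAGAIN from
    // the OS (too many LWPs) shows up as an io::Error we can inspect.
    match thread::Builder::new().name("worker".into()).spawn(|| 2 + 2) {
        Ok(handle) => println!("worker returned {}", handle.join().unwrap()),
        Err(e) => eprintln!("OS can't spawn worker thread: {e}"),
    }
}
```

Tokio chooses to panic instead because it has no way to make progress without its worker threads.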
After an update to Omicron and helios bits, I slowly created disks till I saw a failure. Initially, I was not getting a core file when the agent would dump core. I updated coreadm inside the crucible zone and set it to dump core in /tmp, and I was able to generate a few cores.
From one, here is the stack trace:
> ::status
debugging core file of crucible-downst (64-bit) from oxz_crucible_52624c54-730e-434b-a8fe-bc274b0e1068
file: /opt/oxide/crucible/bin/crucible-downstairs
initial argv: /opt/oxide/crucible/bin/crucible-downstairs create --uuid a5ea81c0-cb1e-40e5-95
threading model: native threads
status: process terminated by SIGABRT (Abort), pid=7137 uid=0 code=-1
> $C ! /tmp/demangle
fffff9ffffdfab70 libc.so.1`_lwp_kill+0xa()
fffff9ffffdfaba0 libc.so.1`raise+0x22(6)
fffff9ffffdfabf0 libc.so.1`abort+0x58()
fffff9ffffdfac00 panic_abort::__rust_start_panic::abort::h97b10551a860ee16+9()
fffff9ffffdfac10 __rust_start_panic+9()
fffff9ffffdfac70 rust_panic+0xd()
fffff9ffffdfad30 std::panicking::rust_panic_with_hook::h1dc68ec7979d63c5+0x2af()
fffff9ffffdfad70 std::panicking::begin_panic_handler::{{closure}}::h73dc9943db337240+0xa0()
fffff9ffffdfad80 std::sys_common::backtrace::__rust_end_short_backtrace::hd0b2ffcbdcddf45c+9()
fffff9ffffdfadb0 rust_begin_unwind+0x44()
fffff9ffffdfadf0 core::panicking::panic_fmt::hf2cd841661acd9ba+0x35()
fffff9ffffdfae70 core::result::unwrap_failed::hd6b81b0a1c6c351a+0x75()
fffff9ffffdfaf50 slog_async::AsyncCoreBuilder<D>::build_no_guard::h3a7942f4d314a4e8+0x1f4()
fffff9ffffdfb4c0 crucible_common::build_logger::hc3ee6a1c9781a657+0x296()
fffff9ffffdfcba0 crucible_downstairs::main::_$u7b$$u7b$closure$u7d$$u7d$::h95ff0b004965dd2d (.llvm.2801909585092955357)+0xf3()
fffff9ffffdfd740 tokio::runtime::park::CachedParkThread::block_on::h4f390498fd03c179+0x99()
fffff9ffffdfe320 tokio::runtime::context::runtime::enter_runtime::hd57aa73e2b8e1b3e+0x145()
fffff9ffffdfeeb0 tokio::runtime::runtime::Runtime::block_on::he75572bbeaa878ca+0x5d()
fffff9ffffdffb80 crucible_downstairs::main::hd590936a926cbd1f+0x94()
fffff9ffffdffb90 std::sys_common::backtrace::__rust_begin_short_backtrace::h2b86b629e7b17e25+6()
fffff9ffffdffbf0 std::rt::lang_start::{{closure}}::hb1df1666a90959b0+0x11()
fffff9ffffdffcd0 std::rt::lang_start_internal::h8fd2fe8457b30241+0x2c9()
fffff9ffffdffcf0 std::rt::lang_start::h132a478d446dfe70+0x25()
fffff9ffffdffd20 _start_crt+0x87()
fffff9ffffdffd30 _start+0x18()
Note that while I did get a core file, I did not get any messages in the agent log indicating there was a core, other than seeing that the service had restarted.
Here is the agent log before and after the core:
05:07:43.952Z INFO crucible-agent (dropshot): listening
local_addr = [fd00:1122:3344:101::15]:32345
05:07:43.961Z INFO crucible-agent (worker): Region size:1073741824 reservation:1342177280 quota:3221225472
05:07:43.996Z INFO crucible-agent (worker): creating region Region { id: RegionId("a5ea81c0-cb1e-40e5-9583-6a01c40a2736"), state: Requested, block_size: 4096, extent_size: 16384, extent_count: 16, encrypted: true, port_number: 19062, cert_pem: None, key_pem: None, root_pem: None } at "/data/regions/a5ea81c0-cb1e-40e5-9583-6a01c40a2736"
region = a5ea81c0-cb1e-40e5-9583-6a01c40a2736
05:07:43.996Z INFO crucible-agent (worker): command is: env -i "/opt/oxide/crucible/bin/crucible-downstairs" "create" "--uuid" "a5ea81c0-cb1e-40e5-9583-6a01c40a2736" "--data" "/data/regions/a5ea81c0-cb1e-40e5-9583-6a01c40a2736" "--block-size" "4096" "--extent-size" "16384" "--extent-count" "16" "--encrypted"
region = a5ea81c0-cb1e-40e5-9583-6a01c40a2736
[ Feb 24 05:07:44 Stopping because all processes in service exited. ]
(Note: the log line with "command is: " was added for debugging and is not present in the source.)
I'm able to reproduce (sometimes after multiple tries) a failure in crucible-downstairs create, which is one of the things the crucible-agent will run as a subprocess.
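That agent-side pattern, shelling out and distinguishing a clean exit from a crash, can be sketched roughly as below (run_create and the argument values are hypothetical illustrations, not the actual agent code; compare the "command is:" debug line later in this report):

```rust
use std::process::Command;

/// Hypothetical sketch: run a subcommand and distinguish a clean exit
/// from a failure (such as the downstairs panicking on EAGAIN).
fn run_create(program: &str, args: &[&str]) -> Result<(), String> {
    let status = Command::new(program)
        .args(args)
        .status()
        .map_err(|e| format!("failed to spawn {program}: {e}"))?;
    if status.success() {
        Ok(())
    } else {
        Err(format!("{program} exited with {status}"))
    }
}

fn main() {
    // Stand-in for: crucible-downstairs create --uuid ... --data ...
    match run_create("echo", &["create", "--uuid", "dummy"]) {
        Ok(()) => println!("subprocess ok"),
        Err(e) => eprintln!("{e}"),
    }
}
```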
If I truss the failing command, it looks like this:
root@oxz_crucible_06f535c7:~# truss /opt/oxide/crucible/bin/crucible-downstairs create --uuid 9403b3c6-a59c-4209-b765-421a4c1886f3 --data /data/regions/9403b3c6-a59c-4209-b765-421a4c1886f3 --block-size 4096 --extent-size 16384 --extent-count 240 --encrypted
And one of the failures looks like this:
/102: yield() = 0
/104: sigaltstack(0xFFFFF9FFD3DFFF30, 0x00000000) = 0
/103: yield() = 0
/102: yield() = 0
/100: yield() = 0
/103: yield() = 0
/100: yield() = 0
/101: open("/proc/self/lwp/101/lwpname", O_WRONLY) = 10
/104: open("/proc/self/lwp/104/lwpname", O_WRONLY) = 11
/103: yield() = 0
/102: yield() = 0
/101: write(10, " t o k i o - r u n t i m".., 21) = 21
/104: write(11, " t o k i o - r u n t i m".., 21) = 21
/101: close(10) = 0
/104: close(11) = 0
/1: lwp_create(0xFFFFF9FFFFDFE240, LWP_SUSPENDED, 0xFFFFF9FFFFDFE23C) = 105
truss: cannot create lwp to follow child lwp
The other failure I see is from crucible-downstairs. Its truss output looks like this:
/1: uucopy(0xFFFFF9FFFFDFE130, 0xFFFFF9FFE01FFFE8, 24) = 0
/1: lwp_create(0xFFFFF9FFFFDFE240, LWP_SUSPENDED, 0xFFFFF9FFFFDFE23C) Err#11 EAGAIN
/1: uucopy(0xFFFFF9FFFFDFE130, 0xFFFFF9FFE01FFFE8, 24) = 0
/1: lwp_create(0xFFFFF9FFFFDFE240, LWP_SUSPENDED, 0xFFFFF9FFFFDFE23C) Err#11 EAGAIN
/1: uucopy(0xFFFFF9FFFFDFE130, 0xFFFFF9FFE01FFFE8, 24) = 0
/1: lwp_create(0xFFFFF9FFFFDFE240, LWP_SUSPENDED, 0xFFFFF9FFFFDFE23C) Err#11 EAGAIN
/1: uucopy(0xFFFFF9FFFFDFE130, 0xFFFFF9FFE01FFFE8, 24) = 0
/1: lwp_create(0xFFFFF9FFFFDFE240, LWP_SUSPENDED, 0xFFFFF9FFFFDFE23C) Err#11 EAGAIN
/1: uucopy(0xFFFFF9FFFFDFE130, 0xFFFFF9FFE01FFFE8, 24) = 0
/1: lwp_create(0xFFFFF9FFFFDFE240, LWP_SUSPENDED, 0xFFFFF9FFFFDFE23C) Err#11 EAGAIN
/1: uucopy(0xFFFFF9FFFFDFE130, 0xFFFFF9FFE01FFFE8, 24) = 0
/1: lwp_create(0xFFFFF9FFFFDFE240, LWP_SUSPENDED, 0xFFFFF9FFFFDFE23C) Err#11 EAGAIN
/1: uucopy(0xFFFFF9FFFFDFE130, 0xFFFFF9FFE01FFFE8, 24) = 0
/1: lwp_create(0xFFFFF9FFFFDFE240, LWP_SUSPENDED, 0xFFFFF9FFFFDFE23C) Err#11 EAGAIN
/1: uucopy(0xFFFFF9FFFFDFE130, 0xFFFFF9FFE01FFFE8, 24) = 0
/1: lwp_create(0xFFFFF9FFFFDFE240, LWP_SUSPENDED, 0xFFFFF9FFFFDFE23C) Err#11 EAGAIN
/1: uucopy(0xFFFFF9FFFFDFE130, 0xFFFFF9FFE01FFFE8, 24) = 0
/1: lwp_create(0xFFFFF9FFFFDFE240, LWP_SUSPENDED, 0xFFFFF9FFFFDFE23C) Err#11 EAGAIN
/1: uucopy(0xFFFFF9FFFFDFE130, 0xFFFFF9FFE01FFFE8, 24) = 0
/1: lwp_create(0xFFFFF9FFFFDFE240, LWP_SUSPENDED, 0xFFFFF9FFFFDFE23C) Err#11 EAGAIN
/1: uucopy(0xFFFFF9FFFFDFE130, 0xFFFFF9FFE01FFFE8, 24) = 0
/1: lwp_create(0xFFFFF9FFFFDFE240, LWP_SUSPENDED, 0xFFFFF9FFFFDFE23C) Err#11 EAGAIN
/1: uucopy(0xFFFFF9FFFFDFE130, 0xFFFFF9FFE01FFFE8, 24) = 0
/1: lwp_create(0xFFFFF9FFFFDFE240, LWP_SUSPENDED, 0xFFFFF9FFFFDFE23C) Err#11 EAGAIN
/1: uucopy(0xFFFFF9FFFFDFE130, 0xFFFFF9FFE01FFFE8, 24) = 0
/1: lwp_create(0xFFFFF9FFFFDFE240, LWP_SUSPENDED, 0xFFFFF9FFFFDFE23C) Err#11 EAGAIN
/1: uucopy(0xFFFFF9FFFFDFE130, 0xFFFFF9FFE01FFFE8, 24) = 0
/1: lwp_create(0xFFFFF9FFFFDFE240, LWP_SUSPENDED, 0xFFFFF9FFFFDFE23C) Err#11 EAGAIN
/1: uucopy(0xFFFFF9FFFFDFE130, 0xFFFFF9FFE01FFFE8, 24) = 0
/1: lwp_create(0xFFFFF9FFFFDFE240, LWP_SUSPENDED, 0xFFFFF9FFFFDFE23C) Err#11 EAGAIN
/1: brk(0x02135000) = 0
/1: ioctl(1, TCGETA, 0xFFFFF9FFFFDFADB0) = 0
/1: stat("/root/.terminfo", 0xFFFFF9FFFFDFA770) Err#2 ENOENT
/1: stat("/etc/terminfo", 0xFFFFF9FFFFDFA770) Err#2 ENOENT
/1: stat("/lib/terminfo", 0xFFFFF9FFFFDFA770) Err#2 ENOENT
/1: stat("/usr/share/terminfo", 0xFFFFF9FFFFDFA770) = 0
/1: stat("/usr/share/terminfo/t/tmux-256color", 0xFFFFF9FFFFDFA770) Err#2 ENOENT
/1: stat("/usr/share/terminfo/74/tmux-256color", 0xFFFFF9FFFFDFA770) Err#2 ENOENT
/1: stat("/boot/system/data/terminfo", 0xFFFFF9FFFFDFA770) Err#2 ENOENT
/1: ioctl(2, TCGETA, 0xFFFFF9FFFFDFAD70) = 0
/1: brk(0x02145000) = 0
/1: uucopy(0xFFFFF9FFFFDFA6B0, 0xFFFFF9FFE01FFFE8, 24) = 0
/1: lwp_create(0xFFFFF9FFFFDFA7C0, LWP_SUSPENDED, 0xFFFFF9FFFFDFA7BC) Err#11 EAGAIN
/1: mmap(0x00010000, 65536, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANON|MAP_ALIGN, 4294967295, 0) = 0xFFFFF9FFEEBB0000
thread '/1: write(2, " t h r e a d '", 8) = 8
main/1: write(2, " m a i n", 4) = 4
' panicked at /1: write(2, " ' p a n i c k e d a".., 14) = 14
/home/alan/.cargo/registry/src/index.crates.io-6f17d22bba15001f/slog-async-2.8.0/lib.rs/1: write(2, " / h o m e / a l a n / .".., 87) = 87
:/1: write(2, " :", 1) = 1
319/1: write(2, " 3 1 9", 3) = 3
:/1: write(2, " :", 1) = 1
14/1: write(2, " 1 4", 2) = 2
:
/1: write(2, " :\n", 2) = 2
called `Result::unwrap()` on an `Err` value: Os { code: 11, kind: WouldBlock, message: "Resource temporarily unavailable" }/1: write(2, " c a l l e d ` R e s u".., 123) = 123
/1: write(2, "\n", 1) = 1
/1: getcwd("/root", 512) = 0
stack backtrace:
(Some of the unwinding of the stack for the panic has been omitted)
Some system info collected while the situation was reproduced:
root@EVT22200005:/alan# mdb -ke ::memstat
Page Summary Pages MB %Tot
------------ ---------------- ---------------- ----
Kernel 7362138 28758 5%
Boot pages 13 0 0%
ZFS File Data 7805897 30491 6%
VMM Memory 67106304 262134 50%
Anon 3529813 13788 3%
Exec and libs 316802 1237 0%
Page cache 1302146 5086 1%
Free (cachelist) 365420 1427 0%
Free (freelist) 46421258 181333 35%
Total 134209791 524256
Physical 134209789 524256
root@EVT22200005:/alan# kstat -p | grep anon_alloc
memory_cap:0:global:anon_alloc_fail 0
memory_cap:1:sidecar_softnpu:anon_alloc_fail 0
memory_cap:2:oxz_switch:anon_alloc_fail 0
memory_cap:3:oxz_internal_dns_bb8352b6-e79d:anon_alloc_fail 0
memory_cap:4:oxz_internal_dns_f0141888-529d:anon_alloc_fail 0
memory_cap:5:oxz_internal_dns_3bf01041-c387:anon_alloc_fail 0
memory_cap:6:oxz_ntp_619a27b2-27f6-4306-a73:anon_alloc_fail 0
memory_cap:7:oxz_crucible_pantry_8316d541-4:anon_alloc_fail 0
memory_cap:8:oxz_crucible_pantry_593166c4-9:anon_alloc_fail 0
memory_cap:9:oxz_crucible_06f535c7-bdd5-458:anon_alloc_fail 0
memory_cap:10:oxz_crucible_271981d4-482a-41b:anon_alloc_fail 0
memory_cap:11:oxz_external_dns_2b243a9e-dfdf:anon_alloc_fail 0
memory_cap:12:oxz_crucible_18e85760-0da4-436:anon_alloc_fail 0
memory_cap:13:oxz_crucible_52624c54-730e-434:anon_alloc_fail 0
memory_cap:14:oxz_external_dns_147ab7b3-99eb:anon_alloc_fail 0
memory_cap:15:oxz_crucible_248a6753-ccc1-4dc:anon_alloc_fail 0
memory_cap:16:oxz_crucible_pantry_1ae7c800-5:anon_alloc_fail 0
memory_cap:17:oxz_crucible_a2f16190-d0cf-43a:anon_alloc_fail 0
memory_cap:18:oxz_crucible_6b82f85e-cdeb-43d:anon_alloc_fail 0
memory_cap:19:oxz_oximeter_6a55dde1-341c-42c:anon_alloc_fail 0
memory_cap:20:oxz_nexus_80d3ed1f-2a5b-4096-b:anon_alloc_fail 0
memory_cap:21:oxz_nexus_e7ecba3d-af48-45ce-b:anon_alloc_fail 0
memory_cap:22:oxz_nexus_1bf7775d-d80c-46d7-8:anon_alloc_fail 0
memory_cap:23:oxz_cockroachdb_3a16bd49-5487-:anon_alloc_fail 0
memory_cap:24:oxz_cockroachdb_ae4afcf6-5b8a-:anon_alloc_fail 0
memory_cap:25:oxz_cockroachdb_56231e0f-1f5e-:anon_alloc_fail 0
memory_cap:26:oxz_cockroachdb_7f9a67ff-07d4-:anon_alloc_fail 0
memory_cap:27:oxz_clickhouse_73005eac-0b4e-4:anon_alloc_fail 0
memory_cap:28:oxz_cockroachdb_90a8e6de-ae5d-:anon_alloc_fail 0
root@EVT22200005:/alan# swap -lh
swapfile dev swaplo blocks free
/dev/zvol/dsk/oxi_a462a7f7-b628-40fe-80ff-4e4189e2d62b/swap 171,1 4K 64.00G 64.00G
root@EVT22200005:/alan# swap -sh
total: 19.69G allocated + 2.18G reserved = 21.87G used, 234.31G available
So, I think we are failing in lwp_create(). Two things tell me this:
1) If I look at the kstats for this zone, when I get a failure I see a counter being bumped:
root@oxz_crucible_06f535c7:~# kstat zones:9:
module: zones instance: 9
name: oxz_crucible_06f535c7-bdd5-458 class: zone_misc
avenrun_15min 0
avenrun_1min 0
avenrun_5min 0
boot_time 1708998571
crtime 958.623229280
forkfail_cap 0
forkfail_misc 0
forkfail_nomem 1121 <---- This counter goes up
forkfail_noproc 0
init_pid 13796
nested_interp 0
nsec_sys 139514148222
nsec_user 46274023842
nsec_waitrq 6899942910
snaptime 169666.815724164
zonename oxz_crucible_06f535c7-bdd5-4582-9582-5e420725a010
2) If I look at DTrace probes, I can see lwp_create() return NULL in the progeny of my calling shell.
Get the PID of the calling shell:
root@oxz_crucible_06f535c7:~# echo $$
29098
Use that to trace just the children of the shell where I will run the program that reproduces the issue:
root@EVT22200005:/alan# dtrace -n 'fbt::lwp_create:return/progenyof(29098)/ { @[execname,arg1 != NULL] = count(); }'
dtrace: description 'fbt::lwp_create:return' matched 1 probe
^C
crucible-downsta. 0 12
bash 1 16
crucible-downsta. 1 504
So we are pretty sure lwp_create() is failing; now we need to know why.
Looking at the lwp_create() code in question, there are two places where that counter is bumped:
	} else if (stksize == lwp_default_stksize) {
		/*
		 * Try to reuse an <lwp,stack> from the LWP deathrow.
		 */
		if (lwp_reapcnt > 0) {
			mutex_enter(&reaplock);
			if ((t = lwp_deathrow) != NULL) {
				ASSERT(t->t_swap);
				lwp_deathrow = t->t_forw;
				lwp_reapcnt--;
				lwpdata = t->t_swap;
				lwp = t->t_lwp;
				ctx = t->t_ctx;
				t->t_swap = NULL;
				t->t_lwp = NULL;
				t->t_ctx = NULL;
				reapq_move_lq_to_tq(t);
			}
			mutex_exit(&reaplock);
			if (lwp != NULL) {
				lwp_stk_fini(lwp);
			}
			if (ctx != NULL) {
				freectx_ctx(ctx);
			}
		}
		if (lwpdata == NULL &&
		    (lwpdata = (caddr_t)segkp_cache_get(segkp_lwp)) == NULL) {
			mutex_enter(&p->p_lock);
			mutex_enter(&p->p_zone->zone_nlwps_lock);
			p->p_task->tk_nlwps--;
			p->p_task->tk_proj->kpj_nlwps--;
			p->p_zone->zone_nlwps--;
			mutex_exit(&p->p_zone->zone_nlwps_lock);
			mutex_exit(&p->p_lock);
			atomic_inc_32(&p->p_zone->zone_ffnomem);
			return (NULL);
		}
	} else {
		stksize = roundup(stksize, PAGESIZE);
		if ((lwpdata = (caddr_t)segkp_get(segkp, stksize,
		    (KPD_NOWAIT | KPD_HASREDZONE | KPD_LOCKED))) == NULL) {
			mutex_enter(&p->p_lock);
			mutex_enter(&p->p_zone->zone_nlwps_lock);
			p->p_task->tk_nlwps--;
			p->p_task->tk_proj->kpj_nlwps--;
			p->p_zone->zone_nlwps--;
			mutex_exit(&p->p_zone->zone_nlwps_lock);
			mutex_exit(&p->p_lock);
			atomic_inc_32(&p->p_zone->zone_ffnomem);
			return (NULL);
		}
	}
Continued. We need to determine which place in lwp_create() our error comes from (thanks @rmustacc for the help with these DTrace calls).
Reproducing our issue while running this DTrace one liner tells us which path we have taken:
dtrace -n 'fbt::segkp_cache_get:return,fbt::segkp_get:return/progenyof(29098)/ { @[probefunc,arg1 != NULL] = count(); }'
dtrace: description 'fbt::segkp_cache_get:return,fbt::segkp_get:return' matched 2 probes
^C
segkp_cache_get 0 71
segkp_cache_get 1 185
So, from that we only see segkp_cache_get (both succeeding and failing), and no segkp_get at all.
So, how many threads does each downstairs have?
root@EVT22200005:/alan# pflags 12587 | wc -l
133
Roughly 133. And how many downstairs are running here?
root@EVT22200005:~# ps -ef | grep crucible-downstairs | wc -l
505
So, 505*133=67165 threads for just crucible downstairs.
What is the actual count in the kernel:
root@EVT22200005:/alan# mdb -k
Loading modules: [ unix genunix specfs dtrace mac cpu.generic apix cpc crypto mm random smbios zfs sata ip hook neti sockfs lofs vmm scsi_vhci arp ufs ipcc logindmux nsmb ptm ]
> ::walk thread ! wc -l
83412
While we are here, let's look at a specific line from ::kmastat:
> ::kmastat
... snip ...
name in use total import succeed fail
----- ----------- ----------- ------ -------- ----
segkp 2075279360B 2147483648B 0B 24837798 1419
We are running out of segkp, and that is causing things to fail.
Either the downstairs needs to go on a thread diet, or we need to prevent more than ~500 total downstairs from running on a single sled, or we need a larger segkp.
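As a sanity check on the math: if each LWP's kernel stack consumes roughly 24 KiB of segkp (an assumed figure, though it is consistent with 2075279360 B in use across ~83412 threads), the 2 GiB segkp arena caps out right around the thread counts observed above:

```rust
fn main() {
    let downstairs: u64 = 505;            // crucible-downstairs processes (ps | wc -l)
    let threads_each: u64 = 133;          // LWPs per downstairs (pflags | wc -l)
    let segkp_total: u64 = 2_147_483_648; // segkp arena size from ::kmastat

    // Assumed per-LWP segkp footprint: default kernel stack plus red
    // zone, roughly 24 KiB (hypothetical figure for illustration).
    let per_lwp: u64 = 24 * 1024;

    let downstairs_lwps = downstairs * threads_each;
    println!("downstairs LWPs: {downstairs_lwps}");            // 67165
    println!("segkp LWP ceiling: ~{}", segkp_total / per_lwp); // ~87381
}
```

With ~83412 threads in the kernel overall, an ~87000-thread ceiling leaves very little headroom, which matches the intermittent EAGAIN failures.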
A specific next-step issue resulting from this is here: https://github.com/oxidecomputer/crucible/issues/1184
Fixed in 1189
Running on a bench gimlet, I set up a loop of
oxide disk create ...
and let it run till something broke. I was able to create ~150 1 GiB disks before I started getting errors. Once I was getting errors, I found I was not able to delete most disks either, though other things did appear to still be working in the UI and the CLI.