oxidecomputer / crucible

A storage service.

Crucible agent dies under load with: Stopping because all processes in service exited #1177

Closed leftwo closed 7 months ago

leftwo commented 7 months ago

Running on a bench gimlet, I set up a loop of oxide disk create ... and let it run until something broke. I was able to create ~150 1 GiB disks before I started getting errors.

Once I was getting errors, I found I could not delete most disks either, though other things still appeared to be working in the UI and the CLI.

leftwo commented 7 months ago

The bench gimlet is using 7 of the real SSDs in the box, and has a crucible zone for each of these 7.

Some crucible zones do report that services have failed:

root@EVT22200005:/alan# svcs -xZ
svc:/oxide/crucible/downstairs:downstairs-9d65c793-105b-49e3-bbeb-440afe05bc24 (Oxide Crucible Downstairs)
  Zone: oxz_crucible_18e85760-0da4-4369-943d-9927e7569d54
 State: maintenance since Wed Feb 21 02:36:34 2024
Reason: Method failed repeatedly.
   See: http://illumos.org/msg/SMF-8000-8Q
   See: /pool/ext/31bd71cd-4736-4a12-a387-9b74b050396f/crypt/zone/oxz_crucible_18e85760-0da4-4369-943d-9927e7569d54/root/var/svc/log/oxide-crucible-downstairs:downstairs-9d65c793-105b-49e3-bbeb-440afe05bc24.log
Impact: This service is not running.

svc:/network/physical:default (physical network interfaces)
  Zone: oxz_crucible_a2f16190-d0cf-43a7-87bb-49bbddfa5c9a
 State: offline since Wed Feb 21 02:43:24 2024
Reason: Start method is running.
   See: http://illumos.org/msg/SMF-8000-C4
   See: ifconfig(8)
   See: /pool/ext/f4b4dc87-ab46-49fb-a4b4-d361ae214c03/crypt/zone/oxz_crucible_a2f16190-d0cf-43a7-87bb-49bbddfa5c9a/root/var/svc/log/network-physical:default.log
Impact: This service is not running.

svc:/oxide/crucible/agent:default (Oxide Crucible Downstairs)
  Zone: oxz_crucible_a2f16190-d0cf-43a7-87bb-49bbddfa5c9a
 State: offline since Wed Feb 21 02:43:24 2024
Reason: Start method is running.
   See: http://illumos.org/msg/SMF-8000-C4
   See: /pool/ext/f4b4dc87-ab46-49fb-a4b4-d361ae214c03/crypt/zone/oxz_crucible_a2f16190-d0cf-43a7-87bb-49bbddfa5c9a/root/var/svc/log/oxide-crucible-agent:default.log
Impact: This service is not running.

svc:/oxide/crucible/downstairs:downstairs-94ac7ea8-f434-41b9-8389-771a2f9754c4 (Oxide Crucible Downstairs)
  Zone: oxz_crucible_a2f16190-d0cf-43a7-87bb-49bbddfa5c9a
 State: maintenance since Wed Feb 21 02:36:34 2024
Reason: Method failed repeatedly.
   See: http://illumos.org/msg/SMF-8000-8Q
   See: /pool/ext/f4b4dc87-ab46-49fb-a4b4-d361ae214c03/crypt/zone/oxz_crucible_a2f16190-d0cf-43a7-87bb-49bbddfa5c9a/root/var/svc/log/oxide-crucible-downstairs:downstairs-94ac7ea8-f434-41b9-8389-771a2f9754c4.log
Impact: This service is not running.

I would expect problems trying to delete a disk if the agent had died in a zone that contained one of that disk's downstairs.

From the above, I see one zone with a failed agent: oxz_crucible_a2f16190-d0cf-43a7-87bb-49bbddfa5c9a. I see another zone with a failed downstairs: svc:/oxide/crucible/downstairs:downstairs-9d65c793-105b-49e3-bbeb-440afe05bc24, which is in Zone: oxz_crucible_18e85760-0da4-4369-943d-9927e7569d54.

leftwo commented 7 months ago

Agent logs at: /staff/core/issues/crucible-1177

Looking at the agent log, I can see disk creations flowing through it.

However, at this spot, we have a problem:

02:36:50.006Z INFO crucible-agent (datafile): region 1df00af0-43d8-4f3e-bfb9-5a687b1247d7 state: Requested
02:36:50.010Z INFO crucible-agent (dropshot): request completed
    latency_us = 5085
    local_addr = [fd00:1122:3344:101::18]:32345
    method = POST
    remote_addr = [fd00:1122:3344:101::a]:63016
    req_id = 3b94b433-aaf2-497d-a041-28a1c50982f6
    response_code = 200
    uri = /crucible/0/regions
02:36:50.010Z INFO crucible-agent (worker): Region size:1073741824 reservation:1342177280 quota:3221225472
02:36:50.030Z INFO crucible-agent (worker): zfs set reservation of 1342177280 for oxp_f4b4dc87-ab46-49fb-a4b4-d361ae214c03/crucible/regions/1df00af0-43d8-4f3e-bfb9-5a687b1247d7
02:36:50.030Z INFO crucible-agent (worker): zfs set quota of 3221225472 for oxp_f4b4dc87-ab46-49fb-a4b4-d361ae214c03/crucible/regions/1df00af0-43d8-4f3e-bfb9-5a687b1247d7
02:36:50.122Z INFO crucible-agent (worker): creating region Region { id: RegionId("1df00af0-43d8-4f3e-bfb9-5a687b1247d7"), state: Requested, block_size: 4096, extent_size: 16384, extent_count: 16, encrypted: true, port_number: 19075, cert_pem: None, key_pem: None, root_pem: None } at "/data/regions/1df00af0-43d8-4f3e-bfb9-5a687b1247d7"
    region = 1df00af0-43d8-4f3e-bfb9-5a687b1247d7
02:36:50.215Z INFO crucible-agent (dropshot): request completed
    latency_us = 176
    local_addr = [fd00:1122:3344:101::18]:32345
    method = POST
    remote_addr = [fd00:1122:3344:101::a]:63016
    req_id = d71f80f6-e867-4037-93b9-fc192119143a
    response_code = 200
    uri = /crucible/0/regions
[ Feb 21 02:36:50 Stopping because all processes in service exited. ]
[ Feb 21 02:36:50 Executing start method ("/opt/oxide/lib/svc/manifest/crucible/agent.sh"). ]

We were creating a region, then something happened to the agent and the service restarted. I did not find any core files anywhere in the zone (or the system at large), so there might be a bug in the coreadm setup somewhere. Back to the agent: it starts right back up, accepts another request to create a region, then dies:

02:36:51.020Z INFO crucible-agent: listen IP: [fd00:1122:3344:101::18]:32345
02:36:51.020Z INFO crucible-agent: SMF instance name downstairs_prefix: "downstairs"
02:36:51.178Z INFO crucible-agent (dropshot): listening
    local_addr = [fd00:1122:3344:101::18]:32345
02:36:51.187Z INFO crucible-agent (worker): Region size:1073741824 reservation:1342177280 quota:3221225472
02:36:51.223Z INFO crucible-agent (worker): creating region Region { id: RegionId("1df00af0-43d8-4f3e-bfb9-5a687b1247d7"), state: Requested, block_size: 4096, extent_size: 16384, extent_count: 16, encrypted: true, port_number: 19075, cert_pem: None, key_pem: None, root_pem: None } at "/data/regions/1df00af0-43d8-4f3e-bfb9-5a687b1247d7"
    region = 1df00af0-43d8-4f3e-bfb9-5a687b1247d7
[ Feb 21 02:36:51 Stopping because all processes in service exited. ]
[ Feb 21 02:36:51 Executing start method ("/opt/oxide/lib/svc/manifest/crucible/agent.sh"). ]

The agent restarts again, and this time is able to create and destroy a few regions before again running into trouble:

02:36:52.085Z INFO crucible-agent: SMF instance name downstairs_prefix: "downstairs"
02:36:52.239Z INFO crucible-agent (dropshot): listening
    local_addr = [fd00:1122:3344:101::18]:32345
02:36:52.249Z INFO crucible-agent (worker): Region size:1073741824 reservation:1342177280 quota:3221225472
02:36:52.285Z INFO crucible-agent (worker): creating region Region { id: RegionId("1df00af0-43d8-4f3e-bfb9-5a687b1247d7"), state: Requested, block_size: 4096, extent_size: 16384, extent_count: 16, encrypted: true, port_number: 19075, cert_pem: None, key_pem: None, root_pem: None } at "/data/regions/1df00af0-43d8-4f3e-bfb9-5a687b1247d7"
    region = 1df00af0-43d8-4f3e-bfb9-5a687b1247d7
02:36:52.367Z INFO crucible-agent (worker): region files created ok
    region = 1df00af0-43d8-4f3e-bfb9-5a687b1247d7
02:36:52.367Z INFO crucible-agent (datafile): region 1df00af0-43d8-4f3e-bfb9-5a687b1247d7 state: Requested -> Created

...

02:37:22.017Z INFO crucible-agent (datafile): region e327420e-dd0c-4508-a12f-22b894e15289 state: Requested
02:37:22.020Z INFO crucible-agent (dropshot): request completed
    latency_us = 4907
    local_addr = [fd00:1122:3344:101::18]:32345
    method = POST
    remote_addr = [fd00:1122:3344:101::a]:50626
    req_id = 0e137e78-90f9-4c23-8eef-64211d65ad70
    response_code = 200
    uri = /crucible/0/regions
02:37:22.021Z INFO crucible-agent (worker): Region size:1073741824 reservation:1342177280 quota:3221225472
02:37:22.040Z INFO crucible-agent (worker): zfs set reservation of 1342177280 for oxp_f4b4dc87-ab46-49fb-a4b4-d361ae214c03/crucible/regions/e327420e-dd0c-4508-a12f-22b894e15289
02:37:22.040Z INFO crucible-agent (worker): zfs set quota of 3221225472 for oxp_f4b4dc87-ab46-49fb-a4b4-d361ae214c03/crucible/regions/e327420e-dd0c-4508-a12f-22b894e15289
02:37:22.128Z INFO crucible-agent (worker): creating region Region { id: RegionId("e327420e-dd0c-4508-a12f-22b894e15289"), state: Requested, block_size: 4096, extent_size: 16384, extent_count: 16, encrypted: true, port_number: 19075, cert_pem: None, key_pem: None, root_pem: None } at "/data/regions/e327420e-dd0c-4508-a12f-22b894e15289"
    region = e327420e-dd0c-4508-a12f-22b894e15289
02:37:22.187Z INFO crucible-agent (dropshot): request completed
    latency_us = 106
    local_addr = [fd00:1122:3344:101::18]:32345
    method = POST
    remote_addr = [fd00:1122:3344:101::a]:50626
    req_id = eedae8c8-2716-4a18-82e2-06907f3c1a33
    response_code = 200
    uri = /crucible/0/regions
[ Feb 21 02:37:22 Stopping because all processes in service exited. ]
[ Feb 21 02:37:22 Executing start method ("/opt/oxide/lib/svc/manifest/crucible/agent.sh"). ]

leftwo commented 7 months ago

Looking at the downstairs that failed, it appears it was out of resources:

root@oxz_crucible_a2f16190:~# cat /var/svc/log/oxide-crucible-downstairs:downstairs-94ac7ea8-f434-41b9-8389-771a2f9754c4.log.5
[ Feb 21 02:36:34 Disabled. ]                                                     
[ Feb 21 02:36:34 Enabled. ]        
[ Feb 21 02:36:34 Executing start method ("/opt/oxide/lib/svc/manifest/crucible/downstairs.sh"). ]
thread 'main' panicked at /home/alan/.cargo/registry/src/index.crates.io-6f17d22bb
a15001f/tokio-1.36.0/src/runtime/scheduler/multi_thread/worker.rs:447:13:
OS can't spawn worker thread: Resource temporarily unavailable (os error 11)  
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace 
[ Feb 21 02:36:34 Stopping because all processes in service exited. ] 
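
For context on that panic: thread creation is fallible at the std level, and tokio unwraps the result when growing its worker pool, which turns EAGAIN into exactly this abort. A minimal sketch (my illustration, not Crucible or tokio code) of how the error surfaces:

use std::thread;

fn main() {
    // std::thread::Builder::spawn returns io::Result, so the caller sees
    // EAGAIN ("Resource temporarily unavailable", os error 11) as an Err
    // once the OS refuses to create more LWPs. tokio's scheduler unwraps
    // this internally, producing the panic in the log above.
    //
    // Note: this deliberately runs until thread creation fails.
    let mut handles = Vec::new();
    loop {
        match thread::Builder::new().spawn(|| thread::park()) {
            Ok(h) => handles.push(h),
            Err(e) => {
                eprintln!("spawn failed after {} threads: {}", handles.len(), e);
                break;
            }
        }
    }
}
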
leftwo commented 7 months ago

After an update to Omicron and Helios bits, I slowly created disks until I saw a failure.

Initially, I was not getting a core file when the agent would dump core. I updated coreadm inside the crucible zone and set it to dump core in /tmp, and with that I was able to generate a few cores.
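
For reference, the coreadm change was along these lines (my reconstruction of the invocation, not a transcript of the session):

coreadm -g /tmp/core.%f.%p -e global

(%f expands to the executable name and %p to the PID.)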

From one, here is the stack trace:

> ::status
debugging core file of crucible-downst (64-bit) from oxz_crucible_52624c54-730e-434b-a8fe-bc274b0e1068
file: /opt/oxide/crucible/bin/crucible-downstairs
initial argv: /opt/oxide/crucible/bin/crucible-downstairs create --uuid a5ea81c0-cb1e-40e5-95
threading model: native threads
status: process terminated by SIGABRT (Abort), pid=7137 uid=0 code=-1
> $C ! /tmp/demangle
fffff9ffffdfab70 libc.so.1`_lwp_kill+0xa()
fffff9ffffdfaba0 libc.so.1`raise+0x22(6)
fffff9ffffdfabf0 libc.so.1`abort+0x58()
fffff9ffffdfac00 panic_abort::__rust_start_panic::abort::h97b10551a860ee16+9()
fffff9ffffdfac10 __rust_start_panic+9()
fffff9ffffdfac70 rust_panic+0xd()
fffff9ffffdfad30 std::panicking::rust_panic_with_hook::h1dc68ec7979d63c5+0x2af()
fffff9ffffdfad70 std::panicking::begin_panic_handler::{{closure}}::h73dc9943db337240+0xa0()
fffff9ffffdfad80 std::sys_common::backtrace::__rust_end_short_backtrace::hd0b2ffcbdcddf45c+9()
fffff9ffffdfadb0 rust_begin_unwind+0x44()
fffff9ffffdfadf0 core::panicking::panic_fmt::hf2cd841661acd9ba+0x35()
fffff9ffffdfae70 core::result::unwrap_failed::hd6b81b0a1c6c351a+0x75()
fffff9ffffdfaf50 slog_async::AsyncCoreBuilder<D>::build_no_guard::h3a7942f4d314a4e8+0x1f4()
fffff9ffffdfb4c0 crucible_common::build_logger::hc3ee6a1c9781a657+0x296()
fffff9ffffdfcba0 crucible_downstairs::main::_$u7b$$u7b$closure$u7d$$u7d$::h95ff0b004965dd2d (.llvm.2801909585092955357)+0xf3()
fffff9ffffdfd740 tokio::runtime::park::CachedParkThread::block_on::h4f390498fd03c179+0x99()
fffff9ffffdfe320 tokio::runtime::context::runtime::enter_runtime::hd57aa73e2b8e1b3e+0x145()
fffff9ffffdfeeb0 tokio::runtime::runtime::Runtime::block_on::he75572bbeaa878ca+0x5d()
fffff9ffffdffb80 crucible_downstairs::main::hd590936a926cbd1f+0x94()
fffff9ffffdffb90 std::sys_common::backtrace::__rust_begin_short_backtrace::h2b86b629e7b17e25+6()
fffff9ffffdffbf0 std::rt::lang_start::{{closure}}::hb1df1666a90959b0+0x11()
fffff9ffffdffcd0 std::rt::lang_start_internal::h8fd2fe8457b30241+0x2c9()
fffff9ffffdffcf0 std::rt::lang_start::h132a478d446dfe70+0x25()
fffff9ffffdffd20 _start_crt+0x87()
fffff9ffffdffd30 _start+0x18()

Note that while I did get a core file, I did not get any messages in the agent log indicating there was a core, other than seeing that the service had restarted.

Here is the agent log before and after the core:

05:07:43.952Z INFO crucible-agent (dropshot): listening
    local_addr = [fd00:1122:3344:101::15]:32345
05:07:43.961Z INFO crucible-agent (worker): Region size:1073741824 reservation:1342177280 quota:3221225472
05:07:43.996Z INFO crucible-agent (worker): creating region Region { id: RegionId("a5ea81c0-cb1e-40e5-9583-6a01c40a2736"), state: Requested, block_size: 4096, extent_size: 16384, extent_count: 16, encrypted: true, port_number: 19062, cert_pem: None, key_pem: None, root_pem: None } at "/data/regions/a5ea81c0-cb1e-40e5-9583-6a01c40a2736"
    region = a5ea81c0-cb1e-40e5-9583-6a01c40a2736
05:07:43.996Z INFO crucible-agent (worker): command is: env -i "/opt/oxide/crucible/bin/crucible-downstairs" "create" "--uuid" "a5ea81c0-cb1e-40e5-9583-6a01c40a2736" "--data" "/data/regions/a5ea81c0-cb1e-40e5-9583-6a01c40a2736" "--block-size" "4096" "--extent-size" "16384" "--extent-count" "16" "--encrypted"
    region = a5ea81c0-cb1e-40e5-9583-6a01c40a2736
[ Feb 24 05:07:44 Stopping because all processes in service exited. ]

(Note: the log line with "command is: " was added for debugging and is not present in the source.)
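
For what it's worth, a line in that shape falls out of std::process::Command's Debug impl, which quotes each argument and renders a cleared environment as "env -i". A sketch (my illustration, not necessarily how the agent's debug line was produced):

use std::process::Command;

fn main() -> std::io::Result<()> {
    // Command implements Debug; formatting with {:?} prints the program and
    // arguments quoted, with "env -i" prepended when env_clear() was used,
    // matching the "command is:" line in the log above.
    let mut cmd = Command::new("/opt/oxide/crucible/bin/crucible-downstairs");
    cmd.env_clear()
        .arg("create")
        .arg("--uuid")
        .arg("a5ea81c0-cb1e-40e5-9583-6a01c40a2736");
    println!("command is: {:?}", cmd);
    let status = cmd.status()?;
    println!("exit status: {status}");
    Ok(())
}
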

leftwo commented 7 months ago

I'm able to reproduce (sometimes taking multiple tries) a failure in crucible-downstairs create, which is one of the commands the crucible-agent will run as a subprocess.

If I truss the failing command, it looks like this:

root@oxz_crucible_06f535c7:~# truss /opt/oxide/crucible/bin/crucible-downstairs create --uuid 9403b3c6-a59c-4209-b765-421a4c1886f3 --data /data/regions/9403b3c6-a59c-4209-b765-421a4c1886f3 --block-size 4096 --extent-size 16384 --extent-count 240 --encrypted

And one of the failures looks like this:

/102:   yield()                                         = 0
/104:   sigaltstack(0xFFFFF9FFD3DFFF30, 0x00000000)     = 0
/103:   yield()                                         = 0
/102:   yield()                                         = 0
/100:   yield()                                         = 0
/103:   yield()                                         = 0
/100:   yield()                                         = 0
/101:   open("/proc/self/lwp/101/lwpname", O_WRONLY)    = 10
/104:   open("/proc/self/lwp/104/lwpname", O_WRONLY)    = 11
/103:   yield()                                         = 0
/102:   yield()                                         = 0
/101:   write(10, " t o k i o - r u n t i m".., 21)     = 21
/104:   write(11, " t o k i o - r u n t i m".., 21)     = 21
/101:   close(10)                                       = 0
/104:   close(11)                                       = 0
/1:     lwp_create(0xFFFFF9FFFFDFE240, LWP_SUSPENDED, 0xFFFFF9FFFFDFE23C) = 105
truss: cannot create lwp to follow child lwp

The other failure I see is from crucible-downstairs itself. Its truss output looks like this:

/1:     uucopy(0xFFFFF9FFFFDFE130, 0xFFFFF9FFE01FFFE8, 24) = 0                                    
/1:     lwp_create(0xFFFFF9FFFFDFE240, LWP_SUSPENDED, 0xFFFFF9FFFFDFE23C) Err#11 EAGAIN
/1:     uucopy(0xFFFFF9FFFFDFE130, 0xFFFFF9FFE01FFFE8, 24) = 0              
/1:     lwp_create(0xFFFFF9FFFFDFE240, LWP_SUSPENDED, 0xFFFFF9FFFFDFE23C) Err#11 EAGAIN                                                                                                             
/1:     uucopy(0xFFFFF9FFFFDFE130, 0xFFFFF9FFE01FFFE8, 24) = 0
/1:     lwp_create(0xFFFFF9FFFFDFE240, LWP_SUSPENDED, 0xFFFFF9FFFFDFE23C) Err#11 EAGAIN
/1:     uucopy(0xFFFFF9FFFFDFE130, 0xFFFFF9FFE01FFFE8, 24) = 0
/1:     lwp_create(0xFFFFF9FFFFDFE240, LWP_SUSPENDED, 0xFFFFF9FFFFDFE23C) Err#11 EAGAIN
/1:     uucopy(0xFFFFF9FFFFDFE130, 0xFFFFF9FFE01FFFE8, 24) = 0                                    
/1:     lwp_create(0xFFFFF9FFFFDFE240, LWP_SUSPENDED, 0xFFFFF9FFFFDFE23C) Err#11 EAGAIN
/1:     uucopy(0xFFFFF9FFFFDFE130, 0xFFFFF9FFE01FFFE8, 24) = 0
/1:     lwp_create(0xFFFFF9FFFFDFE240, LWP_SUSPENDED, 0xFFFFF9FFFFDFE23C) Err#11 EAGAIN
/1:     uucopy(0xFFFFF9FFFFDFE130, 0xFFFFF9FFE01FFFE8, 24) = 0                                  
/1:     lwp_create(0xFFFFF9FFFFDFE240, LWP_SUSPENDED, 0xFFFFF9FFFFDFE23C) Err#11 EAGAIN
/1:     uucopy(0xFFFFF9FFFFDFE130, 0xFFFFF9FFE01FFFE8, 24) = 0
/1:     lwp_create(0xFFFFF9FFFFDFE240, LWP_SUSPENDED, 0xFFFFF9FFFFDFE23C) Err#11 EAGAIN
/1:     uucopy(0xFFFFF9FFFFDFE130, 0xFFFFF9FFE01FFFE8, 24) = 0
/1:     lwp_create(0xFFFFF9FFFFDFE240, LWP_SUSPENDED, 0xFFFFF9FFFFDFE23C) Err#11 EAGAIN
/1:     uucopy(0xFFFFF9FFFFDFE130, 0xFFFFF9FFE01FFFE8, 24) = 0              
/1:     lwp_create(0xFFFFF9FFFFDFE240, LWP_SUSPENDED, 0xFFFFF9FFFFDFE23C) Err#11 EAGAIN           
/1:     uucopy(0xFFFFF9FFFFDFE130, 0xFFFFF9FFE01FFFE8, 24) = 0
/1:     lwp_create(0xFFFFF9FFFFDFE240, LWP_SUSPENDED, 0xFFFFF9FFFFDFE23C) Err#11 EAGAIN
/1:     uucopy(0xFFFFF9FFFFDFE130, 0xFFFFF9FFE01FFFE8, 24) = 0                                                                                                                                      
/1:     lwp_create(0xFFFFF9FFFFDFE240, LWP_SUSPENDED, 0xFFFFF9FFFFDFE23C) Err#11 EAGAIN           
/1:     uucopy(0xFFFFF9FFFFDFE130, 0xFFFFF9FFE01FFFE8, 24) = 0
/1:     lwp_create(0xFFFFF9FFFFDFE240, LWP_SUSPENDED, 0xFFFFF9FFFFDFE23C) Err#11 EAGAIN
/1:     uucopy(0xFFFFF9FFFFDFE130, 0xFFFFF9FFE01FFFE8, 24) = 0
/1:     lwp_create(0xFFFFF9FFFFDFE240, LWP_SUSPENDED, 0xFFFFF9FFFFDFE23C) Err#11 EAGAIN           
/1:     uucopy(0xFFFFF9FFFFDFE130, 0xFFFFF9FFE01FFFE8, 24) = 0
/1:     lwp_create(0xFFFFF9FFFFDFE240, LWP_SUSPENDED, 0xFFFFF9FFFFDFE23C) Err#11 EAGAIN
/1:     brk(0x02135000)                                 = 0                 
/1:     ioctl(1, TCGETA, 0xFFFFF9FFFFDFADB0)            = 0        
/1:     stat("/root/.terminfo", 0xFFFFF9FFFFDFA770)     Err#2 ENOENT
/1:     stat("/etc/terminfo", 0xFFFFF9FFFFDFA770)       Err#2 ENOENT                              
/1:     stat("/lib/terminfo", 0xFFFFF9FFFFDFA770)       Err#2 ENOENT
/1:     stat("/usr/share/terminfo", 0xFFFFF9FFFFDFA770) = 0        
/1:     stat("/usr/share/terminfo/t/tmux-256color", 0xFFFFF9FFFFDFA770) Err#2 ENOENT
/1:     stat("/usr/share/terminfo/74/tmux-256color", 0xFFFFF9FFFFDFA770) Err#2 ENOENT
/1:     stat("/boot/system/data/terminfo", 0xFFFFF9FFFFDFA770) Err#2 ENOENT
/1:     ioctl(2, TCGETA, 0xFFFFF9FFFFDFAD70)            = 0                                       
/1:     brk(0x02145000)                                 = 0                            
/1:     uucopy(0xFFFFF9FFFFDFA6B0, 0xFFFFF9FFE01FFFE8, 24) = 0              
/1:     lwp_create(0xFFFFF9FFFFDFA7C0, LWP_SUSPENDED, 0xFFFFF9FFFFDFA7BC) Err#11 EAGAIN                                                                                                             
/1:     mmap(0x00010000, 65536, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANON|MAP_ALIGN, 4294967295, 0) = 0xFFFFF9FFEEBB0000
thread '/1:     write(2, " t h r e a d   '", 8)                 = 8                                                                                                                                 
main/1: write(2, " m a i n", 4)                         = 4
' panicked at /1:       write(2, " '   p a n i c k e d   a".., 14)      = 14
/home/alan/.cargo/registry/src/index.crates.io-6f17d22bba15001f/slog-async-2.8.0/lib.rs/1:      write(2, " / h o m e / a l a n / .".., 87)      = 87
:/1:    write(2, " :", 1)                               = 1
319/1:  write(2, " 3 1 9", 3)                           = 3
:/1:    write(2, " :", 1)                               = 1
14/1:   write(2, " 1 4", 2)                             = 2
:
/1:     write(2, " :\n", 2)                             = 2
called `Result::unwrap()` on an `Err` value: Os { code: 11, kind: WouldBlock, message: "Resource temporarily unavailable" }/1:  write(2, " c a l l e d   ` R e s u".., 123)     = 123

/1:     write(2, "\n", 1)                               = 1
/1:     getcwd("/root", 512)                            = 0
stack backtrace:

(Some of the unwinding of the stack for the panic has been omitted.)

leftwo commented 7 months ago

Some system info collected while the situation was reproduced:

root@EVT22200005:/alan# mdb -ke ::memstat 
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                    7362138             28758    5%
Boot pages                     13                 0    0%
ZFS File Data             7805897             30491    6%
VMM Memory               67106304            262134   50%
Anon                      3529813             13788    3%
Exec and libs              316802              1237    0%
Page cache                1302146              5086    1%
Free (cachelist)           365420              1427    0%
Free (freelist)          46421258            181333   35%

Total                   134209791            524256
Physical                134209789            524256
root@EVT22200005:/alan# kstat -p | grep anon_alloc
memory_cap:0:global:anon_alloc_fail     0
memory_cap:1:sidecar_softnpu:anon_alloc_fail    0
memory_cap:2:oxz_switch:anon_alloc_fail 0
memory_cap:3:oxz_internal_dns_bb8352b6-e79d:anon_alloc_fail     0
memory_cap:4:oxz_internal_dns_f0141888-529d:anon_alloc_fail     0
memory_cap:5:oxz_internal_dns_3bf01041-c387:anon_alloc_fail     0
memory_cap:6:oxz_ntp_619a27b2-27f6-4306-a73:anon_alloc_fail     0
memory_cap:7:oxz_crucible_pantry_8316d541-4:anon_alloc_fail     0
memory_cap:8:oxz_crucible_pantry_593166c4-9:anon_alloc_fail     0
memory_cap:9:oxz_crucible_06f535c7-bdd5-458:anon_alloc_fail     0
memory_cap:10:oxz_crucible_271981d4-482a-41b:anon_alloc_fail    0
memory_cap:11:oxz_external_dns_2b243a9e-dfdf:anon_alloc_fail    0
memory_cap:12:oxz_crucible_18e85760-0da4-436:anon_alloc_fail    0
memory_cap:13:oxz_crucible_52624c54-730e-434:anon_alloc_fail    0
memory_cap:14:oxz_external_dns_147ab7b3-99eb:anon_alloc_fail    0
memory_cap:15:oxz_crucible_248a6753-ccc1-4dc:anon_alloc_fail    0
memory_cap:16:oxz_crucible_pantry_1ae7c800-5:anon_alloc_fail    0
memory_cap:17:oxz_crucible_a2f16190-d0cf-43a:anon_alloc_fail    0
memory_cap:18:oxz_crucible_6b82f85e-cdeb-43d:anon_alloc_fail    0
memory_cap:19:oxz_oximeter_6a55dde1-341c-42c:anon_alloc_fail    0
memory_cap:20:oxz_nexus_80d3ed1f-2a5b-4096-b:anon_alloc_fail    0
memory_cap:21:oxz_nexus_e7ecba3d-af48-45ce-b:anon_alloc_fail    0
memory_cap:22:oxz_nexus_1bf7775d-d80c-46d7-8:anon_alloc_fail    0
memory_cap:23:oxz_cockroachdb_3a16bd49-5487-:anon_alloc_fail    0
memory_cap:24:oxz_cockroachdb_ae4afcf6-5b8a-:anon_alloc_fail    0
memory_cap:25:oxz_cockroachdb_56231e0f-1f5e-:anon_alloc_fail    0
memory_cap:26:oxz_cockroachdb_7f9a67ff-07d4-:anon_alloc_fail    0
memory_cap:27:oxz_clickhouse_73005eac-0b4e-4:anon_alloc_fail    0
memory_cap:28:oxz_cockroachdb_90a8e6de-ae5d-:anon_alloc_fail    0
root@EVT22200005:/alan# swap -lh
swapfile             dev    swaplo   blocks     free
/dev/zvol/dsk/oxi_a462a7f7-b628-40fe-80ff-4e4189e2d62b/swap 171,1        4K   64.00G   64.00G
root@EVT22200005:/alan# swap -sh
total: 19.69G allocated + 2.18G reserved = 21.87G used, 234.31G available
leftwo commented 7 months ago

So, I think we are failing in lwp_create(). Two things tell me this. First, if I look at kstats for this zone, I see a counter being bumped when I get a failure:

root@oxz_crucible_06f535c7:~# kstat zones:9:
module: zones                           instance: 9     
name:   oxz_crucible_06f535c7-bdd5-458  class:    zone_misc
        avenrun_15min                   0
        avenrun_1min                    0
        avenrun_5min                    0
        boot_time                       1708998571
        crtime                          958.623229280
        forkfail_cap                    0
        forkfail_misc                   0
        forkfail_nomem                  1121     <---- This counter goes up
        forkfail_noproc                 0
        init_pid                        13796
        nested_interp                   0
        nsec_sys                        139514148222
        nsec_user                       46274023842
        nsec_waitrq                     6899942910
        snaptime                        169666.815724164
        zonename                        oxz_crucible_06f535c7-bdd5-4582-9582-5e420725a010

And second, if I look at DTrace probes, I can see lwp_create() returning NULL in processes descended from my calling shell's PID.

Get the PID of the calling shell:

root@oxz_crucible_06f535c7:~# echo $$ 
29098

Use that to trace just the children of the shell where I will run the program that reproduces the issue (the progenyof() predicate limits the probe to descendants of that PID):

root@EVT22200005:/alan# dtrace -n 'fbt::lwp_create:return/progenyof(29098)/ { @[execname,arg1 != NULL] = count(); }'
dtrace: description 'fbt::lwp_create:return' matched 1 probe
^C
crucible-downsta.           0              12
            bash            1              16
crucible-downsta.           1             504

So, we are pretty sure lwp_create() is failing (the rows with 0 in the second column above are NULL returns); now we need to know why.

Looking at the lwp_create() code in question, there are two places where that counter (zone_ffnomem, which feeds the forkfail_nomem kstat) is bumped:

    } else if (stksize == lwp_default_stksize) {
        /*
         * Try to reuse an <lwp,stack> from the LWP deathrow.
         */
        if (lwp_reapcnt > 0) {
            mutex_enter(&reaplock);
            if ((t = lwp_deathrow) != NULL) {
                ASSERT(t->t_swap);
                lwp_deathrow = t->t_forw;
                lwp_reapcnt--;
                lwpdata = t->t_swap;
                lwp = t->t_lwp;
                ctx = t->t_ctx;
                t->t_swap = NULL;
                t->t_lwp = NULL;
                t->t_ctx = NULL;
                reapq_move_lq_to_tq(t);
            }
            mutex_exit(&reaplock);
            if (lwp != NULL) {
                lwp_stk_fini(lwp);
            }
            if (ctx != NULL) {
                freectx_ctx(ctx);
            }
        }
        if (lwpdata == NULL &&
            (lwpdata = (caddr_t)segkp_cache_get(segkp_lwp)) == NULL) {
            mutex_enter(&p->p_lock);
            mutex_enter(&p->p_zone->zone_nlwps_lock);
            p->p_task->tk_nlwps--;
            p->p_task->tk_proj->kpj_nlwps--;
            p->p_zone->zone_nlwps--;
            mutex_exit(&p->p_zone->zone_nlwps_lock);
            mutex_exit(&p->p_lock);
            atomic_inc_32(&p->p_zone->zone_ffnomem);
            return (NULL);
        }
    } else {
        stksize = roundup(stksize, PAGESIZE);
        if ((lwpdata = (caddr_t)segkp_get(segkp, stksize,
            (KPD_NOWAIT | KPD_HASREDZONE | KPD_LOCKED))) == NULL) {
            mutex_enter(&p->p_lock);
            mutex_enter(&p->p_zone->zone_nlwps_lock);
            p->p_task->tk_nlwps--;
            p->p_task->tk_proj->kpj_nlwps--;
            p->p_zone->zone_nlwps--;
            mutex_exit(&p->p_zone->zone_nlwps_lock);
            mutex_exit(&p->p_lock);
            atomic_inc_32(&p->p_zone->zone_ffnomem);
            return (NULL);
        }
    }
leftwo commented 7 months ago

Continued. We need to determine which of those places in lwp_create() our error comes from. (Thanks @rmustacc for the help with these DTrace calls.)

Reproducing our issue while running this DTrace one-liner tells us which path we have taken:

 dtrace -n 'fbt::segkp_cache_get:return,fbt::segkp_get:return/progenyof(29098)/ { @[probefunc,arg1 != NULL] = count(); }'
dtrace: description 'fbt::segkp_cache_get:return,fbt::segkp_get:return' matched 2 probes
^C

  segkp_cache_get                                           0               71
  segkp_cache_get                                           1              185

So, from that we only see segkp_cache_get and no segkp_get at all: we are on the default-stack-size path, and the failures (the 71 NULL returns) are segkp_cache_get coming up empty.

So, how many threads does each downstairs have?

root@EVT22200005:/alan# pflags 12587 | wc -l                                                      
     133      

Roughly 133. And how many downstairs are running here?

root@EVT22200005:~# ps -ef | grep crucible-downstairs | wc -l
     505

So, 505*133=67165 threads for just crucible downstairs.

What is the actual count in the kernel?

root@EVT22200005:/alan# mdb -k
Loading modules: [ unix genunix specfs dtrace mac cpu.generic apix cpc crypto mm random smbios zfs sata ip hook neti sockfs lofs vmm scsi_vhci arp ufs ipcc logindmux nsmb ptm ]
> ::walk thread ! wc -l 
   83412

While we are here, let's look at a specific line from kmastat:

> ::kmastat    
... snip ...
 name       in use        total  import   succeed  fail      
-----  -----------  -----------  ------  --------  ----
segkp  2075279360B  2147483648B      0B  24837798  1419  

We are running out of segkp (2075279360 of 2147483648 bytes in use, with 1419 failed allocations), and that is causing things to fail.

Either the downstairs needs to go on a thread diet, or we need to prevent more than 500 total downstairs from running on a single sled, or we need more segkp.
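
On the thread-diet option: tokio's thread counts are configurable per runtime. A sketch (my illustration, not current downstairs code) of capping them explicitly instead of taking the defaults (worker threads default to the number of CPUs, and the blocking pool can grow to 512 threads):

use tokio::runtime::Builder;

fn main() {
    // Bound both pools so each downstairs contributes a predictable number
    // of LWPs (and thus a predictable amount of segkp) to the zone.
    let runtime = Builder::new_multi_thread()
        .worker_threads(4)
        .max_blocking_threads(16)
        .enable_all()
        .build()
        .expect("failed to build tokio runtime");

    runtime.block_on(async {
        println!("running on a bounded runtime");
    });
}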

leftwo commented 7 months ago

A specific next-step issue resulting from this is here: https://github.com/oxidecomputer/crucible/issues/1184

leftwo commented 7 months ago

Fixed in #1189.