oxidecomputer / propolis

VMM userspace for illumos bhyve
Mozilla Public License 2.0
178 stars 22 forks source link

Propolis server panicked after hitting some `????????` rust_panic call stack #755

Open askfongjojo opened 2 months ago

askfongjojo commented 2 months ago

@augustuswm found a VM that had a propolis server panic. Here are the propolis log lines at the time:

18:41:28.222Z INFO propolis-server (vcpu_tasks): vCPU released from hold
    vcpu = 0
18:41:28.222Z INFO propolis-server (vm_state_worker): publishing new instance state
    gen = 9
    migration = InstanceMigrateStatusResponse { migration_in: None, migration_out: None }
    state = Running
18:41:28.222Z INFO propolis-server (vm_state_worker): State worker handled event
    outcome = Continue
18:41:28.222Z INFO propolis-server (vcpu_tasks): vCPU released from hold
    vcpu = 1
18:41:28.269Z INFO propolis-server: request completed
    latency_us = 55
    local_addr = [fd00:1122:3344:116::1:a6]:12400
    method = GET
    remote_addr = [fd00:1122:3344:116::1]:52233
    req_id = 8c5001b9-e06d-4606-acae-3fe5ab9cfdb0
    response_code = 200
    uri = /instance/state-monitor
18:41:28.637Z INFO propolis-server (ramfb): ramfb change
    config = Config { addr: 3198525440, fourcc: 875713112, flags: 0, width: 800, height: 600, stride: 3200 }
    state = valid
18:41:28.637Z INFO propolis-server (ramfb): notifying
18:41:28.637Z INFO propolis-server (vnc-server): pixel format set to fourcc=0x34325258
thread 'vcpu-0' panicked at bin/propolis-server/src/lib/vm/mod.rs:386:9:
vCPU 0: Unhandled VM exit: InstEmul(InstEmul { inst_data: [175, 175, 175, 175, 175, 175, 175, 175, 175, 175, 175, 175, 175, 175, 175], len: 15 })
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
[ Aug 12 18:41:34 Stopping because all processes in service exited. ]
[ Aug 12 18:41:34 Executing stop method (:kill). ]

Both the core file and propolis log can be found in /staff/rack3/BRM42220077/2024-08-29 (system-illumos-propolis-server:default.log.1723488297 and core.oxz_propolis-server_cdcb2537-a5ee-4e9d-97d8-d8a86b57d2dd.propolis-server.23389.1723488093.

@leftwo got the stack from the core file:

 fffff5ffef15b6ea _lwp_kill () + a                                                                                   
 fffff5ffef0ec6b2 raise (6) + 22                                                                                     
 fffff5ffef0c5dc8 abort () + 58                                                                                      
 0000000002592cd9 ???????? ()                                                                                        
 0000000002592cc9 ???????? ()                                                                                        
 000000000257b97d rust_panic () + d                                                                                  
 000000000257b636 std::panicking::rust_panic_with_hook::h0c682abf0664e68a () + 296                                   
 000000000257b362 std::panicking::begin_panic_handler::{{closure}}::h0d97730e7ba5f708 () + a2                        
 0000000002578db9 ???????? ()                             
 000000000257b0a6 ???????? ()                                                                                        
 00000000025bf522 ???????? ()                                                                                        
 00000000011957c8 ???????? ()                                                                                        
 000000000110f5ea ???????? ()                                                                                        
 000000000111412e core::ops::function::FnOnce::call_once{{vtable.shim}}::h19c41ab30f52a8dc () + 8e                   
 0000000002583dcb std::sys::pal::unix::thread::Thread::new::thread_start::h9ef84bfafda8297d () + 1b                  
 fffff5ffef154307 _thrp_setup (fffff5ffeef35240) + 77                                                                
 fffff5ffef154650 _lwp_start ()
askfongjojo commented 2 months ago

Also took a look at the crucible downstairs around that time and didn't see anything out of the ordinary.

root@oxz_switch1:~# omdb db disks list | grep 8a713b03-e6a2-4659-90f1-d3fce8a071ef
master-027138b2-8f92-4d70-8b31-59a-e9de02                    e45db0db-f9d8-47e6-9be1-3aef84da229f 10 GiB  attached                   8a713b03-e6a2-4659-90f1-d3fce8a071ef 

root@oxz_switch1:~# omdb db disks info e45db0db-f9d8-47e6-9be1-3aef84da229f
HOST_SERIAL DISK_NAME                                 INSTANCE_NAME PROPOLIS_ZONE                                            VOLUME_ID                            DISK_STATE 
BRM42220077 master-027138b2-8f92-4d70-8b31-59a-e9de02 master        oxz_propolis-server_cdcb2537-a5ee-4e9d-97d8-d8a86b57d2dd 10fac653-2a79-415b-bdf6-51703b502ac9 attached   
HOST_SERIAL REGION                               ZONE                                              PHYSICAL_DISK                        
BRM42220035 b5b90b8d-53b7-410a-81f2-f99bc9c41c2b oxz_crucible_3e598962-ef8c-4cb6-bdfe-ec8563939d6a 44af7b26-cb4f-4a38-bbbf-700159e4252a 
BRM42220013 ee2c2eda-9fc7-4e25-83f9-773fd33d478b oxz_crucible_475140fa-a5dc-4ec1-876d-751c48adfc37 d5345689-fdc8-4009-bef8-978d2492f302 
BRM42220089 60bb712a-798b-43e2-87e2-dae6a68f94fa oxz_crucible_de376149-aa45-4660-9ae6-15e8ba4a4233 517b2733-1261-4a3c-813b-f92a7b5eec66 

VCR from volume ID 10fac653-2a79-415b-bdf6-51703b502ac9
ID                                   BS  SUB_VOLUMES READ_ONLY_PARENT 
e45db0db-f9d8-47e6-9be1-3aef84da229f 512 1           false            

SUB VOLUME 0
    ID                                   BS  BPE    EC  GEN READ_ONLY 
    e45db0db-f9d8-47e6-9be1-3aef84da229f 512 131072 160 3   false     
    [fd00:1122:3344:10c::5]:19000
    [fd00:1122:3344:111::a]:19011
    [fd00:1122:3344:108::4]:19006
BRM42220089 # cat /pool/ext/dbfdc981-1b81-4d7d-9449-9530890b199a/crypt/debug/oxz_crucible_de376149-aa45-4660-9ae6-15e8ba4a4233/oxide-crucible-downstairs:downstairs-60bb712a-798b-43e2-87e2-dae6a68f94fa.log.1723488094 | looker
18:38:53.998Z INFO crucible: upstairs disconnected
    id = 0
    task = recv
18:38:53.998Z WARN crucible: recv_task sending ConnectionClosed
    id = 0
    task = recv
18:38:53.998Z INFO crucible: connection closed; disconnection
18:38:53.998Z INFO crucible: upstairs UpstairsConnection { upstairs_id: e45db0db-f9d8-47e6-9be1-3aef84da229f, session_id: 06191f6e-dd09-4041-ae1c-ffb9d1e94b60, gen: 1 } (ConnectionId(0)) removed, 0 jobs left
18:38:53.998Z INFO crucible: upstairs UpstairsConnection { upstairs_id: e45db0db-f9d8-47e6-9be1-3aef84da229f, session_id: 06191f6e-dd09-4041-ae1c-ffb9d1e94b60, gen: 1 } (ConnectionId(0)) was previously active, clearing
18:39:39.434Z INFO crucible: accepted connection from [fd00:1122:3344:116::1:a6]:56076
    task = main
18:39:39.434Z INFO crucible: connection ([fd00:1122:3344:116::1:a6]:56076): tasks spawned
    id = 1
18:39:39.434Z INFO crucible: Connection request from e45db0db-f9d8-47e6-9be1-3aef84da229f with version 8
18:39:39.434Z INFO crucible: upstairs UpstairsConnection { upstairs_id: e45db0db-f9d8-47e6-9be1-3aef84da229f, session_id: ef66202d-9830-4a91-9bf4-c22be271f29d, gen: 2 } connected, version 8
18:39:39.434Z INFO crucible: UpstairsConnection { upstairs_id: e45db0db-f9d8-47e6-9be1-3aef84da229f, session_id: ef66202d-9830-4a91-9bf4-c22be271f29d, gen: 2 } is now active (read-write)
18:39:39.443Z INFO crucible: Current flush_numbers [0..12]: [276, 306, 335, 364, 394, 423, 452, 482, 511, 540, 570, 599]
18:41:34.066Z INFO crucible: upstairs disconnected
    id = 1
    task = recv
18:41:34.066Z WARN crucible: recv_task sending ConnectionClosed
    id = 1
    task = recv
18:41:34.066Z INFO crucible: connection closed; disconnection
18:41:34.066Z INFO crucible: upstairs UpstairsConnection { upstairs_id: e45db0db-f9d8-47e6-9be1-3aef84da229f, session_id: ef66202d-9830-4a91-9bf4-c22be271f29d, gen: 2 } (ConnectionId(1)) removed, 0 jobs left
18:41:34.066Z INFO crucible: upstairs UpstairsConnection { upstairs_id: e45db0db-f9d8-47e6-9be1-3aef84da229f, session_id: ef66202d-9830-4a91-9bf4-c22be271f29d, gen: 2 } (ConnectionId(1)) was previously active, clearing
BRM42220013 # cat /pool/ext/a1856609-b49f-4d30-a0e3-2e8154dbcdbf/crypt/debug/oxz_crucible_475140fa-a5dc-4ec1-876d-751c48adfc37/oxide-crucible-downstairs:downstairs-ee2c2eda-9fc7-4e25-83f9-773fd33d478b.log.1723488094 | looker
18:38:53.998Z INFO crucible: upstairs disconnected
    id = 0
    task = recv
18:38:53.998Z WARN crucible: recv_task sending ConnectionClosed
    id = 0
    task = recv
18:38:53.998Z INFO crucible: connection closed; disconnection
18:38:53.998Z INFO crucible: upstairs UpstairsConnection { upstairs_id: e45db0db-f9d8-47e6-9be1-3aef84da229f, session_id: 06191f6e-dd09-4041-ae1c-ffb9d1e94b60, gen: 1 } (ConnectionId(0)) removed, 0 jobs left
18:38:53.998Z INFO crucible: upstairs UpstairsConnection { upstairs_id: e45db0db-f9d8-47e6-9be1-3aef84da229f, session_id: 06191f6e-dd09-4041-ae1c-ffb9d1e94b60, gen: 1 } (ConnectionId(0)) was previously active, clearing
18:39:39.434Z INFO crucible: accepted connection from [fd00:1122:3344:116::1:a6]:55290
    task = main
18:39:39.434Z INFO crucible: connection ([fd00:1122:3344:116::1:a6]:55290): tasks spawned
    id = 1
18:39:39.434Z INFO crucible: Connection request from e45db0db-f9d8-47e6-9be1-3aef84da229f with version 8
18:39:39.434Z INFO crucible: upstairs UpstairsConnection { upstairs_id: e45db0db-f9d8-47e6-9be1-3aef84da229f, session_id: ef66202d-9830-4a91-9bf4-c22be271f29d, gen: 2 } connected, version 8
18:39:39.434Z INFO crucible: UpstairsConnection { upstairs_id: e45db0db-f9d8-47e6-9be1-3aef84da229f, session_id: ef66202d-9830-4a91-9bf4-c22be271f29d, gen: 2 } is now active (read-write)
18:39:39.486Z INFO crucible: Current flush_numbers [0..12]: [276, 306, 335, 364, 394, 423, 452, 482, 511, 540, 570, 599]
18:41:34.066Z INFO crucible: upstairs disconnected
    id = 1
    task = recv
18:41:34.066Z WARN crucible: recv_task sending ConnectionClosed
    id = 1
    task = recv
18:41:34.066Z INFO crucible: connection closed; disconnection
18:41:34.066Z INFO crucible: upstairs UpstairsConnection { upstairs_id: e45db0db-f9d8-47e6-9be1-3aef84da229f, session_id: ef66202d-9830-4a91-9bf4-c22be271f29d, gen: 2 } (ConnectionId(1)) removed, 0 jobs left
18:41:34.066Z INFO crucible: upstairs UpstairsConnection { upstairs_id: e45db0db-f9d8-47e6-9be1-3aef84da229f, session_id: ef66202d-9830-4a91-9bf4-c22be271f29d, gen: 2 } (ConnectionId(1)) was previously active, clearing
BRM42220035 # cat /pool/ext/95220093-e3b8-4f7f-9f5a-cb32cb75180a/crypt/debug/oxz_crucible_3e598962-ef8c-4cb6-bdfe-ec8563939d6a/oxide-crucible-downstairs:downstairs-b5b90b8d-53b7-410a-81f2-f99bc9c41c2b.log.1723488094 | looker
18:38:54.001Z INFO crucible: upstairs disconnected
    id = 0
    task = recv
18:38:54.001Z WARN crucible: recv_task sending ConnectionClosed
    id = 0
    task = recv
18:38:54.001Z INFO crucible: connection closed; disconnection
18:38:54.001Z INFO crucible: upstairs UpstairsConnection { upstairs_id: e45db0db-f9d8-47e6-9be1-3aef84da229f, session_id: 06191f6e-dd09-4041-ae1c-ffb9d1e94b60, gen: 1 } (ConnectionId(0)) removed, 0 jobs left
18:38:54.001Z INFO crucible: upstairs UpstairsConnection { upstairs_id: e45db0db-f9d8-47e6-9be1-3aef84da229f, session_id: 06191f6e-dd09-4041-ae1c-ffb9d1e94b60, gen: 1 } (ConnectionId(0)) was previously active, clearing
18:39:39.434Z INFO crucible: accepted connection from [fd00:1122:3344:116::1:a6]:44735
    task = main
18:39:39.434Z INFO crucible: connection ([fd00:1122:3344:116::1:a6]:44735): tasks spawned
    id = 1
18:39:39.434Z INFO crucible: Connection request from e45db0db-f9d8-47e6-9be1-3aef84da229f with version 8
18:39:39.434Z INFO crucible: upstairs UpstairsConnection { upstairs_id: e45db0db-f9d8-47e6-9be1-3aef84da229f, session_id: ef66202d-9830-4a91-9bf4-c22be271f29d, gen: 2 } connected, version 8
18:39:39.434Z INFO crucible: UpstairsConnection { upstairs_id: e45db0db-f9d8-47e6-9be1-3aef84da229f, session_id: ef66202d-9830-4a91-9bf4-c22be271f29d, gen: 2 } is now active (read-write)
18:39:39.444Z INFO crucible: Current flush_numbers [0..12]: [276, 306, 335, 364, 394, 423, 452, 482, 511, 540, 570, 599]
18:41:34.066Z INFO crucible: upstairs disconnected
    id = 1
    task = recv
18:41:34.066Z WARN crucible: recv_task sending ConnectionClosed
    id = 1
    task = recv
18:41:34.066Z INFO crucible: connection closed; disconnection
18:41:34.066Z INFO crucible: upstairs UpstairsConnection { upstairs_id: e45db0db-f9d8-47e6-9be1-3aef84da229f, session_id: ef66202d-9830-4a91-9bf4-c22be271f29d, gen: 2 } (ConnectionId(1)) removed, 0 jobs left
18:41:34.066Z INFO crucible: upstairs UpstairsConnection { upstairs_id: e45db0db-f9d8-47e6-9be1-3aef84da229f, session_id: ef66202d-9830-4a91-9bf4-c22be271f29d, gen: 2 } (ConnectionId(1)) was previously active, clearing
pfmooney commented 2 months ago

Looking at the opcode(s), it would seem that the guest vCPU jumped off into space. This lends more credence to the necessity of #335.