oxidecomputer / crucible

A storage service.

Downstairs reports: database disk image is malformed #485

Open leftwo opened 1 year ago

leftwo commented 1 year ago

Core(s) generated while omicron was running with 187 disks and one instance. The instance had two disks attached.

Many core files were generated from omicron, with the 2nd oldest being this one from inside crucible.

Core file on catacomb at: /data/staff/core/gimlet-sn21/20221015/core.oxz_propolis-server_1b5ebd28-3a8f-4013-8927-99c200dbd501.propolis-server.24772

$ mdb core.oxz_propolis-server_1b5ebd28-3a8f-4013-8927-99c200dbd501.propolis-server.24772 
Loading modules: [ libumem.so.1 libc.so.1 ld.so.1 ]
> $C ! demangle
fffff5ffd07fe470 libc.so.1`_lwp_kill+0xa()
fffff5ffd07fe4a0 libc.so.1`raise+0x1e(6)
fffff5ffd07fe4f0 libc.so.1`abort+0x58()
fffff5ffd07fe500 panic_abort::__rust_start_panic::abort::h5e4ff9722552943b+9()
fffff5ffd07fe510 __rust_start_panic+9()
fffff5ffd07fe5c0 rust_panic+0x1c()
fffff5ffd07fe670 std::panicking::rust_panic_with_hook::ha80c5cb4a11249f7+0x1fe()
fffff5ffd07fe6c0 std::panicking::begin_panic_handler::{{closure}}::h1c465a251fe8dc5f+0xc5()
fffff5ffd07fe6f0 std::sys_common::backtrace::__rust_end_short_backtrace::h73c8afa9397a6a09+0x20()
fffff5ffd07fe730 rust_begin_unwind+0x71()
fffff5ffd07fe770 core::panicking::panic_fmt::h476086f4291cd29f+0x33()
fffff5ffd07fe850 core::panicking::assert_failed_inner::h1c6c74356de69ab3+0xca()
fffff5ffd07fe8b0 core::panicking::assert_failed::ha8afc05e959f3059+0x46()
fffff5ffd07feb00 crucible::Downstairs::retire_check::h3b5d34ae9aea2742+0x4dd()
fffff5ffd07fed90 tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut::hf1fc0701748dd1a8+0xaa4()
fffff5ffd07ffb10 tokio::runtime::task::harness::Harness<T,S>::poll::ha036dcbe5e688841+0x58()
fffff5ffd07ffbc0 std::thread::local::LocalKey<T>::with::hfaf80dd10aedc955+0x1c8()
fffff5ffd07ffc40 tokio::runtime::scheduler::multi_thread::worker::Context::run_task::h07f63c15f77994f4+0x21f()
fffff5ffd07ffd00 tokio::runtime::scheduler::multi_thread::worker::Context::run::ha897fdb5c3963887+0x72e()
fffff5ffd07ffd50 tokio::runtime::scheduler::multi_thread::worker::run::h2ec47ea0d8bd33fe+0x75()
fffff5ffd07ffd90 tokio::runtime::task::core::CoreStage<T>::poll::h0b439faa5b8bdc5a+0x3f()
fffff5ffd07ffe10 tokio::runtime::task::harness::Harness<T,S>::poll::hbf631161bda3ca94+0x49()
fffff5ffd07ffec0 tokio::runtime::blocking::pool::Inner::run::h24bd9cdb22619ff2+0xbf()
fffff5ffd07fff10 std::sys_common::backtrace::__rust_begin_short_backtrace::hee5d124e97635faa+0x109()
fffff5ffd07fff60 core::ops::function::FnOnce::call_once{{vtable.shim}}::h2fcf5432bb9edc09+0x78()
fffff5ffd07fffb0 std::sys::unix::thread::Thread::new::thread_start::h87fc76126371d726+0x27()
fffff5ffd07fffe0 libc.so.1`_thrp_setup+0x6c(fffff5ffee75a240)
fffff5ffd07ffff0 libc.so.1`_lwp_start()
leftwo commented 1 year ago

What in that retire_check is failing an assertion?

> ::nm ! grep retire
0x0000000001463c10|0x0000000000000592|FUNC |GLOB |0x0  |14      |_ZN8crucible10Downstairs12retire_check17h3b5d34ae9aea2742E
> 0000000001463c10::dis
crucible::Downstairs::retire_check::h3b5d34ae9aea2742:  pushq  %rbp
crucible::Downstairs::retire_check::h3b5d34ae9aea2742+1:movq   %rsp,%rbp
crucible::Downstairs::retire_check::h3b5d34ae9aea2742+4:pushq  %r15
crucible::Downstairs::retire_check::h3b5d34ae9aea2742+6:pushq  %r14
crucible::Downstairs::retire_check::h3b5d34ae9aea2742+8:pushq  %r13
crucible::Downstairs::retire_check::h3b5d34ae9aea2742+0xa:      pushq  %r12
crucible::Downstairs::retire_check::h3b5d34ae9aea2742+0xc:      pushq  %rbx
crucible::Downstairs::retire_check::h3b5d34ae9aea2742+0xd:      subq   $0x218,%rsp
...
...
crucible::Downstairs::retire_check::h3b5d34ae9aea2742+0x474:    call   +0xf9b27 <core::panicking::assert_failed::h89a177d5655d43de>
crucible::Downstairs::retire_check::h3b5d34ae9aea2742+0x479:    ud2    
crucible::Downstairs::retire_check::h3b5d34ae9aea2742+0x47b:    leaq   +0x91dc77(%rip),%rdi     <0x1d81d09>
crucible::Downstairs::retire_check::h3b5d34ae9aea2742+0x482:    leaq   +0xb0b6b7(%rip),%rdx     <0x1f6f750>
crucible::Downstairs::retire_check::h3b5d34ae9aea2742+0x489:    movl   $0x2e,%esi
crucible::Downstairs::retire_check::h3b5d34ae9aea2742+0x48e:    call   +0x867aad        <core::panicking::panic::he64c1b71ae3b1370>
crucible::Downstairs::retire_check::h3b5d34ae9aea2742+0x493:    ud2    
crucible::Downstairs::retire_check::h3b5d34ae9aea2742+0x495:    leaq   +0x91d329(%rip),%rdi     <anon.664ec91311d0e51b0a0a04e44da0d672.95.llvm.8491239946558363113>
crucible::Downstairs::retire_check::h3b5d34ae9aea2742+0x49c:    leaq   +0xb0b6cd(%rip),%rdx     <0x1f6f780>
crucible::Downstairs::retire_check::h3b5d34ae9aea2742+0x4a3:    movl   $0x2b,%esi
crucible::Downstairs::retire_check::h3b5d34ae9aea2742+0x4a8:    call   +0x867a93        <core::panicking::panic::he64c1b71ae3b1370>
crucible::Downstairs::retire_check::h3b5d34ae9aea2742+0x4ad:    ud2    
crucible::Downstairs::retire_check::h3b5d34ae9aea2742+0x4af:    movq   $0x0,0xfffffffffffffe80(%rbp)
crucible::Downstairs::retire_check::h3b5d34ae9aea2742+0x4ba:    leaq   0xfffffffffffffe78(%rbp),%rsi
crucible::Downstairs::retire_check::h3b5d34ae9aea2742+0x4c1:    leaq   +0x91d9cf(%rip),%rdx     <0x1d81aa7>
crucible::Downstairs::retire_check::h3b5d34ae9aea2742+0x4c8:    leaq   +0xb0b689(%rip),%r8      <0x1f6f768>
crucible::Downstairs::retire_check::h3b5d34ae9aea2742+0x4cf:    leaq   0xfffffffffffffe80(%rbp),%rcx
crucible::Downstairs::retire_check::h3b5d34ae9aea2742+0x4d6:    xorl   %edi,%edi
crucible::Downstairs::retire_check::h3b5d34ae9aea2742+0x4d8:    call   +0xf9b63 <core::panicking::assert_failed::ha8afc05e959f3059>
crucible::Downstairs::retire_check::h3b5d34ae9aea2742+0x4dd:    ud2    
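The frame at retire_check+0x4dd is the ud2 directly after the call at +0x4d8 to core::panicking::assert_failed::ha8afc05e959f3059, the same assert_failed symbol that appears directly above retire_check in the $C output, so the panic is coming from an assert_eq!-style check inside retire_check. Conceptually, an assert_eq! lowers to something like this simplified sketch (not the real std expansion, just the shape of it):

// Simplified sketch of what assert_eq!(left, right) turns into. On a
// mismatch, the real macro calls core::panicking::assert_failed, which
// is why that frame sits directly above retire_check in the $C output.
fn assert_eq_sketch<T: PartialEq + std::fmt::Debug>(left: T, right: T) {
    if left != right {
        panic!(
            "assertion failed: `(left == right)`\n  left: `{:?}`,\n right: `{:?}`",
            left, right
        );
    }
}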
leftwo commented 1 year ago

So, the sn21 root filesystem was full, which was preventing me from logging into any zones. I cleaned up a little space so I could log into the propolis zone, and this is what I found for the assertion failure:

Oct 15 05:16:39.918 INFO [1] Transition from Active to Failed
Oct 15 05:16:39.918 WARN [1] 5a8d8df7-6f99-4ba9-bb9f-d4473a080236 WARNING finish job 2574761 when downstairs state:Failed
thread 'tokio-runtime-worker' panicked at 'assertion failed: `(left == right)`
  left: `NotAcked`,
 right: `Acked`', /home/alan/ws/crucible/upstairs/src/lib.rs:3453:17
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
[ Oct 14 22:26:28 Stopping because all processes in service exited. ]

There is more in that log file though, showing errors coming from the downstairs. I suspect this is in the error path and we are not handling it correctly in the upstairs, but enough with suspicions; let's go get some data.
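For reference, the panic at upstairs/src/lib.rs:3453, together with the ack_status: NotAcked field that shows up in the DownstairsIO debug output below, suggests retire_check is asserting that a job being retired has already been acked back to the guest. A minimal sketch of that shape (the type and field names here are guessed from the debug output, not copied from the crucible source):

// Hypothetical sketch only; the real retire_check in upstairs/src/lib.rs
// is more involved. The point is that retiring a job whose ack_status is
// still NotAcked trips an assert_eq!, producing the
// `left: NotAcked, right: Acked` panic above.
#[derive(Debug, PartialEq)]
enum AckStatus {
    NotAcked,
    AckReady,
    Acked,
}

struct DownstairsIO {
    ack_status: AckStatus,
}

fn retire_check(job: &DownstairsIO) {
    assert_eq!(job.ack_status, AckStatus::Acked);
    // ... remove the retired job from the active list ...
}

fn main() {
    // Reproduces the shape of the panic seen in the propolis log.
    let job = DownstairsIO { ack_status: AckStatus::NotAcked };
    retire_check(&job);
}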

leftwo commented 1 year ago

In the log file, I see errors from two downstairs on a flush. Leading up to the first error, there were some flow control warnings:

Oct 15 05:06:34.137 WARN [2] flow control sending more work
Oct 15 05:06:34.138 WARN [2] flow control end 
Oct 15 05:07:21.570 WARN [0] flow control start 
Oct 15 05:07:22.570 WARN [0] flow control sending more work
Oct 15 05:07:22.571 WARN [0] flow control end 

Then we hit the error:

Oct 15 05:08:35.459 WARN [2] DS Reports error Err(GenericError("database disk image is malformed\n\nCaused by:\n    Error code 11: The database disk image is malformed")) on job 2455147, DownstairsIO { ds_id: 2455147, guest_id: 2454148, work: Flush { dependencies: [2454020, 2454021, 2454022, 2454023, 2454024, 2454025, 2454026, 2454027, 2454028, ... (removed some massive dependency list) ... 2455139, 2455140, 2455141, 2455142, 2455143, 2455144, 2455145, 2455146], flush_number: 5086, gen_number: 1, snapshot_details: None }, state: {2: InProgress, 1: InProgress, 0: InProgress}, ack_status: NotAcked, replay: false, data: None, read_response_hashes: [] } EC, : downstairs

That downstairs then goes offline:

Oct 15 05:08:35.459 INFO [2] 5a8d8df7-6f99-4ba9-bb9f-d4473a080236 Active Active Active ds_transition to Failed
Oct 15 05:08:35.459 INFO [2] Transition from Active to Failed
Oct 15 05:08:35.460 WARN [2] 5a8d8df7-6f99-4ba9-bb9f-d4473a080236 WARNING finish job 2455148 when downstairs state:Failed

And, a few minutes later downstairs 1 hits an error:

Oct 15 05:16:39.917 WARN [1sh] DS Reports error Err(GenericError("database disk image is malformed\n\nCaused by:\n    Error code 11: The database disk image is malformed")) on job 2574760, DownstairsIO { ds_id: 2574760, guest_id: 2573761, work: Flush { dependencies: [2573796, 2573797, ..(snip).. 2574757, 2574758, 2574759], flush_number: 5375, gen_number: 1, snapshot_details: None }, state: {1: InProgress, 2: Skipped, 0: InProgress}, ack_status: NotAcked, replay: false, data: None, read_response_hashes: [] } EC, : downstairs

This is the 2nd downstairs to go offline. It's for a different job as well.

Then, downstairs 1 goes offline and we get our panic:

Oct 15 05:16:39.918 INFO [1] 5a8d8df7-6f99-4ba9-bb9f-d4473a080236 Active Active Failed ds_transition to Failed
Oct 15 05:16:39.918 INFO [1] Transition from Active to Failed
Oct 15 05:16:39.918 WARN [1] 5a8d8df7-6f99-4ba9-bb9f-d4473a080236 WARNING finish job 2574761 when downstairs state:Failed
thread 'tokio-runtime-worker' panicked at 'assertion failed: `(left == right)`
  left: `NotAcked`,
 right: `Acked`', /home/alan/ws/crucible/upstairs/src/lib.rs:3453:17

So, we are in a "write more code" sort of situation here. We have not yet written the code to re-join a downstairs that has received an error during a write or flush, so any failed downstairs stays failed. When a 2nd downstairs goes offline we hit another problem: we can no longer ACK a write, because two of the three downstairs have failed.
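To make that concrete, here is an illustrative sketch, not the actual upstairs ack path: assume, as the paragraph above implies, that a write or flush can only be acked once at least two of the three downstairs have completed it (the IOState names are taken from the job state maps in the log, with New and Error added for the sketch):

// Illustrative only. Once two of the three downstairs are Failed, their
// jobs end up Skipped/Error, no in-flight job can ever reach two Done
// states, and everything is stuck at NotAcked, which is the state
// retire_check then trips over.
#[derive(Clone, Copy, PartialEq)]
enum IOState {
    New,
    InProgress,
    Done,
    Skipped,
    Error,
}

fn can_ack(states: [IOState; 3]) -> bool {
    states.iter().filter(|s| **s == IOState::Done).count() >= 2
}

fn main() {
    // Downstairs 0 finished the write, 1 errored, 2 was skipped: never ackable.
    let stuck = [IOState::Done, IOState::Error, IOState::Skipped];
    println!("can_ack: {}", can_ack(stuck)); // prints: can_ack: false
}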

However, we have more to debug: what is this "database disk image is malformed" message about?

leftwo commented 1 year ago

So, where are the downstairs? I believe this issue was with the 2nd disk on this instance (because I was sending traffic to it). Here is the VolumeConstructionRequest for that disk:

Oct 14 01:44:43.703 INFO Creating storage device zpool of kind Nvme
Oct 14 01:44:43.703 INFO Creating Crucible disk from request Volume { id: 5a8d8df7-6f99-4ba9-bb9f-d4473a080236, block_size: 4096, sub_volumes: [Region { block_size: 4096, opts: CrucibleOpts { id: 5a8d8df7-6f99-4ba9-bb9f-d4473a080236, target: [[fd00:1122:3344:101::c]:19000, [fd00:1122:3344:101::e]:19000, [fd00:1122:3344:101::6]:19000], lossy: false, flush_timeout: None, key: Some("tD/nIRTAv3hz2FR5sO/AcWwM2hbw+BPSr9X/jNcHKu0="), cert_pem: None, key_pem: None, root_cert_pem: None, control: None, read_only: false }, gen: 0 }], read_only_parent: None }

So, my downstairs are at:

[[fd00:1122:3344:101::c]:19000,
 [fd00:1122:3344:101::e]:19000, 
 [fd00:1122:3344:101::6]:19000]
leftwo commented 1 year ago

And which downstairs are 0, 1, and 2?

Oct 14 01:44:43.704 INFO Crucible 5a8d8df7-6f99-4ba9-bb9f-d4473a080236 has session id: 5d8910d5-1e3b-4d27-8437-81dea28e8912
Oct 14 01:44:43.704 INFO [0] connecting to [fd00:1122:3344:101::c]:19000, looper: 0
Oct 14 01:44:43.704 INFO [1] connecting to [fd00:1122:3344:101::e]:19000, looper: 1
Oct 14 01:44:43.705 INFO [2] connecting to [fd00:1122:3344:101::6]:19000, looper: 2

Also, I see our UUID here, 5a8d8df7-6f99-4ba9-bb9f-d4473a080236, which we will need later as well.

Another piece we need is the UUID of the downstairs at 2. I can also get that from the log when the upstairs first connects to it:

Oct 14 01:44:43.708 INFO [2] Got region def RegionDefinition { block_size: 4096, extent_size: Block { value: 16384, shift: 12 }, extent_count: 1600, uuid: 12bb62da-7f59-42af-ae47-f61010524d6a, encrypted: true }

So, the UUID for downstairs 2 is: 12bb62da-7f59-42af-ae47-f61010524d6a
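As an aside, that region definition gives a sense of scale for each downstairs. Quick arithmetic from the RegionDefinition fields above (the names in this snippet are mine; the numbers come from the log line):

// 4096-byte blocks, 16384 blocks per extent, 1600 extents.
fn main() {
    let block_size: u64 = 4096;
    let extent_size: u64 = 16384; // blocks per extent
    let extent_count: u64 = 1600;
    let per_extent = block_size * extent_size; // 67,108,864 bytes (64 MiB)
    let total = per_extent * extent_count; // 107,374,182,400 bytes (100 GiB)
    println!("{} bytes per extent, {} bytes per region", per_extent, total);
}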

Our next step now is to figure out where [fd00:1122:3344:101::6] is, as that one gave us the first error.

Some hackery here to get the IPs for the crucible zones. This works because I only have one sled and all my crucible zones are on it:

for cz in $(zoneadm list | grep crucible); do echo $cz; zlogin $cz ifconfig -a | grep 3344:101; echo ""; done

And, from that output, I find my zone:

oxz_crucible_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b
        inet6 fd00:1122:3344:101::6/64
leftwo commented 1 year ago

Logging into the zone that contains the downstairs where we saw the first error, I can dump that downstairs log:

gimlet-sn21 # zlogin oxz_crucible_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b
[Connected to zone 'oxz_crucible_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b' pts/11]
Last login: Sun Oct 16 22:48:26 on pts/8
The illumos Project     integration-0-gfc8ed03fa4       October 2022
root@oxz_crucible_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b:/data/regions/12bb62da-7f59-42af-ae47-f61010524d6a# cat $(svcs -L downstairs:downstairs-12bb62da-7f59-42af-ae47-f61010524d6a)
[ Oct 13 18:40:20 Disabled. ]
[ Oct 13 18:40:20 Enabled. ]
[ Oct 13 18:40:20 Executing start method ("/opt/oxide/crucible/bin/downstairs_method_script.sh"). ]
DTrace probes registered okay
Opened existing region file "/data/regions/12bb62da-7f59-42af-ae47-f61010524d6a/region.json"
UUID: 12bb62da-7f59-42af-ae47-f61010524d6a
Blocks per extent:16384 Total Extents: 1600
Using address: [fd00:1122:3344:101::6]:19000
Repair listens on [fd00:1122:3344:101::6]:23000
Using repair address: [fd00:1122:3344:101::6]:23000
No SSL acceptor configured
listening on [fd00:1122:3344:101::6]:19000
Oct 14 01:40:22.901 INFO listening, local_addr: [fd00:1122:3344:101::6]:23000
accepted connection from [fd00:1122:3344:101::12]:42589
upstairs UpstairsConnection { upstairs_id: 5a8d8df7-6f99-4ba9-bb9f-d4473a080236, session_id: 0b038ccc-3a95-4eda-8d81-9bec7043091b, gen: 0 } connected
UpstairsConnection { upstairs_id: 5a8d8df7-6f99-4ba9-bb9f-d4473a080236, session_id: 0b038ccc-3a95-4eda-8d81-9bec7043091b, gen: 0 } is now active (read-write)
Current flush_numbers [0..12]: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Downstairs has completed Negotiation
ERROR: connection([fd00:1122:3344:101::12]:42589): inactivity timeout
jmpesp commented 1 year ago

This error looks like it is coming from sqlite3!

case SQLITE_CORRUPT:    z = "database disk image is malformed";      break;

Double plus ungood.
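That also lines up with how the text reaches the upstairs log. Assuming (not verified against the downstairs source) that the downstairs hits this through the rusqlite crate and wraps the failure in an anyhow error, Debug-formatting that chain would give the same "Caused by:" shape we saw, which the downstairs would then ship over as a GenericError string:

// Sketch under assumptions: rusqlite + anyhow in play; corrupt_flush is
// a made-up name. SQLITE_CORRUPT is primary result code 11.
use anyhow::Result;

fn corrupt_flush() -> Result<()> {
    let ffi_err = rusqlite::ffi::Error::new(rusqlite::ffi::SQLITE_CORRUPT);
    let err = rusqlite::Error::SqliteFailure(
        ffi_err,
        Some("database disk image is malformed".to_string()),
    );
    Err(err.into())
}

fn main() {
    if let Err(e) = corrupt_flush() {
        // Debug-formatting an anyhow::Error walks the source chain:
        //   database disk image is malformed
        //
        //   Caused by:
        //       Error code 11: The database disk image is malformed
        println!("{:?}", e);
    }
}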

leftwo commented 1 year ago

From the propolis log, I can see the UUIDs for 0 and 1 as well:

Oct 14 01:44:43.708 INFO [0] Got region def RegionDefinition { block_size: 4096, extent_size: Block { value: 16384, shift: 12 }, extent_count: 1600, uuid: 6dd7bc24-b356-434b-8bdb-bccdde31da53, encrypted: true }
Oct 14 01:44:43.708 INFO [1] Got region def RegionDefinition { block_size: 4096, extent_size: Block { value: 16384, shift: 12 }, extent_count: 1600, uuid: 0857feab-7b60-4ebe-a166-1f15b54aa140, encrypted: true }

Specifically: 6dd7bc24-b356-434b-8bdb-bccdde31da53 and 0857feab-7b60-4ebe-a166-1f15b54aa140

Now, let's go find those on the other two crucible zones.

jmpesp commented 1 year ago

> So, sn21 root filesystem was full

I tested (using openzfs version zfs-2.1.4-0ubuntu0.1) what happens to our sqlite when a zfs filesystem reaches its quota, but all I could see from sqlite was

Error: in prepare, disk I/O error (10)

Fortunately or unfortunately, that is not the same "malformed" message.

leftwo commented 1 year ago

Yeah, I was going to try something similar once I got sn21 back online. But in this situation it was not the region directory that ran out of space, it was /. This may have also put memory pressure on all running processes, so it's possible the failure came from the downstairs not being able to allocate memory. Maybe.

I was hoping to test some of those theories.

And, the "good" news is that the downstairs was taken offline, so the upstairs can (after more code is written) do a better job of handling this.

leftwo commented 1 year ago

I saw these messages again on sn21. This time the root filesystem was not full, nor was the system under any especially heavy workload.

No datasets were reporting errors, and a zpool scrub did not register any problems with the crucible datasets.

leftwo commented 1 year ago

I saw this error twice today on sn21. In both cases, the error was during a flush. I was running with some new twists that might shed additional light on the failure. The main difference is that I was using the parallel flush WIP code, where we try to flush all dirty extents at the same time.
When doing a flush, the "newish" checksum code re-reads an extent (into memory), then re-computes the checksum and stores it with the flush.

For my test setup, I had 4k blocks with 32768 blocks per extent, so 134217728 bytes re-read for each extent flushed (to re-compute the hash). I believe doing this parallel flush means some number of these re-reads are all running at the same time.
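Back-of-the-envelope for that (the parallelism count below is made up, just to show how the memory demand stacks up when many dirty extents flush at once):

// 4096-byte blocks x 32768 blocks per extent = 134,217,728 bytes (128 MiB)
// of re-read buffer per extent being flushed.
fn main() {
    let block_size: usize = 4096;
    let blocks_per_extent: usize = 32_768;
    let per_extent = block_size * blocks_per_extent; // 134_217_728

    // Hypothetical: 10 extents flushing in parallel means ~1.25 GiB of
    // re-read buffers on top of everything else the downstairs holds.
    let parallel = 10;
    println!(
        "{} bytes per extent, {} bytes for {} extents in parallel",
        per_extent,
        per_extent * parallel,
        parallel
    );
}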

I ran the test a third time, and a downstairs failed with "memory allocation of 134217728 bytes failed".

So, I suspect the "database disk image is malformed" error could be a result of memory pressure on the downstairs.

I've attached the logs from the two failing perf tests that encountered the error.

perf-test-that-failed.txt

perf-test-that-failed-again.txt

leftwo commented 1 year ago

Also on sn21 at this same time:

Dec 13 20:31:13 gimlet-sn21 genunix: NOTICE: core_log: crucible-downsta[27031] core dumped: /data/cores/core.global.crucible-downsta.27031
Dec 13 20:31:39 gimlet-sn21 tmpfs: WARNING: /tmp: File system full, swap space limit exceeded
Dec 13 20:31:39 gimlet-sn21 last message repeated 1 time
Dec 13 20:31:46 gimlet-sn21 genunix: NOTICE: core_log: svc.startd[9] core dumped: /data/cores/core.global.svc.startd.9
Dec 13 20:31:46 gimlet-sn21 tmpfs: WARNING: /etc/svc/volatile: File system full, swap space limit exceeded
leftwo commented 1 year ago

Adding the mvp tag for this, as I think it happens when we run out of memory, and I've seen failures after hitting this issue where a downstairs was unable to start due to database corruption.

We need to get a better handle on when this happens, and plan how to either recover from it or prevent it from happening, or really both.

leftwo commented 1 year ago

I saw this again, this time on the dogfood rack when using a 1000 GiB volume.

The full log is here: /net/catacomb/data/staff/core/issues/crucible-485/system-illumos-propolis-server:vm-18bf2e81-626a-4c34-abdc-584ea0be9615.log

I had created the volume and I was filling it up when the error(s) happened.