Open leftwo opened 1 year ago
What in that `retire_check` is failing an assertion?
> ::nm ! grep retire
0x0000000001463c10|0x0000000000000592|FUNC |GLOB |0x0 |14 |_ZN8crucible10Downstairs12retire_check17h3b5d34ae9aea2742E
> 0000000001463c10::dis
crucible::Downstairs::retire_check::h3b5d34ae9aea2742: pushq %rbp
crucible::Downstairs::retire_check::h3b5d34ae9aea2742+1:movq %rsp,%rbp
crucible::Downstairs::retire_check::h3b5d34ae9aea2742+4:pushq %r15
crucible::Downstairs::retire_check::h3b5d34ae9aea2742+6:pushq %r14
crucible::Downstairs::retire_check::h3b5d34ae9aea2742+8:pushq %r13
crucible::Downstairs::retire_check::h3b5d34ae9aea2742+0xa: pushq %r12
crucible::Downstairs::retire_check::h3b5d34ae9aea2742+0xc: pushq %rbx
crucible::Downstairs::retire_check::h3b5d34ae9aea2742+0xd: subq $0x218,%rsp
...
...
crucible::Downstairs::retire_check::h3b5d34ae9aea2742+0x474: call +0xf9b27 <core::panicking::assert_failed::h89a177d5655d43de>
crucible::Downstairs::retire_check::h3b5d34ae9aea2742+0x479: ud2
crucible::Downstairs::retire_check::h3b5d34ae9aea2742+0x47b: leaq +0x91dc77(%rip),%rdi <0x1d81d09>
crucible::Downstairs::retire_check::h3b5d34ae9aea2742+0x482: leaq +0xb0b6b7(%rip),%rdx <0x1f6f750>
crucible::Downstairs::retire_check::h3b5d34ae9aea2742+0x489: movl $0x2e,%esi
crucible::Downstairs::retire_check::h3b5d34ae9aea2742+0x48e: call +0x867aad <core::panicking::panic::he64c1b71ae3b1370>
crucible::Downstairs::retire_check::h3b5d34ae9aea2742+0x493: ud2
crucible::Downstairs::retire_check::h3b5d34ae9aea2742+0x495: leaq +0x91d329(%rip),%rdi <anon.664ec91311d0e51b0a0a04e44da0d672.95.llvm.8491239946558363113>
crucible::Downstairs::retire_check::h3b5d34ae9aea2742+0x49c: leaq +0xb0b6cd(%rip),%rdx <0x1f6f780>
crucible::Downstairs::retire_check::h3b5d34ae9aea2742+0x4a3: movl $0x2b,%esi
crucible::Downstairs::retire_check::h3b5d34ae9aea2742+0x4a8: call +0x867a93 <core::panicking::panic::he64c1b71ae3b1370>
crucible::Downstairs::retire_check::h3b5d34ae9aea2742+0x4ad: ud2
crucible::Downstairs::retire_check::h3b5d34ae9aea2742+0x4af: movq $0x0,0xfffffffffffffe80(%rbp)
crucible::Downstairs::retire_check::h3b5d34ae9aea2742+0x4ba: leaq 0xfffffffffffffe78(%rbp),%rsi
crucible::Downstairs::retire_check::h3b5d34ae9aea2742+0x4c1: leaq +0x91d9cf(%rip),%rdx <0x1d81aa7>
crucible::Downstairs::retire_check::h3b5d34ae9aea2742+0x4c8: leaq +0xb0b689(%rip),%r8 <0x1f6f768>
crucible::Downstairs::retire_check::h3b5d34ae9aea2742+0x4cf: leaq 0xfffffffffffffe80(%rbp),%rcx
crucible::Downstairs::retire_check::h3b5d34ae9aea2742+0x4d6: xorl %edi,%edi
crucible::Downstairs::retire_check::h3b5d34ae9aea2742+0x4d8: call +0xf9b63 <core::panicking::assert_failed::ha8afc05e959f3059>
crucible::Downstairs::retire_check::h3b5d34ae9aea2742+0x4dd: ud2
So, sn21 root filesystem was full, which was preventing me from logging into any zones. I cleaned up a little space and got it so I could log into the propolis zone, and I got this for the assertion failure:
Oct 15 05:16:39.918 INFO [1] Transition from Active to Failed
Oct 15 05:16:39.918 WARN [1] 5a8d8df7-6f99-4ba9-bb9f-d4473a080236 WARNING finish job 2574761 when downstairs state:Failed
thread 'tokio-runtime-worker' panicked at 'assertion failed: `(left == right)`
left: `NotAcked`,
right: `Acked`', /home/alan/ws/crucible/upstairs/src/lib.rs:3453:17
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
[ Oct 14 22:26:28 Stopping because all processes in service exited. ]
There is more in that log file though, which shows errors coming from the downstairs. I suspect this is in the error path and we are not handling it correctly in the upstairs, but enough with suspicions, let's go get some data.
In the log file, I see errors from two downstairs on a flush:
Oct 15 05:06:34.137 WARN [2] flow control sending more work
Oct 15 05:06:34.138 WARN [2] flow control end
Oct 15 05:07:21.570 WARN [0] flow control start
Oct 15 05:07:22.570 WARN [0] flow control sending more work
Oct 15 05:07:22.571 WARN [0] flow control end
Then we hit the error:
Oct 15 05:08:35.459 WARN [2] DS Reports error Err(GenericError("database disk image is malformed\n\nCaused by:\n Error code 11: The database disk image is malformed")) on job 2455147, DownstairsIO { ds_id: 2455147, guest_id: 2454148, work: Flush { dependencies: [2454020, 2454021, 2454022, 2454023, 2454024, 2454025, 2454026, 2454027, 2454028, ... (removed some massive dependency list) ... 2455139, 2455140, 2455141, 2455142, 2455143, 2455144, 2455145, 2455146], flush_number: 5086, gen_number: 1, snapshot_details: None }, state: {2: InProgress, 1: InProgress, 0: InProgress}, ack_status: NotAcked, replay: false, data: None, read_response_hashes: [] } EC, : downstairs
That downstairs then goes offline:
Oct 15 05:08:35.459 INFO [2] 5a8d8df7-6f99-4ba9-bb9f-d4473a080236 Active Active Active ds_transition to Failed
Oct 15 05:08:35.459 INFO [2] Transition from Active to Failed
Oct 15 05:08:35.460 WARN [2] 5a8d8df7-6f99-4ba9-bb9f-d4473a080236 WARNING finish job 2455148 when downstairs state:Failed
And, a few minutes later downstairs 1 hits an error:
Oct 15 05:16:39.917 WARN [1] DS Reports error Err(GenericError("database disk image is malformed\n\nCaused by:\n Error code 11: The database disk image is malformed")) on job 2574760, DownstairsIO { ds_id: 2574760, guest_id: 2573761, work: Flush { dependencies: [2573796, 2573797, ..(snip).. 2574757, 2574758, 2574759], flush_number: 5375, gen_number: 1, snapshot_details: None }, state: {1: InProgress, 2: Skipped, 0: InProgress}, ack_status: NotAcked, replay: false, data: None, read_response_hashes: [] } EC, : downstairs
This is the 2nd downstairs to go offline. It's for a different job as well.
Then, downstairs 1 goes offline and we get our panic:
Oct 15 05:16:39.918 INFO [1] 5a8d8df7-6f99-4ba9-bb9f-d4473a080236 Active Active Failed ds_transition to Failed
Oct 15 05:16:39.918 INFO [1] Transition from Active to Failed
Oct 15 05:16:39.918 WARN [1] 5a8d8df7-6f99-4ba9-bb9f-d4473a080236 WARNING finish job 2574761 when downstairs state:Failed
thread 'tokio-runtime-worker' panicked at 'assertion failed: `(left == right)`
left: `NotAcked`,
right: `Acked`', /home/alan/ws/crucible/upstairs/src/lib.rs:3453:17
So, we are in a "write more code" sort of situation here. We have not yet written the code to re-join a downstairs that has received an error during a write or flush, so any failed downstairs stays failed. When a 2nd downstairs goes offline, we have another problem: we can't ACK a write now that two downstairs have failed.
However, we have more to debug: what is this "database disk image is malformed" message about?
So, where are the downstairs? I believe this issue was with the 2nd disk on this instance (because I was sending traffic to it). Here is the VolumeConstructionRequest for that disk:
Oct 14 01:44:43.703 INFO Creating storage device zpool of kind Nvme
Oct 14 01:44:43.703 INFO Creating Crucible disk from request Volume { id: 5a8d8df7-6f99-4ba9-bb9f-d4473a080236, block_size: 4096, sub_volumes: [Region { block_size: 4096, opts: CrucibleOpts { id: 5a8d8df7-6f99-4ba9-bb9f-d4473a080236, target: [[fd00:1122:3344:101::c]:19000, [fd00:1122:3344:101::e]:19000, [fd00:1122:3344:101::6]:19000], lossy: false, flush_timeout: None, key: Some("tD/nIRTAv3hz2FR5sO/AcWwM2hbw+BPSr9X/jNcHKu0="), cert_pem: None, key_pem: None, root_cert_pem: None, control: None, read_only: false }, gen: 0 }], read_only_parent: None }
So, my downstairs are at:
[[fd00:1122:3344:101::c]:19000,
[fd00:1122:3344:101::e]:19000,
[fd00:1122:3344:101::6]:19000]
And, which downstairs are the 0,1,2?
Oct 14 01:44:43.704 INFO Crucible 5a8d8df7-6f99-4ba9-bb9f-d4473a080236 has session id: 5d8910d5-1e3b-4d27-8437-81dea28e8912
Oct 14 01:44:43.704 INFO [0] connecting to [fd00:1122:3344:101::c]:19000, looper: 0
Oct 14 01:44:43.704 INFO [1] connecting to [fd00:1122:3344:101::e]:19000, looper: 1
Oct 14 01:44:43.705 INFO [2] connecting to [fd00:1122:3344:101::6]:19000, looper: 2
Also, I see our UUID here, 5a8d8df7-6f99-4ba9-bb9f-d4473a080236
which we will need later as well.
Another piece we need is the UUID of the downstairs at 2. I can also get that from the log when the upstairs first connects to it:
Oct 14 01:44:43.708 INFO [2] Got region def RegionDefinition { block_size: 4096, extent_size: Block { value: 16384, shift: 12 }, extent_count: 1600, uuid: 12bb62da-7f59-42af-ae47-f61010524d6a, encrypted: true }
So, the UUID for downstairs 2 is: 12bb62da-7f59-42af-ae47-f61010524d6a
Our next step now is to figure out where [fd00:1122:3344:101::6] is, as that one gave us the first error.
Some hackery here to get the IPs for the crucible zones. This works because I only have one sled and all my crucible zones are on it:
for cz in $(zoneadm list | grep crucible); do echo $cz; zlogin $cz ifconfig -a | grep 3344:101; echo ""; done
And, from that output, I find my zone:
oxz_crucible_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b
inet6 fd00:1122:3344:101::6/64
Logging into the zone that contains the downstairs where we saw the first error, I can dump that downstairs log:
gimlet-sn21 # zlogin oxz_crucible_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b
[Connected to zone 'oxz_crucible_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b' pts/11]
Last login: Sun Oct 16 22:48:26 on pts/8
The illumos Project integration-0-gfc8ed03fa4 October 2022
root@oxz_crucible_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b:/data/regions/12bb62da-7f59-42af-ae47-f61010524d6a# cat $(svcs -L downstairs:downstairs-12bb62da-7f59-42af-ae47-f61010524d6a)
[ Oct 13 18:40:20 Disabled. ]
[ Oct 13 18:40:20 Enabled. ]
[ Oct 13 18:40:20 Executing start method ("/opt/oxide/crucible/bin/downstairs_method_script.sh"). ]
DTrace probes registered okay
Opened existing region file "/data/regions/12bb62da-7f59-42af-ae47-f61010524d6a/region.json"
UUID: 12bb62da-7f59-42af-ae47-f61010524d6a
Blocks per extent:16384 Total Extents: 1600
Using address: [fd00:1122:3344:101::6]:19000
Repair listens on [fd00:1122:3344:101::6]:23000
Using repair address: [fd00:1122:3344:101::6]:23000
No SSL acceptor configured
listening on [fd00:1122:3344:101::6]:19000
Oct 14 01:40:22.901 INFO listening, local_addr: [fd00:1122:3344:101::6]:23000
accepted connection from [fd00:1122:3344:101::12]:42589
upstairs UpstairsConnection { upstairs_id: 5a8d8df7-6f99-4ba9-bb9f-d4473a080236, session_id: 0b038ccc-3a95-4eda-8d81-9bec7043091b, gen: 0 } connected
UpstairsConnection { upstairs_id: 5a8d8df7-6f99-4ba9-bb9f-d4473a080236, session_id: 0b038ccc-3a95-4eda-8d81-9bec7043091b, gen: 0 } is now active (read-write)
Current flush_numbers [0..12]: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Downstairs has completed Negotiation
ERROR: connection([fd00:1122:3344:101::12]:42589): inactivity timeout
This error looks like it is coming from sqlite3!
case SQLITE_CORRUPT: z = "database disk image is malformed"; break;
Double plus ungood.
From the propolis log, I can see the UUIDs for 0 and 1 as well:
Oct 14 01:44:43.708 INFO [0] Got region def RegionDefinition { block_size: 4096, extent_size: Block { value: 16384, shift: 12 }, extent_count: 1600, uuid: 6dd7bc24-b356-434b-8bdb-bccdde31da53, encrypted: true }
Oct 14 01:44:43.708 INFO [1] Got region def RegionDefinition { block_size: 4096, extent_size: Block { value: 16384, shift: 12 }, extent_count: 1600, uuid: 0857feab-7b60-4ebe-a166-1f15b54aa140, encrypted: true }
Specifically:
6dd7bc24-b356-434b-8bdb-bccdde31da53
and
0857feab-7b60-4ebe-a166-1f15b54aa140
Now, let's go find those on the other two crucible zones.
> So, sn21 root filesystem was full
I tested (using openzfs version zfs-2.1.4-0ubuntu0.1) what happens to our sqlite when a zfs filesystem reaches its quota, but all I could see from sqlite was
Error: in prepare, disk I/O error (10)
Fortunately or unfortunately not the same "malformed" image.
Yeah, I was going to try something similar once I got sn21 back online. But in this situation it was not the region directory that ran out of space, it was /. This may have also put memory pressure on all running processes, so it's possible the failure came from the downstairs not being able to allocate memory. Maybe.
I was hoping to test some of those theories.
And, the "good" news is that the downstairs was taken offline, so the upstairs can (after more code is written) do a better job of handling this.
I saw these messages again on sn21. This time the root filesystem was not full, nor under any especially heavy workload.
No datasets were reporting errors, and a `zpool scrub` did not register any problems with the crucible datasets.
I saw this error twice today on sn21.
In both cases, the error was during a flush.
I was running with some new twists that might shed some additional light on the failure. The main difference is that I was using the parallel flush WIP code, where we try to flush all dirty extents at the same time.
When doing a flush, the "newish" checksum code does a re-read of an extent (into memory) then re-computes
the checksum and stores that with the flush.
For my test setup, I had 4k blocks with 32768 blocks per extent, so 134217728 bytes for each flush (to re-compute the hash). I believe doing this parallel flush means some number of these are all running at the same time.
I ran the test a third time and a downstairs failed with `memory allocation of 134217728 bytes failed`
So, I suspect the "database disk image is malformed" could be a result of memory pressure for the downstairs.
I've attached the logs from the two failing perf tests that encountered the error.
Also on sn21 at this same time:
Dec 13 20:31:13 gimlet-sn21 genunix: NOTICE: core_log: crucible-downsta[27031] core dumped: /data/cores/core.global.crucible-downsta.27031
Dec 13 20:31:39 gimlet-sn21 tmpfs: WARNING: /tmp: File system full, swap space limit exceeded
Dec 13 20:31:39 gimlet-sn21 last message repeated 1 time
Dec 13 20:31:46 gimlet-sn21 genunix: NOTICE: core_log: svc.startd[9] core dumped: /data/cores/core.global.svc.startd.9
Dec 13 20:31:46 gimlet-sn21 tmpfs: WARNING: /etc/svc/volatile: File system full, swap space limit exceeded
Adding mvp tag for this, as I think it happens when we run out of memory, and I've seen failures resulting in a downstairs not able to start due to database corruption after encountering this issue.
We need to get a better sense of when it happens, and plan how to either recover from it, or prevent it from happening, or really both.
I saw this again, this time on the dogfood rack when using a 1000 GiB volume.
The full log is here: /net/catacomb/data/staff/core/issues/crucible-485/system-illumos-propolis-server:vm-18bf2e81-626a-4c34-abdc-584ea0be9615.log
I had created the volume and I was filling it up when the error(s) happened.
Core(s) generated while omicron was running with 187 disks and one instance. The instance had two disks attached.
Many core files were generated from omicron, with the 2nd oldest being this one from inside crucible.
Core file on catacomb at: /data/staff/core/gimlet-sn21/20221015/core.oxz_propolis-server_1b5ebd28-3a8f-4013-8927-99c200dbd501.propolis-server.24772