oxidecomputer / crucible

A storage service.

crucible upstairs stuck in live repair loop resulted in continuous I/O errors in guest #837

Closed: jordanhendricks closed this issue 7 months ago

jordanhendricks commented 1 year ago

On the dogfood cluster in rack2, some Ubuntu 22.04 instances started complaining about I/O errors immediately after booting and never stopped. They were generally in such bad shape that one could not log in on the serial console, either.

Some sample errors from such a guest:

[42083.744118] EXT4-fs error (device nvme0n1p1): __ext4_find_entry:1663: inode #80932: comm cron: reading directory lblock 0
[42143.743759] EXT4-fs error (device nvme0n1p1): __ext4_get_inode_loc_noinmem:4433: inode #33445: block 2379: comm cron: unable to read itable block
[42143.746472] EXT4-fs error (device nvme0n1p1): __ext4_find_entry:1663: inode #80932: comm cron: reading directory lblock 0
[42203.746123] EXT4-fs error (device nvme0n1p1): __ext4_get_inode_loc_noinmem:4433: inode #33445: block 2379: comm cron: unable to read itable block

I drilled into one such instance and saw that one of the downstairs came and went a couple of times:

[2023-07-16T22:49:57.858213302Z]  INFO: propolis-server/25180 on oxz_propolis-server_da88ee76-b2a5-4c9c-840c-97130ae73896: Creating storage device sysbench-mysql-1 of kind Nvme
[2023-07-16T22:49:57.858314115Z]  INFO: propolis-server/25180 on oxz_propolis-server_da88ee76-b2a5-4c9c-840c-97130ae73896: Creating Crucible disk from request Volume { id: bdb29e87-bff9-4dd7-bb55-1acdb0e29519, block_size: 512, sub_volumes: [Region { block_size: 512, blocks_per_extent: 131072, extent_count: 1600, opts: CrucibleOpts { id: bdb29e87-bff9-4dd7-bb55-1acdb0e29519, target: [[fd00:1122:3344:107::6]:19000, [fd00:1122:3344:106::8]:19000, [fd00:1122:3344:102::a]:19000], lossy: false, flush_timeout: None, key: Some("gh6Yuzy3lH2jJgYjkaSd0Mnt1GP1l1uudDAVkAVuFOw="), cert_pem: None, key_pem: None, root_cert_pem: None, control: None, read_only: false }, gen: 1 }], read_only_parent: Some(Volume { id: a9a0f459-31a6-441c-80df-ab88d971d5ab, block_size: 512, sub_volumes: [Region { block_size: 512, blocks_per_extent: 131072, extent_count: 48, opts: CrucibleOpts { id: 09e3bab3-f79c-4d58-89fa-effb976f6a87, target: [[fd00:1122:3344:101::a]:19001, [fd00:1122:3344:105::d]:19001, [fd00:1122:3344:106::c]:19001], lossy: false, flush_timeout: None, key: Some("cM1bVwfdfxeygBfUgotFLRXk4WZUJaPV4ob4sQteHis="), cert_pem: None, key_pem: None, root_cert_pem: None, control: None, read_only: true }, gen: 2 }], read_only_parent: None }) }
[2023-07-16T22:49:57.858522642Z]  INFO: propolis-server/crucible-bdb29e87-bff9-4dd7-bb55-1acdb0e29519/25180 on oxz_propolis-server_da88ee76-b2a5-4c9c-840c-97130ae73896: Upstairs starts

...

[2023-07-16T22:51:11.99106978Z] ERROR: propolis-server/crucible-bdb29e87-bff9-4dd7-bb55-1acdb0e29519/25180 on oxz_propolis-server_da88ee76-b2a5-4c9c-840c-97130ae73896: process_ds_operation error: No longer active
[2023-07-16T22:51:11.991077575Z]  WARN: propolis-server/crucible-bdb29e87-bff9-4dd7-bb55-1acdb0e29519/25180 on oxz_propolis-server_da88ee76-b2a5-4c9c-840c-97130ae73896: [0] will exit pm_task, this downstairs Faulted
[2023-07-16T22:51:11.99108555Z] ERROR: propolis-server/crucible-bdb29e87-bff9-4dd7-bb55-1acdb0e29519/25180 on oxz_propolis-server_da88ee76-b2a5-4c9c-840c-97130ae73896: [fd00:1122:3344:107::6]:19000: proc: [0] client work task ended, Ok(Err([0] This downstairs now in Faulted)), so we end too (lo

We determined that the downstairs not being able to keep up was likely due to a connectivity issue between gc8 (downstairs) and gc16 (instance). (See: https://github.com/oxidecomputer/meta/issues/231).

A single faulted downstairs doesn't explain why the guest was seeing I/O errors. We also see from the propolis logs that the crucible upstairs is seemingly stuck in some sort of loop related to live repair:

[2023-07-16T23:54:45.950369378Z]  INFO: propolis-server/crucible-bdb29e87-bff9-4dd7-bb55-1acdb0e29519/25180 on oxz_propolis-server_da88ee76-b2a5-4c9c-840c-97130ae73896: Live Repair in progress, try again
[2023-07-16T23:55:45.953343455Z]  INFO: propolis-server/crucible-bdb29e87-bff9-4dd7-bb55-1acdb0e29519/25180 on oxz_propolis-server_da88ee76-b2a5-4c9c-840c-97130ae73896: Checking if live repair is needed
[2023-07-16T23:55:45.953416544Z]  WARN: propolis-server/crucible-bdb29e87-bff9-4dd7-bb55-1acdb0e29519/25180 on oxz_propolis-server_da88ee76-b2a5-4c9c-840c-97130ae73896: Upstairs repair task running, trying again later
[2023-07-16T23:55:45.953432715Z]  INFO: propolis-server/crucible-bdb29e87-bff9-4dd7-bb55-1acdb0e29519/25180 on oxz_propolis-server_da88ee76-b2a5-4c9c-840c-97130ae73896: Live Repair in progress, try again
[2023-07-16T23:56:45.95684997Z]  INFO: propolis-server/crucible-bdb29e87-bff9-4dd7-bb55-1acdb0e29519/25180 on oxz_propolis-server_da88ee76-b2a5-4c9c-840c-97130ae73896: Checking if live repair is needed
[2023-07-16T23:56:45.956919332Z]  WARN: propolis-server/crucible-bdb29e87-bff9-4dd7-bb55-1acdb0e29519/25180 on oxz_propolis-server_da88ee76-b2a5-4c9c-840c-97130ae73896: Upstairs repair task running, trying again later
[2023-07-16T23:56:45.956937888Z]  INFO: propolis-server/crucible-bdb29e87-bff9-4dd7-bb55-1acdb0e29519/25180 on oxz_propolis-server_da88ee76-b2a5-4c9c-840c-97130ae73896: Live Repair in progress, try again
[2023-07-16T23:57:45.959171286Z]  INFO: propolis-server/crucible-bdb29e87-bff9-4dd7-bb55-1acdb0e29519/25180 on oxz_propolis-server_da88ee76-b2a5-4c9c-840c-97130ae73896: Checking if live repair is needed
[2023-07-16T23:57:45.959222754Z]  WARN: propolis-server/crucible-bdb29e87-bff9-4dd7-bb55-1acdb0e29519/25180 on oxz_propolis-server_da88ee76-b2a5-4c9c-840c-97130ae73896: Upstairs repair task running, trying again

I then wondered what I/Os the guest was sending to the device, so I traced all nvme probes for both the propolis-server USDT provider and the pid provider (glomming on pid$N::*nvme*:). Surprisingly, I did not see any probes firing at all, indicating nothing was making it to our nvme emulation. I also traced I/Os to the downstairs and saw nothing in flight to any of the three. This is all strange, because the guest is clearly sending I/Os: it was continuously complaining about them.
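For reference, propolis defines its USDT probes in Rust via the oxidecomputer `usdt` crate. A minimal sketch of what a probe definition and firing site can look like; the provider name, probe name, and arguments below are hypothetical, for illustration only, not propolis's actual probes:

```rust
// Hypothetical USDT sketch using the `usdt` crate.
#[usdt::provider]
mod nvme_probes {
    fn submission(queue_id: u16, opcode: u8) {}
}

fn handle_submission(queue_id: u16, opcode: u8) {
    // The argument closure is only evaluated when a dtrace consumer has
    // enabled the probe, so firing is cheap when nothing is tracing.
    nvme_probes::submission!(|| (queue_id, opcode));
}

fn main() {
    // Register the probes so dtrace can see them in this process.
    usdt::register_probes().unwrap();
    handle_submission(0, 0x02);
}
```

Probes registered this way show up in dtrace -l under the provider name suffixed with the process id, which is what tracing the propolis-server USDT provider enumerates.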

Given that the path for an I/O from the guest is propolis nvme emulation -> crucible block device -> crucible upstairs -> etc., and that nothing was firing in either the emulation layer or the downstairs, this suggests to me that the nvme emulation itself was wedged. One theory is that crucible was stuck, the device's I/O queues filled up and never drained, and the guest could then not submit new I/Os at all (hence the complaining). I would like to prove that by inspecting the nvme queues in propolis, but short of better support for analyzing propolis-server core dumps, I'm not sure how to.
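To illustrate that theory (purely a sketch, not propolis's NVMe or crucible code): if the consumer of a bounded submission queue wedges, the queue fills and the producer, playing the role of the guest, can no longer submit anything.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

fn main() {
    // Hypothetical stand-in for an NVMe submission queue with a fixed depth.
    let (submit, backend) = mpsc::sync_channel::<u64>(4);

    // Stand-in for a wedged upstairs: it services one request and then
    // stops pulling work off the queue entirely.
    thread::spawn(move || {
        let _first = backend.recv();
        thread::sleep(Duration::from_secs(3600)); // wedged
    });

    // The "guest" keeps submitting I/Os. Once the queue is full and the
    // consumer never drains it, submission fails from then on, analogous
    // to a guest that can no longer get any I/O through the device.
    for lba in 0u64.. {
        match submit.try_send(lba) {
            Ok(()) => println!("submitted I/O for lba {lba}"),
            Err(mpsc::TrySendError::Full(_)) => {
                println!("queue full; guest can no longer submit I/O");
                break;
            }
            Err(e) => panic!("backend is gone: {e}"),
        }
    }
}
```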

I then tried rebooting a different guest with the same pathology and observed that it was stuck rebooting, because crucible was stuck trying to shut down. This further makes me think that crucible is stuck in an irrecoverable way. I don't have enough context in crucible to quickly understand why.

jordanhendricks commented 1 year ago

The propolis log, the log of the faulted downstairs, and the propolis-server core file I took and its associated binary (which sadly does not seem to have useful DWARF information) are at /data/staff/dogfood/crucible-837 on catacomb.

faithanalog commented 1 year ago

So, some context about

Upstairs repair task running, trying again later

This log message happens specifically in the path that ...

I'm not entirely sure what is then supposed to happen.

faithanalog commented 1 year ago

Per chat, it seems possible to me that an error during repair in the downstairs is getting eaten by the upstairs. The live repair that was running then hangs forever rather than failing out. When the downstairs returns, it is put into the LiveRepairReady state and waits for the previous repair to complete or fail so that it can try again. While it is waiting, it generates the looping repair logs you see. That previous repair is properly wedged, though, so it waits forever.

I'm unsure if that is the only problem going on here, but I think it may be why we see the looping live repair logs.
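A minimal sketch of that failure mode (hypothetical names and structure, not actual crucible code): if the task driving the repair drops its error without clearing the upstairs's "repair running" state, anything waiting for that repair to finish, like a downstairs sitting in LiveRepairReady, retries forever.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

fn main() {
    // Hypothetical stand-in for the upstairs's "a live repair is running" state.
    let repair_running = Arc::new(AtomicBool::new(true));

    // The repair task hits an error, but the error is dropped instead of
    // being propagated or clearing the repair state, so the flag stays set.
    let flag = Arc::clone(&repair_running);
    thread::spawn(move || {
        let result: Result<(), &str> = Err("downstairs went away mid-repair");
        if result.is_err() {
            // The bug being illustrated: the failure is eaten right here.
            return;
        }
        flag.store(false, Ordering::SeqCst);
    });

    // The returning downstairs periodically asks whether it can start its own
    // repair, mirroring the looping "Live Repair in progress, try again"
    // messages in the propolis log (capped here so the sketch terminates).
    for _ in 0..5 {
        if repair_running.load(Ordering::SeqCst) {
            println!("Upstairs repair task running, trying again later");
            thread::sleep(Duration::from_millis(100));
        } else {
            println!("starting live repair for the returning downstairs");
            break;
        }
    }
}
```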

leftwo commented 10 months ago

With https://github.com/oxidecomputer/crucible/pull/1058, I don't believe the original conditions that created this issue exist any longer, and we can probably close it.

askfongjojo commented 7 months ago

Marked it closed per the comment above. We've seen other instances stuck in the stopping state, but none of them were due to an unfinished repair.