Closed: jordanhendricks closed this issue 7 months ago.
The propolis log, the log of the faulted downstairs, and the propolis-server core file I took, along with the associated binary (which sadly does not seem to have useful DWARF information), are at /data/staff/dogfood/crucible-837 on catacomb.
Some context about the

Upstairs repair task running, trying again later

log message: it is emitted specifically in the path where a downstairs is waiting for a previous repair task to complete or fail. I'm not entirely sure what is then supposed to happen.
Per chat, it seems possible to me that an error during repair in the downstairs is getting eaten by the upstairs. The live repair that was running then hangs forever rather than failing out. When the downstairs returns, it is put into the LiveRepairReady state and waits for the previous repair to complete or fail so that it can try again. While it is waiting, it generates the looping repair logs you see. That previous repair is properly wedged, though, so it waits forever.
I'm unsure whether that is the only problem going on here, but I think it may be why we see the looping live repair logs.
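To make the suspected failure mode concrete, here is a minimal sketch. This is not Crucible's actual code; the `RepairState` type and its flag are made up for illustration. It just shows why a repair whose error was swallowed leaves a reconnected downstairs looping on the "trying again later" message forever:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

// Hypothetical stand-in for "a live repair is currently running".
struct RepairState {
    repair_in_progress: AtomicBool,
}

// A downstairs that reconnects in LiveRepairReady can't start a new
// repair until the previous one completes or fails, so it polls.
fn wait_for_previous_repair(state: Arc<RepairState>) {
    while state.repair_in_progress.load(Ordering::SeqCst) {
        // This is the looping message seen in the propolis log.
        println!("Upstairs repair task running, trying again later");
        thread::sleep(Duration::from_secs(1));
    }
    // If the old repair hit an error that was eaten instead of failing
    // the repair, the flag never flips and we never get here.
    println!("previous repair finished; starting a new live repair");
}

fn main() {
    let state = Arc::new(RepairState {
        // Simulate a wedged repair that will never complete or fail.
        repair_in_progress: AtomicBool::new(true),
    });
    wait_for_previous_repair(state);
}
```

Under this (assumed) model, the fix is that an error in the previous repair must always either clear the in-progress state or fail the repair outright; otherwise the waiter spins indefinitely.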
With https://github.com/oxidecomputer/crucible/pull/1058, I don't believe the conditions that originally created this issue exist any longer, and we can probably close it.
Marked it closed per the comment above. We've seen other instances stuck in the stopping state, but none of them were due to an unfinished repair.
On the dogfood cluster in rack2, some Ubuntu 22.04 instances started complaining about I/O errors immediately after booting and never stopped. They were generally so sad that one could not even log in on the serial console.
Some sample errors from such a guest:
I drilled into one such instance and saw that one of the downstairs came and went a couple of times:
We determined that the downstairs not being able to keep up was likely due to a connectivity issue between gc8 (downstairs) and gc16 (instance). (See: https://github.com/oxidecomputer/meta/issues/231).
A single faulted downstairs doesn't explain why the guest was seeing I/O errors. We also see from the propolis logs that the crucible upstairs is seemingly stuck in some sort of loop related to live repair:
I then wondered what I/Os the guest was sending the device, so I traced all nvme probes for both the propolis-server USDT provider and the pid provider (glomming on pid$N::*nvme*:). Surprisingly, I did not see any probes firing at all, indicating nothing was making it to our nvme emulation. I also traced I/Os to the downstairs and saw nothing in flight to any of the 3. This is all strange, because the guest is clearly sending I/Os, as it was continuously complaining.

Given that the path for an I/O from the guest is: propolis nvme emulation -> crucible block device -> crucible upstairs -> etc., and that nothing was firing in either the emulation layer or the downstairs, this suggests to me that the nvme emulation itself was wedged. One theory is that crucible got stuck, the I/O queues for the device filled up and never cleared, and the guest then could not submit new I/Os at all (hence the complaining). I would like to be able to prove that by inspecting the nvme queues in propolis, but short of better support for analyzing propolis-server core dumps, I'm not sure how to.
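For reference, the pid-provider half of that tracing can be reproduced with a one-liner along these lines. It uses the probe spec quoted above, restricted to function entry to keep the probe count sane; `<PID>` is the propolis-server process id and is the only assumption here:

```
# Count entries into any nvme-related function in propolis-server.
# If guest I/Os were reaching the nvme emulation, these aggregations
# would be nonzero; in this incident they stayed empty.
dtrace -p <PID> -n 'pid$target::*nvme*:entry { @[probefunc] = count(); }'
```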
I then tried rebooting a different guest with the same pathology and observed that it was stuck rebooting, because crucible was stuck trying to shut down. This further makes me think that crucible is stuck in an irrecoverable way. I don't have enough context in crucible to quickly understand why.