oxidecomputer / crucible

A storage service.
Mozilla Public License 2.0
158 stars 16 forks source link

Consider merging `DsState::{Faulted, Offline}` #1258

Open mkeeter opened 4 months ago

mkeeter commented 4 months ago

The documentation for enum DsState includes this lovely diagram:

 *                       │
 *                       ▼
 *                       │
 *                  ┌────┴──────┐
 *   ┌───────┐      │           ╞═════◄══════════════════╗
 *   │  Bad  │      │    New    ╞═════◄════════════════╗ ║
 *   │Version├──◄───┤           ├─────◄──────┐         ║ ║
 *   └───────┘      └────┬───┬──┘            │         ║ ║
 *                       ▼   └───►───┐       │         ║ ║
 *                  ┌────┴──────┐    │       │         ║ ║
 *                  │   Wait    │    │       │         ║ ║
 *                  │  Active   ├─►┐ │       │         ║ ║
 *                  └────┬──────┘  │ │  ┌────┴───────┐ ║ ║
 *   ┌───────┐      ┌────┴──────┐  │ └──┤            │ ║ ║
 *   │  Bad  │      │   Wait    │  └────┤Disconnected│ ║ ║
 *   │Region ├──◄───┤  Quorum   ├──►────┤            │ ║ ║
 *   └───────┘      └────┬──────┘       └────┬───────┘ ║ ║
 *               ........▼..........         │         ║ ║
 *  ┌─────────┐  :  ┌────┴──────┐  :         ▲         ║ ║
 *  │ Failed  │  :  │ Reconcile │  :         │       ╔═╝ ║
 *  │Reconcile├─◄───┤           ├──►─────────┘       ║   ║
 *  └─────────┘  :  └────┬──────┘  :                 ║   ║
 *  Not Active   :       │         :                 ▲   ▲  Not Active
 *  .............. . . . │. . . . ...................║...║............
 *  Active               ▼                           ║   ║  Active
 *                  ┌────┴──────┐         ┌──────────╨┐  ║
 *              ┌─►─┤  Active   ├─────►───┤Deactivated│  ║
 *              │   │           │  ┌──────┤           ├─◄──────┐
 *              │   └─┬───┬───┬─┘  │      └───────────┘  ║     │
 *              │     ▼   ▼   ▲    ▲                     ║     │
 *              │     │   │   │    │                     ║     │
 *              │     │   │   │    │                     ║     │
 *              │     │   │   ▲  ┌─┘                     ║     │
 *              │     │   │ ┌─┴──┴──┐                    ║     │
 *              │     │   │ │Replay │                    ║     │
 *              │     │   │ │       ├─►─┐                ║     │
 *              │     │   │ └─┬──┬──┘   │                ║     │
 *              │     │   ▼   ▼  ▲      │                ║     │
 *              │     │   │   │  │      │                ▲     │
 *              │     │ ┌─┴───┴──┴──┐   │   ┌────────────╨──┐  │
 *              │     │ │  Offline  │   └─►─┤   Faulted     │  │
 *              │     │ │           ├─────►─┤               │  │
 *              │     │ └───────────┘       └─┬─┬───────┬─┬─┘  │
 *              │     │                       ▲ ▲       ▼ ▲    ▲
 *              │     └───────────►───────────┘ │       │ │    │
 *              │                               │       │ │    │
 *              │                      ┌────────┴─┐   ┌─┴─┴────┴─┐
 *              └──────────────────────┤   Live   ├─◄─┤  Live    │
 *                                     │  Repair  │   │  Repair  │
 *                                     │          │   │  Ready   │
 *                                     └──────────┘   └──────────┘

It took me a while to wrap my head around the difference between Offline and Faulted:

Offline versus Faulted is not a property of the actual Downstairs, which is doing its own thing on the other side of the network boundary. Why do we need this distinction at all?

In other words, the OfflineFaulted transition implements policy choices about how many jobs the Upstairs is allowed to keep around. Framed this way, it's not obvious why this logic should be attached to the DsState.

Instead, I'd like to propose attaching that logic to the Upstairs itself:

This would cut through a bunch of gnarly code (e.g. the coupling between backpressure and transitions to DsState::Faulted), and seems logically consistent.

leftwo commented 4 months ago

So, I'm not sure I see the value here in this, but let me explain some of what I'm thinking and I want to hear more about your idea as well.

I think many (all?) of the states in DsState are about the upstairs policy, and in most cases the actual downstairs themselves don't know what their state is in the upstairs. We have to keep state about each downstairs somewhere in the Upstairs, so I'm curious to hear why the new way is better than what we have now.

True that the Replay state is not bearing much load any longer as we don't know when a downstairs has "caught up", and, given the code changes with the refactor, it's probably vestigial at this point. Though, we do need to know if a read IO is a replay to know if we can ignore checksum mismatches between it an other IOs that could have different data.

For the collapse of Faulted and Offline, sometimes we know that a downstairs is bad right away (Like, it has returned a write error, or failed a repair command). In those cases there is no reason to hold any jobs for that downstairs, and it must come back through LiveRepair, so we can skip them and free any resources without having to wait for the downstairs to return to do so. Also, if a downstairs has gone Offline and we have reached the 1Gib/10K jobs limit, we won't keep holding IOs for that downstairs any longer, right? And new IOs that arrive after this point, they are "skipped" as well?

In your proposal, you are moving the logic of how to handle the difference between Faulted and Offline into another part of the upstairs?

Another piece that I do like with the current model is the observability we get with the state transitions. I do find these useful when debugging.