oxidecomputer / crucible

A storage service.
Mozilla Public License 2.0

Simplify block context for reads #1391

Closed · mkeeter closed this 2 months ago

mkeeter commented 2 months ago

Right now, the ReadResponse sends back a Vec<Option<BlockContext>>, i.e. an Option<BlockContext> for each block. The BlockContext in turn contains:

pub struct BlockContext {
    pub hash: u64,
    pub encryption_context: Option<EncryptionContext>,
}

If encryption_context is populated, then sending hash to the Upstairs is unnecessary: we are using authenticated encryption, so we know whether the block data + context is valid based on whether it decrypted successfully.
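
As a standalone illustration of that property, here is a minimal sketch using the aes_gcm_siv crate directly (the key, nonce, and plaintext are made up for the example; Crucible's real code goes through its own EncryptionContext wrapper): flipping any bit of the ciphertext makes authenticated decryption fail, so a successful decryption already proves integrity.

    use aes_gcm_siv::{
        aead::{Aead, KeyInit},
        Aes256GcmSiv, Nonce,
    };

    fn main() {
        // Throwaway key and nonce, for illustration only.
        let cipher = Aes256GcmSiv::new_from_slice(&[0x42; 32]).unwrap();
        let nonce = Nonce::from_slice(b"unique nonce"); // 12 bytes

        let mut ciphertext =
            cipher.encrypt(nonce, b"block data".as_ref()).unwrap();

        // Untampered ciphertext decrypts fine...
        assert!(cipher.decrypt(nonce, ciphertext.as_ref()).is_ok());

        // ...but any flipped bit makes authenticated decryption fail, which
        // is why a separate hash adds no information for encrypted blocks.
        ciphertext[0] ^= 1;
        assert!(cipher.decrypt(nonce, ciphertext.as_ref()).is_err());
    }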

This PR removes hash from the ReadResponse message in favor of a new enum:

#[derive(Debug, PartialEq, Copy, Clone, Serialize, Deserialize)]
pub enum ReadBlockContext {
    Empty,
    Encrypted { ctx: EncryptionContext },
    Unencrypted { hash: u64 },
}
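
For intuition, Upstairs-side validation under the new enum could look roughly like this (a simplified sketch, not the code in this PR; decrypt_block and block_hash are hypothetical stand-ins for Crucible's real decryption and hashing plumbing):

    // Sketch of Upstairs-side validation under the new enum. The helper
    // names (decrypt_block, block_hash) are hypothetical stand-ins.
    fn validate_read(
        ctx: ReadBlockContext,
        data: &mut [u8],
    ) -> Result<(), CrucibleError> {
        match ctx {
            // An unwritten block: no context to check, data should be zeroes.
            ReadBlockContext::Empty => Ok(()),
            // Authenticated decryption doubles as the integrity check, so
            // no hash travels over the wire for encrypted blocks.
            ReadBlockContext::Encrypted { ctx } => decrypt_block(data, &ctx),
            // Unencrypted blocks still carry and check an explicit hash.
            ReadBlockContext::Unencrypted { hash } => {
                if block_hash(data) == hash {
                    Ok(())
                } else {
                    Err(CrucibleError::HashMismatch)
                }
            }
        }
    }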

This does not change the on-disk format or the write message format, which both continue to use hashes¹.

This PR is a step towards RFD 490 § Metadata reduction, which proposes to not store block hashes for encrypted data on disk. If in the future we don't store block hashes for encrypted data, we would not be able to send them over the network; this PR removes that future hurdle.

However, the PR stands alone as a small optimization (39 → 32 bytes per block) that simplifies program behavior (no need to think about what happens if encryption fails but the hash matches, or vice versa).

¹ on_disk_hash is also technically superfluous if we already have hash (see https://github.com/oxidecomputer/crucible/issues/1161), but this PR doesn't change it

jmpesp commented 2 months ago

If encryption_context is populated, then sending hash to the Upstairs is unnecessary: we are using authenticated encryption, so we know whether the block data + context is valid based on whether it decrypted successfully.

Isn't it necessary? Without it, the Upstairs can't distinguish between corruption in transit and a block where either the data or encryption context is corrupted. Those are two different scenarios where the Upstairs should take different actions (retry or bail).

mkeeter commented 2 months ago

Isn't it necessary? Without it, the Upstairs can't distinguish between corruption in transit and a block where either the data or encryption context is corrupted.

I don't think adding the hash will help us to distinguish between these two cases.

Right now, the Downstairs doesn't verify the hash before sending data over the wire, so corruption in transit and corruption on disk look the same: the hash or encryption context doesn't match at the Upstairs. Sending the hash doesn't give us any extra information about where corruption happened, because we don't guarantee that the hash is valid before sending it.

(I think this is intentional: having the Downstairs check the hash would add a bunch of overhead, and we've already got ZFS doing checksums under the hood for us)

jmpesp commented 2 months ago

I think it depends on whether we will ever receive corrupted data back from a ZFS* read instead of an error.

If it is possible for the Downstairs to receive corrupted data back from a read, then the integrity hash doesn't let the Upstairs distinguish anything beyond what the encryption context already does: a corrupted read means both that the integrity hash doesn't match and that the encryption context won't decrypt the block, and the Upstairs cannot tell those apart. Asking for a retry because the integrity hash is wrong (thinking that it was corrupted on the wire) won't fix it.

If it is not possible to receive corrupted data back from a read, then there is a difference: the Downstairs validates the integrity hash before accepting writes, which (if validation succeeds) means that the data was not corrupted in transit, and if the data comes back at all from a read then it wasn't corrupted at rest either. The Downstairs then doesn't have to validate the hash before sending data back to the Upstairs, because the only remaining place corruption can happen is in transit back to the Upstairs.

*: this doesn't apply to non-checksumming filesystems, but our production deployments always use ZFS :)

mkeeter commented 2 months ago

If it is possible for the Downstairs to receive corrupted data back from a read, then the integrity hash doesn't let the Upstairs distinguish anything beyond what the encryption context already does: a corrupted read means both that the integrity hash doesn't match and that the encryption context won't decrypt the block, and the Upstairs cannot tell those apart. Asking for a retry because the integrity hash is wrong (thinking that it was corrupted on the wire) won't fix it.

Yes, agreed – in a world where ZFS can give us corrupted data, we can't tell the difference between in-transit and on-disk corruption, so sending the hash doesn't give us extra information.

If it is not possible to receive corrupted data back from a read, then there is a difference: the Downstairs validates the integrity hash before accepting writes, which (if validation succeeds) means that the data was not corrupted in transit, and if the data comes back at all from a read then it wasn't corrupted at rest either. The Downstairs then doesn't have to validate the hash before sending data back to the Upstairs, because the only remaining place corruption can happen is in transit back to the Upstairs.

I also agree with this, but think it proves more than what you're saying – in fact, it proves that sending the hash is superfluous 😄

In a world where ZFS cannot give us corrupted data:

- data that the Downstairs accepted was hash-validated at write time, so it was not corrupted on the way in;
- reads of that data come back uncorrupted, or not at all;
- so the only remaining place corruption can happen is in transit back to the Upstairs, and authenticated decryption already detects that, with or without a hash.

This analysis ignores cases where someone deliberately corrupts files on disk; in such a situation, the attacker has full control over whether the hash matches or not, so it still doesn't give us any extra information.

jmpesp commented 2 months ago

Yes, agreed – in a world where ZFS can give us corrupted data, we can't tell the difference between in-transit and on-disk corruption, so sending the hash doesn't give us extra information.

Ignoring in-transit corruption for a sec, if the Downstairs can receive corrupted data back from ZFS, then it can tell if the integrity hash is wrong and send an appropriate error to the Upstairs. We don't currently check the hash because we believe that ZFS can't return us corrupted data: I think this is a safe assumption to keep making, but if it isn't then we can invent a new error type as appropriate.
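
For concreteness, such a check might look something like this (a sketch only: the DataCorruption variant and the surrounding names are invented, not part of this PR or the current code):

    // Hypothetical Downstairs-side check, only worthwhile in a world where
    // ZFS could hand back corrupted data. `integrity_hash` is Crucible's
    // existing helper; `DataCorruption` is an invented error variant.
    fn check_block(data: &[u8], stored_hash: u64) -> Result<(), CrucibleError> {
        if integrity_hash(&[data]) == stored_hash {
            Ok(())
        } else {
            // Unlike a HashMismatch detected at the Upstairs, this error
            // would mean the bytes were already bad when read from disk.
            Err(CrucibleError::DataCorruption)
        }
    }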

Therefore, sending the hash doesn't give us any extra information

I still don't follow. Right now the behaviour of validate_encrypted_read_response is:

    // We'll start with decryption, ignoring the hash; we're using authenticated
    // encryption, so the check is just as strong as the hash check.
    //
    // Note: decrypt_in_place does not overwrite the buffer if
    // it fails, otherwise we would need to copy here. There's a
    // unit test to validate this behaviour.
    use aes_gcm_siv::{Nonce, Tag};
    let decryption_result = encryption_context.decrypt_in_place(
        data,
        Nonce::from_slice(&block_encryption_ctx.nonce[..]),
        Tag::from_slice(&block_encryption_ctx.tag[..]),
    );
    if decryption_result.is_ok() {
        Ok(Validation::Encrypted(*block_encryption_ctx))
    } else {
        // Decryption failed: check the integrity hash to tell a wrong
        // key apart from corrupted data
        let computed_hash = integrity_hash(&[
            &block_encryption_ctx.nonce[..],
            &block_encryption_ctx.tag[..],
            data,
        ]);

        if computed_hash == context.hash {
            // No encryption context combination decrypted this block, but
            // one valid hash was found. This can occur if the decryption
            // key doesn't match the key that the data was encrypted with.
            error!(log, "Decryption failed with correct hash");
            Err(CrucibleError::DecryptionError)
        } else {
            error!(
                log,
                "No match for integrity hash\n\
             Expected: 0x{:x} != Computed: 0x{:x}",
                context.hash,
                computed_hash
            );
            Err(CrucibleError::HashMismatch)
        }
    }

I agree with the ordering here: attempt authenticated decryption first, then only if it fails check the integrity hash. The authenticated decryption is a "stronger" check anyway. If the block was corrupted in transit, as it's written today this function will likely return HashMismatch instead of DecryptionError.

Following where those errors are handled, though, both process_io_completion_inner and process_io_completion currently panic on both error types:

process_io_completion_inner does:

            Some(CrucibleError::DecryptionError) => {
                // We should always be able to decrypt the data.  If we
                // can't, then we have the wrong key, or the data (or key)
                // is corrupted.
                error!(
                    self.clients[client_id].log,
                    "Authenticated decryption failed on job: {:?}", ds_id
                );
                panic!(
                    "[{}] Authenticated decryption failed on job: {:?}",
                    client_id, ds_id
                );
            }

and process_io_completion does:

                        // If a read job fails, we sometimes need to panic.
                        IOop::Read { .. } => {
                            // It's possible we get a read error if the
                            // downstairs disconnects. However XXX, someone
                            // should be told about this error.
                            //
                            // Some errors, we need to panic on.
                            match e {
                                CrucibleError::HashMismatch => {
                                    panic!(
                                        "[{}] {} read hash mismatch {:?} {:?}",
                                        self.client_id, ds_id, e, job
                                    );
                                }
                                CrucibleError::DecryptionError => {
                                    panic!(
                                        "[{}] {} read decrypt error {:?} {:?}",
                                        self.client_id, ds_id, e, job
                                    );
                                }

We don't have anything where the Upstairs would re-request the read result or rerun the read job if it detects a hash mismatch like this. Given a network problem that corrupts packets, a whole bunch of Upstairs could panic, and then on some sort of reboot (of the instance or whatnot) all go into reconciliation, because the panics left them in a state that requires it. The Upstairs could be smarter about not panicking on a hash mismatch and somehow degrade gracefully instead; a sketch of one possible policy is below.
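
Here's one hypothetical shape for that policy (everything below is invented for illustration; nothing like it exists in the code today):

    /// Sketch: decide how the Upstairs reacts to a failed read validation,
    /// instead of panicking unconditionally. Illustrative names only.
    enum ReadFailureAction {
        /// Consistent with in-transit corruption: re-issue the read.
        Retry,
        /// Wrong key or corrupted data: retrying won't help.
        Bail,
    }

    fn on_read_failure(e: &CrucibleError) -> ReadFailureAction {
        match e {
            // A mismatched hash could be a bit flip on the wire; retrying
            // the read against the same Downstairs is cheap.
            CrucibleError::HashMismatch => ReadFailureAction::Retry,
            // Decryption failing while the hash matched means the key is
            // wrong; no number of retries will fix that.
            CrucibleError::DecryptionError => ReadFailureAction::Bail,
            _ => ReadFailureAction::Bail,
        }
    }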

Writing all this out, I think there's a broader problem to address: such a network problem could potentially corrupt any frame sent between the Upstairs and Downstairs, probably leading to a whole bunch of error reading from FramedReader errors on both sides. We should probably embed a checksum into the frame itself to detect this? That may also help reject invalid frames (like when I mistakenly issued GET requests against what turned out to be a Downstairs!). I'm ok not recording the integrity hash, in the effort to reduce stored metadata, if we instead perform that checksum check at a "lower" level than the Message, so to speak. What do you think?
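
As a sketch of what that lower-level check could look like (crc32fast is just an example checksum here, and the framing helpers are invented, not Crucible's actual codec):

    use crc32fast::Hasher;

    /// Sketch: append a CRC32 to every encoded frame and verify it on
    /// receipt, before the bytes reach the Message decoder.
    fn encode_frame(payload: &[u8]) -> Vec<u8> {
        let mut hasher = Hasher::new();
        hasher.update(payload);
        let crc = hasher.finalize();

        let mut frame = Vec::with_capacity(payload.len() + 4);
        frame.extend_from_slice(payload);
        frame.extend_from_slice(&crc.to_le_bytes());
        frame
    }

    fn decode_frame(frame: &[u8]) -> Option<&[u8]> {
        let (payload, crc_bytes) = frame.split_at(frame.len().checked_sub(4)?);
        let want = u32::from_le_bytes(crc_bytes.try_into().ok()?);

        let mut hasher = Hasher::new();
        hasher.update(payload);
        // A mismatch means the frame was corrupted in transit, or wasn't a
        // Crucible frame at all (e.g. a stray HTTP request).
        (hasher.finalize() == want).then_some(payload)
    }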

jmpesp commented 2 months ago

Yeah, I think I get it now. I had to draw this out, assuming that network corruption would manifest as bit flips:

|                               | no bit flip in hash            | bit flip in hash                  |
|-------------------------------|--------------------------------|-----------------------------------|
| no bit flip in data + context | decryption success, hash match | decryption success, hash mismatch |
| bit flip in data + context    | decryption fail, hash mismatch | decryption fail, hash mismatch    |

Having the hash doesn't help us, because when decryption fails there's a hash mismatch, and when decryption succeeds the hash doesn't matter. Sorry I was so confused about this. In fact I think the only part where we can reliably detect packet corruption is the case where the decryption succeeds and there's a hash mismatch, meaning we can say there was a bit flip in the integrity hash part of the frame.

I was hoping to come out of this with some ability for the Upstairs to tell the difference between a decryption failure it can somehow retry (by re-requesting the block or something) and a decryption failure it knows not to retry (due to on-disk corruption). If the Downstairs only validates the hash sent by the Upstairs before a write, and the write succeeds, then later corruption can occur via the disk or the network and the Upstairs can't tell the difference. This makes me uncomfortable w.r.t. the potential for cascading reconciliations. I was (again) hoping that we could detect network corruption and not attempt reconciliation over what we can determine to be an unreliable network.

I'll copy what you said in a DM here for posterity: we either have to

- have the Downstairs validate each block's integrity hash before sending a read response, or
- embed a checksum in the frame itself and verify it on receipt,

where unfortunately both of those options require computing a full hash of the block data in the Downstairs.
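
For scale, the first option would amount to roughly this on the Downstairs read path (a sketch reusing the integrity_hash helper quoted earlier; ReadBlock and the function itself are invented):

    // Sketch: validate each block's stored hash before it goes on the wire.
    // `integrity_hash` is Crucible's existing helper; `ReadBlock` is invented.
    struct ReadBlock {
        data: Vec<u8>,
        nonce: [u8; 12],
        tag: [u8; 16],
        stored_hash: u64,
    }

    fn validate_before_send(blocks: &[ReadBlock]) -> Result<(), CrucibleError> {
        for b in blocks {
            // This is the cost in question: one full hash over the block
            // data, per block read, computed on the Downstairs.
            let computed = integrity_hash(&[&b.nonce[..], &b.tag[..], &b.data]);
            if computed != b.stored_hash {
                return Err(CrucibleError::HashMismatch);
            }
        }
        Ok(())
    }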