oconnor663 / bao

an implementation of BLAKE3 verified streaming

Question: How can I verify the integrity of the outboard bytes using the blake3 root hash of the file? #42

redsolver opened this issue 1 year ago (status: Open)

redsolver commented 1 year ago

Hi, how can I verify the integrity of the outboard bytes using the BLAKE3 root hash of the file, in Rust? The use case: I have a 32-byte BLAKE3 hash for a file from a trusted source, and I download the entire outboard metadata into memory and verify it before starting to stream. The reason is that I usually want to stream the entire file, and streaming happens over the network, where reading all the outboard bytes up front is usually faster than sending a ton of requests for small byte ranges. I forked https://github.com/oconnor663/bao/tree/chunk_groups and changed the chunk group size to 256 KiB, but I don't think that's relevant to this question.

oconnor663 commented 1 year ago

That's a really interesting question. Consider a five-chunk Bao tree that looks like this:

                  root
                 /    \
          parent2     chunk4
        /         \
   parent0       parent1
    /   \         /   \
chunk0 chunk1 chunk2 chunk3

In that case the outboard tree contains these nodes:

                  root
                 /
          parent2
        /         \
   parent0       parent1

Currently I don't think there's any way to use the public Bao API to validate those nodes, if you don't have the chunks available to do a complete outboard decoding. But in theory you could do it. The "chaining value" (CV) of the root node (which you can reproduce using a public but #[doc(hidden)] module called guts) should match the BLAKE3 root hash, the CV of the parent2 node (setting is_root=false) should match the left half of the root node, and so on.

Having done this, you'd know that all the parent nodes were consistent with the root hash. But you still wouldn't know whether the chunks themselves match those parent nodes; only a full decode of the content can tell you that.

Maybe you could tell me more about your use case? Are you worried that your storage/transport isn't reliable? Or are you worried about attackers trying to DOS you? In both of those cases I'm not sure whether validating the outboard tree directly is the best way to go.

One thing I want to highlight just in case: Decoder::new_outboard doesn't require either of its inputs to be Seek. So if you open say one TcpStream of the outboard tree and another TcpStream of the contents, you can start reading verified content bytes before you've fetched the entire outboard tree. Any chance that's useful to you?

redsolver commented 1 year ago

Thanks for your answer! I'm building a decentralized content-addressed storage network, similar to IPFS. Unlike IPFS, files are not split into individually addressed chunks. Instead, I take a BLAKE3 hash of the entire file and store the bao tree with all nodes above the 256 KiB level (using the chunk_groups branch). The idea is that users only need to know the root hash of a large file to stream or download it from a completely untrusted network.

So right now I'm storing files like this: FULL_FILE_BYTES + BAO_OUTBOARD_BYTES + 16 bytes of version information and bao length

The generated CID (content identifier) consists of version + type + blake3_hash (32 bytes) + file size (little-endian, using only as many bytes as needed to store the number).
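For illustration, here is a stdlib-only sketch of that CID layout (the function name and signature are my guesses, not the real implementation):

```rust
// Illustrative sketch of the CID layout described above:
// version + type + blake3 hash (32 bytes) + minimal little-endian file size.
pub fn encode_cid(version: u8, cid_type: u8, blake3_hash: &[u8; 32], file_size: u64) -> Vec<u8> {
    let mut cid = Vec::with_capacity(2 + 32 + 8);
    cid.push(version);
    cid.push(cid_type);
    cid.extend_from_slice(blake3_hash);
    // Little-endian size, trimmed to as many bytes as the number needs.
    let le = file_size.to_le_bytes();
    let len = le.iter().rposition(|&b| b != 0).map_or(1, |i| i + 1);
    cid.extend_from_slice(&le[..len]);
    cid
}
```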

So if I know the CID of a specific video I want to watch, I just ask on the network if someone stores it. If yes, I first download the BAO_OUTBOARD_BYTES part of the file - because I know the total size of the file without the bao metadata from the CID, I can directly download the correct range. After fetching the bao outboard bytes, I can now start streaming parts of the video/file I'm interested in and want to verify the integrity of these parts on the fly with the root blake3 hash (from the CID) and the bao bytes I fetched.
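One small building block for that range computation: the outboard size is a pure function of the content length. A hedged sketch, assuming the standard bao outboard layout (an 8-byte length header plus one 64-byte parent node per split) applied to 256 KiB chunk groups:

```rust
// Sketch: expected outboard size for a given content length. A binary tree
// with n leaf groups has n - 1 parent nodes, each stored as 64 bytes
// (two 32-byte chaining values), after an 8-byte length header.
pub fn outboard_size(content_len: u64, chunk_group_size: u64) -> u64 {
    // Number of leaf groups, rounding up; an empty file still has one group.
    let groups = ((content_len + chunk_group_size - 1) / chunk_group_size).max(1);
    8 + 64 * (groups - 1)
}
```

With the file size known from the CID, the client can then request exactly the byte range [file_size, file_size + outboard_size) from a peer.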

The outboard bytes are at the end of the file because some applications using the network have their own integrity verification mechanisms (for example, authenticated encryption); in that case they can start streaming the file directly, without needing to skip over bao bytes at the start of the file.

redsolver commented 1 year ago

So to be precise, I don't really need to verify the entire bao outboard bytes against the root hash; I just thought that would be easier to implement, because I'm holding the entire bao tree in memory anyway.

I think I'm looking for an API that looks like this:

pub fn verify_integrity(bytes: Vec<u8>, offset: u32, bao_outboard_bytes: Vec<u8>, blake3_hash: Vec<u8>) -> bool
oconnor663 commented 1 year ago

My instinct is that you can't fully vet the outboard tree until you have the contents, and it's probably not worth the complexity to get partway there.

I can now start streaming parts of the video/file I'm interested in

Are you implementing video seeking through this API? Have you considered adding some code on the server/sender side to extract "slices"? That's Bao's way of saying "I want to read just this range of bytes, serve me those bytes along with just the specific tree nodes I need to verify them."

redsolver commented 1 year ago

Thanks for your help! I ended up with the following, which seems to do what I need (using code from https://github.com/oconnor663/bao/issues/34#issuecomment-1176595618):

use std::io::{Cursor, Read};

// FakeSeeker and from_vec_to_array come from the comment linked above;
// from_vec_to_array converts a Vec<u8> into a [u8; 32].
pub fn verify_integrity(
    chunk_bytes: Vec<u8>,
    offset: u64,
    bao_outboard_bytes: Vec<u8>,
    blake3_hash: Vec<u8>,
) -> Result<u8, anyhow::Error> {
    // Re-encode a slice covering this chunk from the outboard tree.
    let mut extractor = bao::encode::SliceExtractor::new_outboard(
        FakeSeeker::new(&chunk_bytes[..]),
        Cursor::new(&bao_outboard_bytes),
        offset,
        262144, // 256 KiB chunk group size
    );
    let mut slice = Vec::new();
    extractor.read_to_end(&mut slice)?;

    // Decode the slice against the trusted root hash; decoding fails with
    // an error if the bytes are inconsistent with the hash.
    let hash = bao::Hash::from(from_vec_to_array(blake3_hash));
    let mut decoder = bao::decode::SliceDecoder::new(&*slice, &hash, offset, 262144);

    let mut decoded = Vec::new();
    decoder.read_to_end(&mut decoded)?;

    Ok(1)
}

The video seeking/streaming is implemented in my code directly because it has some optimizations for streaming over the network like caching chunks. I'm just calling this new method to verify the integrity of a chunk before storing it in the local disk cache and showing it to the user.

oconnor663 commented 1 year ago

If you have the ability to seek through your file contents from the client side, then there's no particular need to use SliceExtractor. Instead, the client can just use Decoder::seek to accomplish the same thing, without paying the memory cost of materializing the slice:

use std::error::Error;
use std::io::prelude::*;
use std::io::{Cursor, SeekFrom};

fn main() -> Result<(), Box<dyn Error>> {
    let bigfile = vec![0x42; 1_000_000];
    let (outboard, hash) = bao::encode::outboard(&bigfile);
    let read_start = 123_456;
    let read_len = 654_321;

    let mut reader =
        bao::decode::Decoder::new_outboard(Cursor::new(&bigfile), Cursor::new(&outboard), &hash);
    reader.seek(SeekFrom::Start(read_start as u64))?;

    // Here we read the 654,321 content bytes into a Vec, but we could e.g. write them to stdout
    // without ever allocating memory to hold the whole thing.
    let mut decoded = vec![0; read_len];
    reader.read_exact(&mut decoded)?;

    assert_eq!(decoded, &bigfile[read_start..][..read_len]);
    Ok(())
}

Note that in the above example, the client doesn't even need the whole outboard tree. It only needs something like a Cursor that can seek through it.

The advantage of slicing is that it doesn't require the client to be able to seek (or to do a network round trip for each seek), and it also doesn't require the client to have two readers/cursors open:

use std::error::Error;
use std::io::prelude::*;
use std::io::Cursor;

fn main() -> Result<(), Box<dyn Error>> {
    let bigfile = vec![0x42; 1_000_000];
    let (outboard, hash) = bao::encode::outboard(&bigfile);
    let read_start = 123_456;
    let read_len = 654_321;

    // The server side provides this stream over e.g. HTTP.
    let mut slice_stream = bao::encode::SliceExtractor::new_outboard(
        Cursor::new(&bigfile),
        Cursor::new(&outboard),
        read_start,
        read_len,
    );

    // The client side consumes the slice stream. The slice itself is never materialized on either
    // end.
    let mut decode_stream =
        bao::decode::SliceDecoder::new(&mut slice_stream, &hash, read_start, read_len);

    // Here we read the 654,321 content bytes into a Vec, but we could e.g. write them to stdout
    // without ever allocating memory to hold the whole thing.
    let mut decoded = Vec::new();
    decode_stream.read_to_end(&mut decoded)?;

    assert_eq!(
        decoded,
        &bigfile[read_start as usize..][..read_len as usize],
    );
    Ok(())
}

redsolver commented 1 year ago

Thanks, your second example is an optimized version of what I need.

I can't really use the first example, because I'm calling the API from native Dart code or from JavaScript in the browser (via WASM), and the file itself is not downloaded or streamed in Rust, so implementing a seekable Cursor in Rust doesn't seem like the right approach. That's also why I don't really care about the decoded result; I just want to verify integrity.