oconnor663 / bao

an implementation of BLAKE3 verified streaming

support larger "chunk groups" for reduced space overhead #34

Open · oconnor663 opened this issue 4 years ago

oconnor663 commented 4 years ago

The chunk size of BLAKE3 is fixed for all of time at 1024 bytes. However, there's nothing stopping Bao from using larger "chunk groups" on the wire and on disk. For example, Bao could work with chunk groups of 64 KiB. Internally, each group would still need to be hashed in 1024-byte chunks, so CPU performance would be the same as before. (And crucially, the BLAKE3 root hash would also be unchanged.) However, the space overhead of storing and transmitting interleaved parent nodes would be reduced. This is the difference between 6% space overhead, which is too much for some applications, and 0.1% space overhead, which presumably no one cares about. However, the overhead of seeking would increase, which could hurt applications that do lots of scattered small reads.
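
To make those numbers concrete, here's a back-of-the-envelope sketch (just arithmetic, not part of bao's API): the outboard encoding stores roughly one 64-byte parent node per chunk or chunk group, so the relative space overhead is about 64 divided by the group size.

const PARENT_SIZE: f64 = 64.0; // one parent node = two 32-byte child hashes

fn approx_overhead(group_size: u64) -> f64 {
    // Roughly one parent node per group, ignoring the 8-byte length header.
    PARENT_SIZE / group_size as f64
}

fn main() {
    for group_size in [1024u64, 16 * 1024, 64 * 1024] {
        println!(
            "group size {:>6} bytes -> ~{:.2}% space overhead",
            group_size,
            100.0 * approx_overhead(group_size)
        );
    }
}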

Although all root hashes would still be standard BLAKE3, the encoding formats of Bao clients using different chunk group sizes would not be compatible. A big open question is how configurable this should be: Can we document clearly enough that it's a compatibility issue? And what should the default be?

SonnyX commented 3 years ago

In the case of the library, I'd say leave it up to the developer: the default should stay the default BLAKE3 chunk size, whereas a user of your library could choose a power of two larger than 2^10 (BLAKE3's default chunk size of 1024).

Plus one for this feature; I'm considering using it to improve the infrastructure of the free game RenegadeX. The size I would personally like to use is on the order of 2^16 or 2^20. I've looked through your code, but I doubt it's as simple as only changing the CHUNK_SIZE constant in your lib.rs file.

oconnor663 commented 3 years ago

Yes it won't quite be as simple as a constant change, but hopefully the overall structure of the code shouldn't need to change too much. I might take a shot at this next time I have a free Saturday :)

gcxfd commented 3 years ago

Maybe a hash tree could be used, where each node at one level merges two hashes from the level below, and the bottom levels could be truncated to save space.

dignifiedquire commented 2 years ago

Agreed, it would be very useful to go to a 64 KiB chunk-group size to minimize space overhead. While 6% sounds small, it does add up at TB scales of data.

gcxfd commented 2 years ago

Maybe a hash tree could be used, where each node at one level merges two hashes from the level below, and the bottom levels could be truncated to save space.

blake3 is based on a merkle tree, but the exposed interface cannot export the merkle tree.

bao implements blake3 streaming verification, but it cannot resize the underlying chunks (see this issue, support larger "chunk groups" for reduced space overhead).

That is, bao consumes 6% extra storage space to record the merkle tree, which is a significant overhead for a distributed content index.

So I implemented blake3_merkle, which exports a 32-byte hash for every 1 MB of content, with an additional storage overhead of only 0.3‱ (about 0.003%).

The merkle tree generates hash values consistent with blake3.

When the content is 1 MB or less, the merkle tree has only one node, and that node's hash equals the blake3 hash.

./examples/main.rs is as follows:

use blake3_merkle::Merkle;

use std::{env, error::Error, fs::File, io::copy};

fn main() -> Result<(), Box<dyn Error>> {
  let fpath = env::current_dir()?.join("test.pdf");

  let mut blake3 = blake3::Hasher::new(); // plain blake3, for comparison
  copy(&mut File::open(&fpath)?, &mut blake3)?;

  let mut merkle = Merkle::new(); // the merkle wrapper from blake3_merkle
  copy(&mut File::open(&fpath)?, &mut merkle)?;
  merkle.finalize();
  dbg!(&merkle.li); // the list of (hash, depth) subtree nodes
  dbg!(merkle.blake3()); // root hash computed from the merkle nodes
  dbg!(blake3.finalize()); // should equal merkle.blake3()
  Ok(())
}

Run ./example.main.sh and the output is as follows

[examples/main.rs:14] &merkle.li = [
    HashDepth {
        hash: Hash(
            "eb896f431b7ff8acb4749b54981d461359a01ded0261fa0da856dd28bf29d3b3",
        ),
        depth: 10,
    },
    HashDepth {
        hash: Hash(
            "4a84cc85f03f47a7c32755f8d9d81c5d3f3e04548ee8129fd480cb71c7dbc5b4",
        ),
        depth: 10,
    },
    HashDepth {
        hash: Hash(
            "fbfe78e550b355cb6775e324c4fed7eb987084b115dca599aaf40056bfb031c3",
        ),
        depth: 10,
    },
    HashDepth {
        hash: Hash(
            "392878c3bdc9c315d6cc8a1721d8cd0a39e49ac8716f4cb8cdf6cf83fbb666f5",
        ),
        depth: 6,
    },
]
[examples/main.rs:15] merkle.blake3() = Hash(
    "74a79d0bc37dcac64c493e872252f19e8bdb32dee306481a6827fa037b378c76",
)
[examples/main.rs:16] blake3.finalize() = Hash(
    "74a79d0bc37dcac64c493e872252f19e8bdb32dee306481a6827fa037b378c76",
)
jonahkaye commented 2 years ago

Hi, I am working on a file storage application using bao, but finding that the 6% overhead is a bit too large for us. Was the variable chunk group size ever implemented / are there plans to do so? Thanks!

oconnor663 commented 2 years ago

It's definitely on the "roadmap", but I'm not sure exactly when I'll get around to doing it. In their comment above, @gcxfd linked to some code that has this working today, and you might want to give that a try.

jonahkaye commented 2 years ago

Ok, thank you @oconnor663! Another overhead-related question: we really only want to use the OBAO files, to cut down on overhead, but we also want to use slices for merkle proof verification purposes. The documentation says that the OBAO files can only ever be used with the whole input file, and I was wondering if there is a theoretical reason for this limitation? If not, would you want me to fork and add the functionality to encode slices using OBAO files and the selected slice of the input file?

oconnor663 commented 2 years ago

@jonahkaye oh that's interesting. It's doable today with a hack and a bit of extra complexity. The slice extractor expects to be used with the whole file, because it's going to seek to specific offsets. There are two important details it needs to take care of:

  1. It needs all the parent nodes on the path from the root to the target chunks. (This is pretty complicated.)
  2. It needs to read the target chunks at their 1024-byte chunk boundaries, even if the caller's slice begins or ends partway through a chunk. (This isn't too bad.)

If you're using an outboard encoding, you don't need to worry about (1) yourself, so if you want to avoid supplying the whole input file you only need to take care of (2). Here's an example of hacking that together, using a "fake" seeker that doesn't actually seek (because we've already given it exactly the input we know it's going to need):

// [dependencies]
// bao = "0.12.0"
// rand = "0.8.5"

use rand::prelude::*;
use std::io::prelude::*;
use std::io::{Cursor, SeekFrom};

struct FakeSeeker<R: Read> {
    reader: R,
    bytes_read: u64,
}

impl<R: Read> FakeSeeker<R> {
    fn new(reader: R) -> Self {
        Self {
            reader,
            bytes_read: 0,
        }
    }
}

impl<R: Read> Read for FakeSeeker<R> {
    fn read(&mut self, buf: &mut [u8]) -> std::io::Result<usize> {
        let n = self.reader.read(buf)?;
        self.bytes_read += n as u64;
        Ok(n)
    }
}

impl<R: Read> Seek for FakeSeeker<R> {
    fn seek(&mut self, _: SeekFrom) -> std::io::Result<u64> {
        // Do nothing and return the current position.
        Ok(self.bytes_read)
    }
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // 1 MiB of random bytes
    let mut whole_input = vec![0u8; 1 << 20];
    rand::thread_rng().fill(&mut whole_input[..]);

    // Create an outboard encoding.
    let mut obao = Vec::new();
    let mut encoder = bao::encode::Encoder::new_outboard(Cursor::new(&mut obao));
    encoder.write_all(&whole_input)?;
    encoder.finalize()?;

    // Use the outboard encoding to extract a slice for the range 2047..2049 the usual way. This
    // requires the whole input file, but it'll only read the second and third chunks.
    let mut extractor = bao::encode::SliceExtractor::new_outboard(
        Cursor::new(&whole_input[..]),
        Cursor::new(&obao[..]),
        2047,
        2,
    );
    let mut slice = Vec::new();
    extractor.read_to_end(&mut slice)?;

    // Now do the same thing, but this time supply only the second and third chunks of input. Note
    // that we need both those chunks whole, even though the target range of the slice is just a
    // couple bytes. To make this work, use a wrapper type that implements seeking as a no-op.
    let two_chunks = &whole_input[1024 * 1..1024 * 3];
    let mut extractor2 = bao::encode::SliceExtractor::new_outboard(
        FakeSeeker::new(two_chunks),
        Cursor::new(&obao[..]),
        2047,
        2,
    );
    let mut slice2 = Vec::new();
    extractor2.read_to_end(&mut slice2)?;

    // Check that we got exactly the same slice.
    assert_eq!(slice, slice2);

    Ok(())
}
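
As a follow-up, here's a minimal sketch of verifying the extracted slice, reusing whole_input and slice2 from the example above and assuming the root hash returned by encoder.finalize() was kept as hash instead of discarded:

// Assumes `hash` was captured from `encoder.finalize()?` above.
let mut decoder = bao::decode::SliceDecoder::new(&slice2[..], &hash, 2047, 2);
let mut output = Vec::new();
decoder.read_to_end(&mut output)?;
// The decoder yields exactly the requested range of the original input.
assert_eq!(&output[..], &whole_input[2047..2049]);
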
oconnor663 commented 2 years ago

Alas there's no way to do this from the command line right now, unless you can implement a "fake seeking file" at the OS level (maybe a FUSE filesystem), which sounds like a lot of trouble.

oconnor663 commented 2 years ago

@jonahkaye out of curiosity, since you're one of the first people I've talked to who actually has this use case in practice, what level of space overhead would you consider "acceptable if not ideal", and what level would you consider "great, not even thinking about it anymore"? We might want to leave this configurable, but still most users will end up with whatever default we pick.

jonahkaye commented 2 years ago

@oconnor663 yeah, configurable would be best, but I think 1% would probably be "acceptable if not ideal" in most cases, and if we had really large files to store maybe we would want something even smaller, like 0.1%? I'm not sure about that to be honest, though if things are configurable then we can test and find an equilibrium between the space and time tradeoffs.

devinrsmith commented 2 years ago

Naive Q: could the API be made to work with N applications of the existing logic?

bao encode f -n 2 --outboard f.2.obao

# "Equivalent" to:
# bao encode f --outboard f.1.obao
# bao encode f.1.obao --outboard f.2.obao
# rm f.1.obao

bao decode $hash f -n 2 --outboard f.2.obao f.decoded
cmp f f.decoded

Applying this literally to the physical obao file format isn't necessarily what I'm implying (it just serves to illustrate the concept), but maybe there is some elegance in having the tunable knob be N?

oconnor663 commented 2 years ago

@devinrsmith if I understand your idea correctly, you'd need three streams to decode that: the original single input, the first OBAO encoding, and the second OBAO encoding. So probably not what you were hoping for. The thing to remember is that decoding needs to be able to "see" what input went into each parent node, without reading the whole file to recompute everything. Without being able to stream the intermediate OBAO, it can't see that.

oconnor663 commented 2 years ago

Progress update: I have a branch up (currently one big commit) with an experimental implementation of chunk groups. Right now I'm setting the group size to 16 KiB (0.3% space overhead) and just hardcoding that. Still not totally sure whether I'll prefer a more configurable design in the end. If other folks are working on the same thing, that branch might be useful for seeing exactly which places in the code need to change. Maybe surprisingly, almost nothing changes about the overall structure of the Rust code (some structural changes in Python tests), but there are a lot of little tweaks scattered throughout.

devinrsmith commented 2 years ago

I'm not sure about the branch you have in development (maybe it's already this idea in practice), but I think I have a concrete, elegant generalization:

Imagine introducing a "level" variable to the protocol. L = 0 is how it is defined today.

L = 1 is the same encoding order, except that the lowest layer of intermediate nodes is skipped / not encoded. Thus, the overhead is cut in half, but the number of bytes a consumer needs before verification is doubled.

L = 2: skip the lowest two layers of intermediate nodes. Overhead is 1/4 of the original, and the number of bytes needed for verification is quadrupled.

L = x: skip the x lowest layers of intermediate nodes. Overhead is (1/2)^x of the original, and the number of bytes needed for verification is multiplied by 2^x.

The nice thing about this is that an L=0 encoding can efficiently serve consumers at any L >= 0. In general, an L=x encoding can serve consumers at any L >= x.
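
To make the tradeoff concrete, here's a small sketch of the arithmetic behind this hypothetical level parameter (it is not an existing bao option): each skipped layer halves the stored parent nodes and doubles the smallest unit a consumer can verify.

fn main() {
    let chunk_size: u64 = 1024; // BLAKE3 chunk size, fixed
    let base_overhead = 64.0 / chunk_size as f64; // ~6.25% at L = 0
    for level in 0u32..=6 {
        let unit = chunk_size << level; // smallest verifiable unit at this level
        let overhead = base_overhead / 2f64.powi(level as i32);
        println!(
            "L = {}: verify in units of {:>6} bytes, ~{:.3}% overhead",
            level,
            unit,
            100.0 * overhead
        );
    }
}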

rklaehn commented 1 year ago

What's the status of this?

We are building a simple sync tool using bao and quic, and this is one of the two big things we would like to have; the other is full async support.

I guess one question is whether this should be hardcoded to one particular block size, like 16384, or left configurable using const generics. Making it configurable could be done with zero runtime cost, and I can think of reasons to make this configurable. But the philosophy of blake3 and bao seems to be to pick one value.

We could live with both, but we can say for sure that 1024 byte chunks are a bit too small for our use case.

Edit: my gut feeling would have been that a good block size for our use case is somewhere between 16384 and 65536, with 65536 having the nice advantage that the outboard size is about 1/1024 of the data size, so it's very easy to compute the outboard size in your head, and you would rarely have to worry about keeping the outboard in memory.
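
For illustration, here's a purely hypothetical sketch of the const-generic flavor of that knob (this is not bao's actual API): the chunk-group size becomes a compile-time constant, so configurability costs nothing at runtime.

// Hypothetical sketch only; bao does not currently expose anything like this.
struct GroupedConfig<const GROUP_CHUNKS: usize>;

impl<const GROUP_CHUNKS: usize> GroupedConfig<GROUP_CHUNKS> {
    // 1024-byte BLAKE3 chunks per group, so the group size in bytes is:
    const GROUP_SIZE: usize = 1024 * GROUP_CHUNKS;
}

fn main() {
    // A 16 KiB group and a 64 KiB group, both chosen at compile time.
    println!("{}", GroupedConfig::<16>::GROUP_SIZE); // 16384
    println!("{}", GroupedConfig::<64>::GROUP_SIZE); // 65536
}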