hashsplit / hashsplit-spec

The Unlicense
7 stars 3 forks source link

Add rationale re: requiring W <= Smin #11

Open zenhack opened 4 years ago

zenhack commented 4 years ago

In the current verison of the spec, we require the window size to be less than or equal to the minimum block size. The rationale for this isn't spelled out, we should work that into the spec. Brain-dumping it here for now:

The reason is that this avoids implementations needing to worry about the initial state of the ring buffer when computing a hash -- it can just be zeroed, because by the time a split is large enough to satisfy the size requirement, those initial contents will have slid out of the window, so can't affect the result. In typical usage the window size is much smaller than the minimum block size, so this doesn't seem like it has real drawbacks. It also makes the spec simpler.

cole-miller commented 4 years ago

So in working on a design for my Rust implementation of the spec (which is part of what sidetracked me from the tree construction PR -- sorry about that) I've run into some issues around this, and I want to suggest a different approach: remove the requirement W <= S_min, but specify that in all rolling computations the ring buffer is initialized to all zeros.

The basic problem with trying to "protect" callers from depending on the initial state of the buffer is that you have to treat the beginning of each input specially. For example, if you're writing a primitive for pairing a sequence of bytes with the corresponding rolling hashes (this is a good building block for other, more intricate computations), it's not obvious what to do with the first W bytes: yield a special value, or nothing at all, or just skip them? And you probably want to signal an error if the input sequence has length < W. Both of these make the primitive more annoying to use.

On the other hand, if the initial state of the buffer is deterministic then you can start emitting checksums right away, and all the bytes are handled uniformly. "All zeros" is an obvious canonical state, probably the default in most languages when allocating a new byte buffer, so it's hard to see this causing trouble for implementors. Rust doesn't even expose a way to create a buffer that isn't zeroed (or set to some other specific state) unless you use unsafe.

zenhack commented 4 years ago

For example, if you're writing a primitive for pairing a sequence of bytes with the corresponding rolling hashes

I'm not sure I see what the utility of this is? What would you build on top of this?

I don't have a terribly strong objection to defining the initial state of the ring buffer as all zeros if it can be expressed without complicating the spec substantially, though this might be a little fiddly since the declarative definitions don't actually talk about a ring buffer at all; that's an optimization (albeit an important one). So you'd have to define X_i = 0 for i < 0 or such, which seemed a bit awkward just to handle a case that doesn't actually matter in practice... It seemed simpler to just impose this constraint and dodge the issue.

cole-miller commented 4 years ago

That primitive appears in my code as an iterator adapter, Rolling, implementing just enough logic to do the incremental hashing. There's another adapter built on top of Rolling that computes chunk boundaries, and yet another layer for allocation/slicing. I like this design because it enables a separation of concerns and loose coupling without sacrificing much efficiency (I don't think). It also lets you use the crate to just compute hashes, which is a nice bonus.

Extending X by putting X_i = 0 for i < 0 seems pretty natural to me, at least in pure-math usage, and I don't think anything else would need to change. (We also save some space by not having to spell out the reason for the W <= S_min restriction.) I'll look for any other complications and submit a pull request.

zenhack commented 3 years ago

Per discussion on #23, I think we're going to revert back to requiring this, so I'm re-opening this issue.