Describe the hash tree construction

cole-miller commented 4 years ago

This adds a formal description of how a tree is constructed by repeatedly splitting a sequence of bytes into segments. I'm sure there are some holes that need plugging, but I think I got at least the essentials right.

The first two commits in the PR are aesthetic tweaks to the LaTeX throughout the spec:

replacing the ASCII notations ... and || with LaTeX versions, \dots and \mathbin{\|} (which tend to have better spacing/layout properties)
using the roman (non-italic) font variant for text subscripts like max and min (this improves spacing and I think is a little more readable)
changing \frac{x}{2^n} to x/2^n (for inline fractions I find using a solidus works better than \frac, which is vertically cramped)

I don't mind reverting some of these if you prefer the original versions.

Closes #10.

zenhack commented 4 years ago

Thanks! I merged the first two commits by themselves, since I wanted to push a fix that would have caused conflicts. I'll have a closer look at the substance of this tomorrow.

zenhack commented 4 years ago

This is a decent first pass. A couple thoughts:

It's a bit weird that we have a formal definition of something as simple as a byte, but don't have any kind of definitions prior to talking about trees and labeling nodes; perhaps we should pin down the tree terminology more explicitly at the start of the section.
We should probably also have a definition for TREE_C (without the threshold subscripts) that gives you the root node of the tree, since this is the artifact that we're actually after here; you describe individual levels but not where to stop.

I also think the presentation suffers a bit due to the fact that it needs to refer to varying thresholds and the way the configuration is defined makes doing that awkward. I'm wondering if we should just scrap the configuration entirely and have separate parameters for these things. We probably should fix the window size as a constant anyway, which would bring us down to 4 parameters, which is manageable. But I think this can be addressed as a separate change.

cole-miller commented 4 years ago

perhaps we should pin down the tree terminology more explicitly at the start of the section.

Makes sense.

We should probably also have a definition for TREE_C (without the threshold subscripts) that gives you the root node of the tree, since this is the artifact that we're actually after here; you describe individual levels but not where to stop.

What would TREE_C with no threshold sequence mean? The tree needs to know how many bits to examine at each level, and that information has to be passed into the top-level construction somehow. Were you thinking we'd decrease the threshold by one at each level, starting with the value in C?

I'm wondering if we should just scrap the configuration entirely and have separate parameters for these things.

I think the configuration might make sense as an element of the highest-level presentation, but agree that for the detailed definitions using concrete parameters would be clearer.

zenhack commented 4 years ago

The idea is TREE_C(X) would just be the full tree for the input X -- so it would have as exactly many levels as it needs to reach the root.

cole-miller commented 4 years ago

Ahh, sorry, I had the wrong mental picture of the construction. Fixes on the way.

cole-miller commented 4 years ago

Okay, I've rewritten the section. It took me a few tries because the most natural description of the tree construction (appearing e.g. in the Go package documentation) is bottom-up, and I wanted a recursive, top-down definition for the spec. The version I went with builds the "pruned" tree, with all the single-child nodes collapsed away, though that wouldn't be hard to change.

I ended up breaking the threshold parameter out of the configuration structure, to make the presentation easier to follow, but I left the rest of the configuration as-is for now. Finally, I've made the dependence of I(X) on T and C explicit, because the tree construction uses varying values of T.

cole-miller commented 4 years ago

Third try: this should fix the unboundedness problem pointed out by @zenhack, and also corrects some other errors I noticed in the process of rewriting (see the commit message 4faf2ed). The involvement of S_max just makes everything a bit more fiddly. Perhaps I should implement both this recursive algorithm and the bottom-up version in Haskell or something and check that they give the same results.

~~(Waiting to fix the conflicts until this is ready to merge.)~~

zenhack commented 4 years ago

You should probably go ahead and fix the conflicts; they're not entirely mechanical as a few definitions changed on master. In particular, I(X) is always defined (the way it was defined before had a mistake re: max block size), so that will require some reframing. I'll have another look at this after dinner.

bobg commented 4 years ago

I started working on my own language for tree construction, before I was aware that you were already doing this work. For what it's worth, and in case it helps, here's the gist of what I started writing (based on the implementation in github.com/bobg/hashsplit):

The output of the split algorithm is a sequence of "chunks,"
If H was the value of the rolling checksum when a chunk K was made, then the "level" L(K) of chunk K is Q-T, where Q is the largest integer <= 32 such that H mod 2^Q = 0 (i.e., the level is the number of trailing zeroes in the rolling checksum in excess of the threshold)
A node in a hashsplit tree also has a level
A level-0 node has chunks for children
A node at level N > 0 has children that are nodes at level N-1

To build a tree, start with an empty level-0 node. This is "the current level 0 node." Then for each chunk K emerging from the split algorithm:

Add K to the current level-0 node.
For each integer i in the range [0 .. L(K)), "close" the current level-i node

To "close" a node at level L:

If there is no node at level L+1, create one. This is "the current level L+1 node."
Add the closing level-L node as a child to the current level L+1 node.
Create a new level-L node. This is now "the current level L node."

After all chunks are added to the tree in this way:

If the current level-0 node has any children, "close" it, and "close" each "current" node above it in turn.
The highest-level "current" node is the prospective root of the tree.
[pruning] While the prospective root has only one child, set the prospective root to that child.
The prospective root is the root of the final tree.

zenhack commented 4 years ago

@cole-miller, are you still interested in working on this?

cole-miller commented 4 years ago

Sorry for dropping the ball here; I'm happy to let @bobg take over.

hashsplit / hashsplit-spec

Describe the hash tree construction #12