hashsplit / hashsplit-spec

The Unlicense
7 stars 3 forks source link

Describe the hash tree construction #12

Closed cole-miller closed 4 years ago

cole-miller commented 4 years ago

This adds a formal description of how a tree is constructed by repeatedly splitting a sequence of bytes into segments. I'm sure there are some holes that need plugging, but I think I got at least the essentials right.

The first two commits in the PR are aesthetic tweaks to the LaTeX throughout the spec:

I don't mind reverting some of these if you prefer the original versions.

Closes #10.

zenhack commented 4 years ago

Thanks! I merged the first two commits by themselves, since I wanted to push a fix that would have caused conflicts. I'll have a closer look at the substance of this tomorrow.

zenhack commented 4 years ago

This is a decent first pass. A couple thoughts:

I also think the presentation suffers a bit due to the fact that it needs to refer to varying thresholds and the way the configuration is defined makes doing that awkward. I'm wondering if we should just scrap the configuration entirely and have separate parameters for these things. We probably should fix the window size as a constant anyway, which would bring us down to 4 parameters, which is manageable. But I think this can be addressed as a separate change.

cole-miller commented 4 years ago

perhaps we should pin down the tree terminology more explicitly at the start of the section.

Makes sense.

We should probably also have a definition for TREE_C (without the threshold subscripts) that gives you the root node of the tree, since this is the artifact that we're actually after here; you describe individual levels but not where to stop.

What would TREE_C with no threshold sequence mean? The tree needs to know how many bits to examine at each level, and that information has to be passed into the top-level construction somehow. Were you thinking we'd decrease the threshold by one at each level, starting with the value in C?

I'm wondering if we should just scrap the configuration entirely and have separate parameters for these things.

I think the configuration might make sense as an element of the highest-level presentation, but agree that for the detailed definitions using concrete parameters would be clearer.

zenhack commented 4 years ago

The idea is TREE_C(X) would just be the full tree for the input X -- so it would have as exactly many levels as it needs to reach the root.

cole-miller commented 4 years ago

Ahh, sorry, I had the wrong mental picture of the construction. Fixes on the way.

cole-miller commented 4 years ago

Okay, I've rewritten the section. It took me a few tries because the most natural description of the tree construction (appearing e.g. in the Go package documentation) is bottom-up, and I wanted a recursive, top-down definition for the spec. The version I went with builds the "pruned" tree, with all the single-child nodes collapsed away, though that wouldn't be hard to change.

I ended up breaking the threshold parameter out of the configuration structure, to make the presentation easier to follow, but I left the rest of the configuration as-is for now. Finally, I've made the dependence of I(X) on T and C explicit, because the tree construction uses varying values of T.

cole-miller commented 4 years ago

Third try: this should fix the unboundedness problem pointed out by @zenhack, and also corrects some other errors I noticed in the process of rewriting (see the commit message 4faf2ed). The involvement of S_max just makes everything a bit more fiddly. Perhaps I should implement both this recursive algorithm and the bottom-up version in Haskell or something and check that they give the same results.

(Waiting to fix the conflicts until this is ready to merge.)

zenhack commented 4 years ago

You should probably go ahead and fix the conflicts; they're not entirely mechanical as a few definitions changed on master. In particular, I(X) is always defined (the way it was defined before had a mistake re: max block size), so that will require some reframing. I'll have another look at this after dinner.

bobg commented 4 years ago

I started working on my own language for tree construction, before I was aware that you were already doing this work. For what it's worth, and in case it helps, here's the gist of what I started writing (based on the implementation in github.com/bobg/hashsplit):

To build a tree, start with an empty level-0 node. This is "the current level 0 node." Then for each chunk K emerging from the split algorithm:

To "close" a node at level L:

After all chunks are added to the tree in this way:

zenhack commented 4 years ago

@cole-miller, are you still interested in working on this?

cole-miller commented 4 years ago

Sorry for dropping the ball here; I'm happy to let @bobg take over.