Closed cole-miller closed 4 years ago
Thanks! I merged the first two commits by themselves, since I wanted to push a fix that would have caused conflicts. I'll have a closer look at the substance of this tomorrow.
This is a decent first pass. A couple thoughts:
TREE_C
(without the threshold subscripts) that gives you the root node of the tree, since this is the artifact that we're actually after here; you describe individual levels but not where to stop.I also think the presentation suffers a bit due to the fact that it needs to refer to varying thresholds and the way the configuration is defined makes doing that awkward. I'm wondering if we should just scrap the configuration entirely and have separate parameters for these things. We probably should fix the window size as a constant anyway, which would bring us down to 4 parameters, which is manageable. But I think this can be addressed as a separate change.
perhaps we should pin down the tree terminology more explicitly at the start of the section.
Makes sense.
We should probably also have a definition for
TREE_C
(without the threshold subscripts) that gives you the root node of the tree, since this is the artifact that we're actually after here; you describe individual levels but not where to stop.
What would TREE_C
with no threshold sequence mean? The tree needs to know how many bits to examine at each level, and that information has to be passed into the top-level construction somehow. Were you thinking we'd decrease the threshold by one at each level, starting with the value in C
?
I'm wondering if we should just scrap the configuration entirely and have separate parameters for these things.
I think the configuration might make sense as an element of the highest-level presentation, but agree that for the detailed definitions using concrete parameters would be clearer.
The idea is TREE_C(X) would just be the full tree for the input X -- so it would have as exactly many levels as it needs to reach the root.
Ahh, sorry, I had the wrong mental picture of the construction. Fixes on the way.
Okay, I've rewritten the section. It took me a few tries because the most natural description of the tree construction (appearing e.g. in the Go package documentation) is bottom-up, and I wanted a recursive, top-down definition for the spec. The version I went with builds the "pruned" tree, with all the single-child nodes collapsed away, though that wouldn't be hard to change.
I ended up breaking the threshold parameter out of the configuration structure, to make the presentation easier to follow, but I left the rest of the configuration as-is for now. Finally, I've made the dependence of I(X)
on T
and C
explicit, because the tree construction uses varying values of T
.
Third try: this should fix the unboundedness problem pointed out by @zenhack, and also corrects some other errors I noticed in the process of rewriting (see the commit message 4faf2ed). The involvement of S_max
just makes everything a bit more fiddly. Perhaps I should implement both this recursive algorithm and the bottom-up version in Haskell or something and check that they give the same results.
(Waiting to fix the conflicts until this is ready to merge.)
You should probably go ahead and fix the conflicts; they're not entirely mechanical as a few definitions changed on master. In particular, I(X) is always defined (the way it was defined before had a mistake re: max block size), so that will require some reframing. I'll have another look at this after dinner.
I started working on my own language for tree construction, before I was aware that you were already doing this work. For what it's worth, and in case it helps, here's the gist of what I started writing (based on the implementation in github.com/bobg/hashsplit):
To build a tree, start with an empty level-0 node. This is "the current level 0 node." Then for each chunk K emerging from the split algorithm:
To "close" a node at level L:
After all chunks are added to the tree in this way:
@cole-miller, are you still interested in working on this?
Sorry for dropping the ball here; I'm happy to let @bobg take over.
This adds a formal description of how a tree is constructed by repeatedly splitting a sequence of bytes into segments. I'm sure there are some holes that need plugging, but I think I got at least the essentials right.
The first two commits in the PR are aesthetic tweaks to the LaTeX throughout the spec:
...
and||
with LaTeX versions,\dots
and\mathbin{\|}
(which tend to have better spacing/layout properties)max
andmin
(this improves spacing and I think is a little more readable)\frac{x}{2^n}
tox/2^n
(for inline fractions I find using a solidus works better than\frac
, which is vertically cramped)I don't mind reverting some of these if you prefer the original versions.
Closes #10.