haskell-unordered-containers / unordered-containers

Efficient hashing-based container types
BSD 3-Clause "New" or "Revised" License
221 stars 97 forks source link

"Subkey" concept is undocumented #425

Open sjakobi opened 2 years ago

sjakobi commented 2 years ago

There are two constants named using this term, but I can't find any other use of it in Bagwell's paper or the context of HAMTs.

https://github.com/haskell-unordered-containers/unordered-containers/blob/4da2c201a274f7f3306117b479c51aef7af61129/Data/HashMap/Internal.hs#L2361-L2362

https://github.com/haskell-unordered-containers/unordered-containers/blob/4da2c201a274f7f3306117b479c51aef7af61129/Data/HashMap/Internal.hs#L2367-L2368

From this context I suspect that a subkey is a 5-bit part of a hash. I.e. a 64-bit or 32-bit hash can be understood as a sequence of subkeys. For example:

hash32 = 0b10_00101_00100_00011_00010_00001_00000
subkeys(hash32) = [0b00000, 0b00001, 0b00010, 0b00011, 0b00100, 0b00101, 0b10]

Subsequent subkeys of a given hash are used to navigate the tree at subsequent levels:

index(hash, shift) = subkeys(hash) !! (shift / bitsPerSubkey)
sjakobi commented 2 years ago

@treeowl Does my explanation above make sense to you, or do you have a different understanding?

sjakobi commented 2 years ago

I wonder whether "sub-hash" might be a better name, since the term "key" usually refers to a k type thing instead of its hash.

sjakobi commented 2 years ago

Simply 'Index' might be another good name.

It would be good to have a proper explanation of how a HashMap tree can be navigated by following the "index path" represented by a hash. The invariant on correctly located leaf nodes could then reference this explanation.

treeowl commented 2 years ago

I think your explanation is good. I don't know which word to use. index might be simplest in context.

sjakobi commented 2 years ago

type Index = Word

indexPath :: Hash -> [Index]
indexPath = go 0
  where
     go s !h | s >= finiteBitSize @Word 0 = []
             | otherwise = index h s : go (nextShift s) h
treeowl commented 2 years ago

The problem with index is that it might be confused with an array index.

sjakobi commented 2 years ago

The problem with index is that it might be confused with an array index.

That's not entirely wrong though. The existing index function does compute the array index for the array in a Full node:

https://hackage.haskell.org/package/unordered-containers-0.2.19.1/docs/Data-HashMap-Internal.html#v:index

treeowl commented 2 years ago

Right, but the index is not the subkey.

sjakobi commented 2 years ago

Right, but the index is not the subkey.

Yes, it is?! It's just two names for the same concept: A 5-bit chunk of a Hash that can be used to index into a Bitmap or a Full node's array.

treeowl commented 2 years ago

You have to combine the the subkey with the mask to get the array index, don't you?

sjakobi commented 2 years ago

You have to combine the the subkey with the mask to get the array index, don't you?

I think you're referring to BitmapIndexed nodes now, where the array indices are computed with sparseIndex using the Bitmap stored in the node.

For a Full node, all you do is right-shift the hash by the Shift corresponding to the desired level and use the subkeyMask to get the lowest 5 bits. Then you have the index into the array.

I'll try to document all this stuff properly and let you review.

treeowl commented 2 years ago

Yeah, that's what I was talking about for sure. I think the subkey thing makes sense as a unifying principle there. But of course getting rid of Full would be better. I doubt I'll have time to work on that in the next few weeks though.

sjakobi commented 2 years ago

Even if we get rid of Full, this sub-key / index concept will remain essential to understand the HashMap internals.

treeowl commented 2 years ago

Yes. I feel like we're talking past each other somehow? I had the impression you wanted to rename subkey to index. Was that a bogus misinterpretation on my part?

sjakobi commented 2 years ago

I had the impression you wanted to rename subkey to index. Was that a bogus misinterpretation on my part?

No. I think this renaming (e.g. bitsPerSubKey -> bitsPerIndex) makes sense, as long as its sufficiently clear that an Index can not be used directly for the array stored in a BitmapIndexed node.

If making this distinction seems too hard, I'd settle for "sub-hash", e.g. bitsPerSubHash. In that case we should probably also rename the existing index function to subHash.

treeowl commented 2 years ago

I think subhash is clearer than index. I don't personally have trouble with subkey either. My main confusion is over which direction the bits go in a BitmapIndexed, since the bitwise operations don't have any examples or description of what they actually do to the bits.

sjakobi commented 2 years ago

My main confusion is over which direction the bits go in a BitmapIndexed, since the bitwise operations don't have any examples or description of what they actually do to the bits.

I did add some examples in #396. Feel free to add more or to point out what's missing!