haskell-unordered-containers / unordered-containers

Efficient hashing-based container types
BSD 3-Clause "New" or "Revised" License
221 stars 97 forks source link

Add more detailed data structure information #161

Open treeowl opened 7 years ago

treeowl commented 7 years ago

Off the top of my head:

tibbe commented 7 years ago

Are Empty subtrees trees allowed everywhere? Nowhere? Some places but not others?

They are allowed everywhere but as an optimization they should only appear in an empty map. The point of BitmapIndexed is to avoid having to store pointers to Empty in the tree. It would be possible to refactor the data structure to

data HashMap k v = Empty | Tree k v
data Tree k v = ...  -- the rest

I don't know if that's a win or not. One related thing I've considered is to have

data HashMap k v = Empty | Root k v
data Root k v = Root !Size !(Tree k v)
data Tree k v = ...

if we ever wanted to stored the size (for O(1) lookup). Function like insert would have to return the diff in size and update the root.

Exactly how is the bitmap represented? That seems relatively tricky, and particularly important for improving the test suite.

Just like the HAMT data structure presented in the original paper: the N bits of the bitmap represent the N-subtrees that are present. The array only contains the present subtrees with no empty trees in-between. We use popcount to go between the logical and compact indices.

What is the endianness of hash consumption?

Are you asking which end of the hash we use first when indexing or are you asking how hashable hashes various machine word-sized types?

treeowl commented 7 years ago

On Jun 26, 2017 11:56 PM, "Johan Tibell" notifications@github.com wrote:

Are Empty subtrees trees allowed everywhere? Nowhere? Some places but not others?

They are allowed everywhere but as an optimization they should only appear in an empty map. The point of BitmapIndexed is to avoid having to store pointers to Empty in the tree.

Well, if they're not supposed to be in non-empty maps, that can be part of a validity check.

It would be possible to refactor the data structure to

data HashMap k v = Empty | Tree k vdata Tree k v = ... -- the rest

I don't know if that's a win or not.

Well, it adds an indirection, which would matter if you have a container of maps. I really wish the newtype optimization were available for GADTs under special circumstances; such circumstances apply here and would make this free.

One related thing I've considered is to have

data HashMap k v = Empty | Root k vdata Root k v = Root !Size !(Tree k v)data Tree k v = ...

if we ever wanted to stored the size (for O(1) lookup). Function like insert would have to return the diff in size and update the root.

I don't see how that will work for unions or intersections.

Exactly how is the bitmap represented? That seems relatively tricky, and particularly important for improving the test suite.

Just like the HAMT data structure presented in the original paper: the N bits of the bitmap represent the N-subtrees that are present. The array only contains the present subtrees with no empty trees in-between. We use popcount to go between the logical and compact indices.

So the popcount of the bitmap must equal the size of the array? That sounds like another good fact for generating test cases and checking validity. Given a logical index, we mask off all but that many bits of the bitmap and popcount to get the compact index? Given a compact index, I don't see a particularly nice way to get the logical index; does that use a binary search, or is it just never necessary?

What is the endianness of hash consumption?

Are you asking which end of the hash we use first when indexing or are you asking how hashable hashes various machine word-sized types?

I meant which end of the hash is used first, but I guess the Hashable question must interact with that for performance.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/tibbe/unordered-containers/issues/161#issuecomment-311246147, or mute the thread https://github.com/notifications/unsubscribe-auth/ABzi_ZBJH_HQHyFzZpq44P1Y6D-lRj7eks5sIH14gaJpZM4OFmPM .

tibbe commented 7 years ago

Well, if they're not supposed to be in non-empty maps, that can be part of a validity check.

Sure. Perhaps add a macro-like check like we have for other things. You might have to add it to every function that mutates the map however. Or perhaps use smart constructors that validate the invariants.

I don't see how that will work for unions or intersections.

I haven't thought it through. Perhaps we'd just recompute the size from scratch while doing the union. Each recursive call to union could return the subtree size.

I meant which end of the hash is used first, but I guess the Hashable question must interact with that for performance.

It uses the low order bits first. See D.HM.Base.index. Either could work, it depends on which side you expect to be better behaved. For example, if you were storing pointer and the Hashable instance for those was identity then you'd expect the lowest three bits to be zero on 64-bit platforms due to alignment.

treeowl commented 7 years ago

On Tue, Jun 27, 2017 at 8:49 PM, Johan Tibell notifications@github.com wrote:

Well, if they're not supposed to be in non-empty maps, that can be part of a validity check.

Sure. Perhaps add a macro-like check like we have for other things. You might have to add it to every function that mutates the map however. Or perhaps use smart constructors that validate the invariants.

I just meant that the test suite can say valid result .&&. .... in each test.

I don't see how that will work for unions or intersections.

I haven't thought it through. Perhaps we'd just recompute the size from scratch while doing the union. Each recursive call to union could return the subtree size.

Can't unions skip all the processing of subtrees that are only found in one of the maps? Descending them all just to count could be pricy.

I meant which end of the hash is used first, but I guess the Hashable question must interact with that for performance.

It uses the low order bits first. See D.HM.Base.index. Either could work, it depends on which side you expect to be better behaved. For example, if you were storing pointer and the Hashable instance for those was identity then you'd expect the lowest three bits to be zero on 64-bit platforms due to alignment.

I'd expect the pointer hash to rotate three bits to the right.

tibbe commented 7 years ago

Can't unions skip all the processing of subtrees that are only found in one of the maps? Descending them all just to count could be pricy.

Possibly. As I said I haven't thought this through fully.

I'd expect the pointer hash to rotate three bits to the right.

Yes that's a common approach (or xor high and low bits).

sjakobi commented 4 years ago

Are Empty subtrees trees allowed everywhere? Nowhere? Some places but not others?

They are allowed everywhere but as an optimization they should only appear in an empty map. The point of BitmapIndexed is to avoid having to store pointers to Empty in the tree.

Based on https://github.com/haskell-unordered-containers/unordered-containers/issues/154#issuecomment-657648901, it seems that the Eq instance relies on Empty only existing at the top-level, not as a subtree.

treeowl commented 4 years ago

Correct. Empty only for the whole tree.

sjakobi commented 4 years ago

For reference, there was some discussion of invariants in https://github.com/haskell-unordered-containers/unordered-containers/pull/282#discussion_r453606996.

IMHO it's very important that these invariants are documented (ideally in the code), and also tested.

sjakobi commented 2 years ago

444 has made some progress on documenting and checking the invariants.

Better documentation in proper prose and with examples is still needed though.

EDIT: #425 and #450 are related.