Open treeowl opened 7 years ago
Are
Empty
subtrees trees allowed everywhere? Nowhere? Some places but not others?
They are allowed everywhere but as an optimization they should only appear in an empty map. The point of BitmapIndexed
is to avoid having to store pointers to Empty
in the tree. It would be possible to refactor the data structure to
data HashMap k v = Empty | Tree k v
data Tree k v = ... -- the rest
I don't know if that's a win or not. One related thing I've considered is to have
data HashMap k v = Empty | Root k v
data Root k v = Root !Size !(Tree k v)
data Tree k v = ...
if we ever wanted to stored the size (for O(1) lookup). Function like insert
would have to return the diff in size and update the root.
Exactly how is the bitmap represented? That seems relatively tricky, and particularly important for improving the test suite.
Just like the HAMT data structure presented in the original paper: the N bits of the bitmap represent the N-subtrees that are present. The array only contains the present subtrees with no empty trees in-between. We use popcount to go between the logical and compact indices.
What is the endianness of hash consumption?
Are you asking which end of the hash we use first when indexing or are you asking how hashable hashes various machine word-sized types?
On Jun 26, 2017 11:56 PM, "Johan Tibell" notifications@github.com wrote:
Are Empty subtrees trees allowed everywhere? Nowhere? Some places but not others?
They are allowed everywhere but as an optimization they should only appear in an empty map. The point of BitmapIndexed is to avoid having to store pointers to Empty in the tree.
Well, if they're not supposed to be in non-empty maps, that can be part of a validity check.
It would be possible to refactor the data structure to
data HashMap k v = Empty | Tree k vdata Tree k v = ... -- the rest
I don't know if that's a win or not.
Well, it adds an indirection, which would matter if you have a container of maps. I really wish the newtype optimization were available for GADTs under special circumstances; such circumstances apply here and would make this free.
One related thing I've considered is to have
data HashMap k v = Empty | Root k vdata Root k v = Root !Size !(Tree k v)data Tree k v = ...
if we ever wanted to stored the size (for O(1) lookup). Function like insert would have to return the diff in size and update the root.
I don't see how that will work for unions or intersections.
Exactly how is the bitmap represented? That seems relatively tricky, and particularly important for improving the test suite.
Just like the HAMT data structure presented in the original paper: the N bits of the bitmap represent the N-subtrees that are present. The array only contains the present subtrees with no empty trees in-between. We use popcount to go between the logical and compact indices.
So the popcount of the bitmap must equal the size of the array? That sounds like another good fact for generating test cases and checking validity. Given a logical index, we mask off all but that many bits of the bitmap and popcount to get the compact index? Given a compact index, I don't see a particularly nice way to get the logical index; does that use a binary search, or is it just never necessary?
What is the endianness of hash consumption?
Are you asking which end of the hash we use first when indexing or are you asking how hashable hashes various machine word-sized types?
I meant which end of the hash is used first, but I guess the Hashable question must interact with that for performance.
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/tibbe/unordered-containers/issues/161#issuecomment-311246147, or mute the thread https://github.com/notifications/unsubscribe-auth/ABzi_ZBJH_HQHyFzZpq44P1Y6D-lRj7eks5sIH14gaJpZM4OFmPM .
Well, if they're not supposed to be in non-empty maps, that can be part of a validity check.
Sure. Perhaps add a macro-like check like we have for other things. You might have to add it to every function that mutates the map however. Or perhaps use smart constructors that validate the invariants.
I don't see how that will work for unions or intersections.
I haven't thought it through. Perhaps we'd just recompute the size from scratch while doing the union. Each recursive call to union could return the subtree size.
I meant which end of the hash is used first, but I guess the Hashable question must interact with that for performance.
It uses the low order bits first. See D.HM.Base.index
. Either could work, it depends on which side you expect to be better behaved. For example, if you were storing pointer and the Hashable
instance for those was identity then you'd expect the lowest three bits to be zero on 64-bit platforms due to alignment.
On Tue, Jun 27, 2017 at 8:49 PM, Johan Tibell notifications@github.com wrote:
Well, if they're not supposed to be in non-empty maps, that can be part of a validity check.
Sure. Perhaps add a macro-like check like we have for other things. You might have to add it to every function that mutates the map however. Or perhaps use smart constructors that validate the invariants.
I just meant that the test suite can say
valid result .&&. ....
in each test.I don't see how that will work for unions or intersections.
I haven't thought it through. Perhaps we'd just recompute the size from scratch while doing the union. Each recursive call to union could return the subtree size.
Can't unions skip all the processing of subtrees that are only found in one of the maps? Descending them all just to count could be pricy.
I meant which end of the hash is used first, but I guess the Hashable question must interact with that for performance.
It uses the low order bits first. See D.HM.Base.index. Either could work, it depends on which side you expect to be better behaved. For example, if you were storing pointer and the Hashable instance for those was identity then you'd expect the lowest three bits to be zero on 64-bit platforms due to alignment.
I'd expect the pointer hash to rotate three bits to the right.
Can't unions skip all the processing of subtrees that are only found in one of the maps? Descending them all just to count could be pricy.
Possibly. As I said I haven't thought this through fully.
I'd expect the pointer hash to rotate three bits to the right.
Yes that's a common approach (or xor high and low bits).
Are
Empty
subtrees trees allowed everywhere? Nowhere? Some places but not others?They are allowed everywhere but as an optimization they should only appear in an empty map. The point of
BitmapIndexed
is to avoid having to store pointers toEmpty
in the tree.
Based on https://github.com/haskell-unordered-containers/unordered-containers/issues/154#issuecomment-657648901, it seems that the Eq
instance relies on Empty
only existing at the top-level, not as a subtree.
Correct. Empty
only for the whole tree.
For reference, there was some discussion of invariants in https://github.com/haskell-unordered-containers/unordered-containers/pull/282#discussion_r453606996.
IMHO it's very important that these invariants are documented (ideally in the code), and also tested.
Better documentation in proper prose and with examples is still needed though.
EDIT: #425 and #450 are related.
Off the top of my head:
Empty
subtrees trees allowed everywhere? Nowhere? Some places but not others?