haskell-unordered-containers / unordered-containers

Efficient hashing-based container types
BSD 3-Clause "New" or "Revised" License
221 stars 99 forks source link

Add general merge API #226

Open treeowl opened 5 years ago

treeowl commented 5 years ago

Data.Map and Data.IntMap offer merge and mergeA operations that can take unions, intersections, differences, etc. We should be able to do something quite similar here. I suspect optimal merge will have to be a bit different from mergeA rather than just a specialization, thanks to the way arrays work. Also, as I've explored in a slightly different context, mergeA can probably be implemented particularly efficiently for certain transformer stacks on top of IO or ST s, again thanks to the mutable array business. Dunno if that sort of thing is worth bothering with.

sjakobi commented 4 years ago

This addition would help us address the feature requests #240 and #242.

I guess implementing merge and testing it against the Data.Map version would be a good first step!?

sjakobi commented 4 years ago

This should be the right type, no?

merge :: (Hashable k, Eq k)
             => SimpleWhenMissing k a c -- ^ What to do with keys in @m1@ but not @m2@
             -> SimpleWhenMissing k b c -- ^ What to do with keys in @m2@ but not @m1@
             -> SimpleWhenMatched k a b c -- ^ What to do with keys in both @m1@ and @m2@
             -> HashMap k a
             -> HashMap k b
             -> HashMap k c

Could the merge strategies ([Simple]WhenMissing etc) simply be copied from http://hackage.haskell.org/package/containers-0.6.2.1/docs/Data-Map-Merge-Lazy.html?

treeowl commented 4 years ago

That's the general idea. You'll probably want to look at the Data.IntMap version too. The hard part is figuring out how mergeA should work. The rest is much easier.

sjakobi commented 4 years ago

@BebeSparkelSparkel would you possibly be interested in working on this?

BebeSparkelSparkel commented 4 years ago

I am concerned about the the performance of foldr (merge f0 f1 f2) mempty (see below) in my use case #240 . It looks like it would create a lot of maps during the fold, but maybe I'm wrong about that.

sjakobi commented 4 years ago

@BebeSparkelSparkel Note that unions is also simply defined as foldl' union empty.

Do you have a more efficient implementation in mind?

treeowl commented 4 years ago

I doubt a much more generally efficient implementation of unions is possible. It'll either be foldl' union empty or foldr' union empty, depending on the Foldable.

BebeSparkelSparkel commented 4 years ago

I was mistaken. foldr (merge f0 f1 f2) mempty is incorrect. What I actually need is more like fmap f0 . foldr (merge f1 f2 f3) mempty if using merge.

240 details Foldable f => (NonEmpty a -> b) -> f (HashMap k a) -> HashMap k b has the merge and map while only creating the resulting HashMap.

It seems to me like #226 and #240 are not addressing the same problem.

sjakobi commented 4 years ago

I think I'll give merge a spin myself. It looks like a nice exercise for me to better understand the internals.

svenkeidel commented 4 years ago

@sjakobi, any progress on this? I could help. In Sturdy we need a mergeA function to implement efficient widening operators.

sjakobi commented 4 years ago

@svenkeidel I have started working on merge in https://github.com/haskell-unordered-containers/unordered-containers/compare/sjakobi/226-merge, but realized that it's a bit much for a first project for someone who isn't really familiar with the internals.

I'd be very happy if you would take this over and implement mergeA and/or merge!

I'm curious about your usecase though! Does the order in which we sequence effects matter? I wasn't sure whether we could do anything better than keeping it completely arbitrary and unspecified.

svenkeidel commented 4 years ago

I'm curious about your usecase though! Does the order in which we sequence effects matter? I wasn't sure whether we could do anything better than keeping it completely arbitrary and unspecified.

I need to combine two maps, while simultaneously finding out if a key in one of the maps was missing or one of the values in the resulting map has changed (https://github.com/svenkeidel/sturdy/blob/master/lib/src/Data/Abstract/Map.hs#L57-L67). mergeA is the perfect fit for that. The order of effects does not matter.

More generally, I think the order of map entries should not matter since hashmaps are inherently unordered, hence the name of the library. If users rely on a predictable order of entries within the map, they have to use a different datastructure.

treeowl commented 2 years ago

@oberblastmeister, how would you like to tackle this challenge while merging mechanics are still fresh in your mind?

oberblastmeister commented 2 years ago

@treeowl yes I would like to do this, but maybe after I do difference?

treeowl commented 2 years ago

If you think that'll help you prepare! Another option would be to use mergeA to implement difference.

oberblastmeister commented 2 years ago

Yeah I think I'll use mergeA to implement difference. Though probably merge, because applicatives are bad on arrays right?

sjakobi commented 2 years ago

I'd feel more comfortable about adding functions like mergeA once we have tests for the internal invariants (https://github.com/haskell-unordered-containers/unordered-containers/issues/366).

oberblastmeister commented 2 years ago

Are we planning to make merge public, or only use it to implement functions like intersection, differrence, etc. Looking at containers merge seems to be internal.

sjakobi commented 2 years ago

merge should be public, as it is in containers. See e.g. https://hackage.haskell.org/package/containers-0.6.5.1/docs/Data-Map-Merge-Strict.html#v:merge.

oberblastmeister commented 2 years ago

Woops, I must have clicked the wrong one

sjakobi commented 2 years ago

I'd feel more comfortable about adding functions like mergeA once we have tests for the internal invariants (#366).

This is resolved BTW: basic tooling for invariant checking was added in #444. More tests are being added in #455.