Open nikita-volkov opened 11 years ago
@scravy If you agree, I'll get into it
@scravy Although I'm a bit conflicted here, since I have a strong suspicion that we should simply ditch the list-based implementation altogether, since I suspect it to be a notably worse performer. On the other hand, we can't jump to conclusions without benchmarking and profiling... So yeah, let's do it as it's proposed in the current issue for now, keeping in mind that we should do benchmarking and profiling some time later.
I have to admit that I would be more comfortable with Data.MultiMap being the Set based implementation, frankly: I always was. It fits better with what people expect from a multi map, and it fits better with what I expect from a multi map.
The reason for why MultiMap is backed by a List is that it does not impose any constraints on values (just like in the normal Map from Data.Map). The set based implementation on the other hand does impose constraints on values, notably Ord. Also: While the set based implementation will surely be faster in several cases (complex operations like union, intersect, etc.), I doubt that it is for the most basic insert operation. The list based implementation does not have any overhead other than looking up the key, since inserting into a list is a constant operation. Inserting into a Set as in Data.Set on the other rand requires an additional non-constant cost, depending on the number of values in the set. Plus it involves a certain overhead for performing the comparison induced by Ord and the storage-strategy of Sets, which for certain data types might be a rather costly operation.
Having said that, regarding the naming of the classes I do not think that SetMap is a bad name per se. I also do not think that Data.MultiMap.Set is a good name. I even doubt that that's a standard approach to naming conventions. Looking through the packages in the Haskell Platform I see things like IntSet, ByteString, or MArray, but no Array.Mutable, String.Bytes, or Set.Int. The thing about XY.Strict and XY.Lazy or XY.Safe or XY.Unsafe is that the implementations fulfill the same contract, i.e. they behave in the very same way except for a certain difference regarding the order of evaluations, or possible side effects, or whatever.
So here's the thing: The SetMap and the ListMap are in fact quite different from each other and do not have the exact same behavior. SetMap requires Ord for the values, ListMap requires no context at all. But most importantly: what I have just called a ListMap is a Map that allows duplicate values per key and maintains positions of these values in the collection which is associated with a key. A SetMap on the other hand does not allow duplicate values per key (but the way a Set is implemented it's an ordered set in the end).
To summarize: I do think that the behavior of a SetMap is what most people (including me) would expect from a MultiMap and therefor SetMap should be MultiMap. Furthermore there is a behavioral difference between a SetMap and a ListMap, which I think should be reflected in the name. Also there is a nice benefit in performance for a ListMap regarding the simple insertion operation (and I believe that map should be faster too) (the last word on this will have a benchmark). ListMap is however a poor choice for a name, since while IntSet is a Set of Ints, a ListMap is not a Map of Lists but a MultiMap backed by Lists, allowing for duplicate keys. MultiMapWithDuplicateValue is however not a very sane name.
Thus: I would definitely rename Data.SetMap to Data.MultiMap and make it the multi map in this package.
I do however feel (and in fact I do have that need, since I do have data to be kept in a MultiMap for which there is no instance of Ord and for which I do want to know how often it is associated with a given key) that there is a need for the list based MultiMap which is defined by its allowance for duplicate values. Maybe Data.DuplicateValuesMultiMap is not such a bad name after all? Alternatively DVMultiMap might be an, at least shorter, choice (much in the same way as CaseInsensitiveStrings are CIStrings, and MutableArrays are MArrays, etc.).
The main thing about list implementation that bugs me is that looking up inside a list is O(n). This will affect "delete" and "update" operations, since they require looking up in the collection-type elements of the underlying map. In case when a multimap has many values under a single key, this will roughly make those operations O(n), which is simply an awful performance.
Concerning the naming conventions, I was referring to the following modules:
Data.Map.Lazy
and Data.Map.Strict
Data.IntMap.Lazy
and Data.IntMap.Strict
Data.HashMap.Lazy
and Data.HashMap.Strict
Data.ByteString.Lazy
and Data.ByteString.Strict
They also all come in with default aliases, such as Data.Map
forwarding to Data.Map.Lazy
.
Also I want to bring your attention to the following argument again:
the end user will be able to simply switch the implementation by updating the import statement.
I am aware of what you are referring to. Did you actually read what I wrote (no offense, but: did you?).
Yes, I did. What are you referring to?
Data.Map.Lazy and Data.Map.Strict offer the exact same interface, while Data.MultiMap.List and Data.MultiMap.Set do not, thus it is not possible to switch the implementation by updating the import statement. The set backed multimap requires values of type forall v. Ord v => v, whereas Data.MultiMap.List does not. Also the List-based multi map allows for duplicate entries which clearly is not the same behavior as the set based multi map.
What do you think about my proposal?
I see that the list based implemenation does come with a performance penalty on certain (if not most) operations.
Ord
argument weak. What is the problem with introducing an Ord
instance for any type? Using deriving instance
it's even brain-free.One question is "what is a multimap". A data type which associates keys with values. With respect to such a definition both the ListMap and the SetMap are multimaps. They are however different kinds of multimaps, the one allowing for the duplicate association of values with a key the other one allowing no duplicates. Possible implementations for a MultiMapWithDuplicates are Map k [v] or Map k (Map v HowMuch). Implementations for MultiMapWithoutDuplicates are Map k (Set v) or Map k [v]. Both kinds of multimaps are valid multi maps IMHO (In fact I would like to call them SetMultiMap and BagMultiMap, which can each be implemented differently).
I do think that a multimap not allowing for multiple associations of a value with a key are what most people definitely think of when they hear the term multimap, but there are also other possible multimaps in their own right.
A list based SetMultiMap is of course non sense. Updating and deleting is much worse than in a set based implementation, also insertion is much slower than in the set based implementation, since the whole list needs to be scanned for duplicates. I'm completely d'accord with this view of things. Thus ditching a list based implementation of a SetMultiMap is what I would definitely go for.
However the MultiMap backed by a list in this library is that what I called a BagMultiMap and I would like to keep it around. It is surely needed much more seldom and maybe an implementation with a Map of Maps would be better.
I still suggest that there be only one SetMultiMap, called MultiMap, implemented as Map k (Set v). AFAIK we both want the same thing in that regard.
Leaves us with the question of whether to keep "that other kind of multimap", whether to implement it via Lists, and how to name it. I am pro keeping it, with a name such as "BagMultiMap", "DVMultiMap", "DuplicateValueMultiMap", "ListMultiMap" or even "ListMap", whatever we feels fits best. If you feel too uncomfortable with such a weird multimap hanging around in this library I do not have a problem with creating a multimaps library which dependes on multimap for a SetMultiMap and introduces another multimap.
ditching a list based implementation of a SetMultiMap is what I would definitely go for.
There was just a single reason why I wasn't against keeping a list-based implementation of what you call a SetMultiMap
and it's the possible lighter memory footprint. It may very well become an important factor for people seeking for that specific property. Although again we can't conclude anything about that without benchmarking and profiling. Since in my view it's a very low-priority issue, I agree not to include that kinda implementation in the library for now.
Possible implementations for a MultiMapWithDuplicates are Map k [v] or Map k (Map v HowMuch). However the MultiMap backed by a list in this library is that what I called a BagMultiMap and I would like to keep it around. It is surely needed much more seldom and maybe an implementation with a Map of Maps would be better.
If we keep it my vote is definitely for switching to Map k (Map v HowMuch)
because of an O(log n) in-cell lookups against O(n) of Map k [v]
.
If you feel too uncomfortable with such a weird multimap hanging around in this library I do not have a problem with creating a "multimaps" library which dependes in multimap for a SetMultiMap and introduces another multimap.
I do believe that "multimaps" is a better title for a library of different data-structures. On the other hand, we're already here with a "multimap" title, we can't delete this package and can't rename it, all we can do is deal with it. I don't think that breeding related packages will give us any notable benefits, but it definitely will raise a confusion concerning those libraries for both the end-users and ourselves.
Any way, you've provided some good arguments to keep what you call a BagMultiMap
, so I agree to keep it. But I have to inform you that I do not plan to maintain it. I only have time for the set-based implementation.
I still suggest that there be only one SetMultiMap, called MultiMap, implemented as Map k (Set v). AFAIK we both want the same thing in that regard.
I am not sure now. Since now we've established that the library should contain multiple competing implementations that behave differently (as opposed to just having different underlying implementations, but behaving identically, as I thought previously), it doesn't look very nice to assign a winner title to any specific one. So in this regard I believe we just need to come up with some good titles for both implementations and use no aliases. Of your suggestions I'm leaning towards Data.ListMultiMap
and Data.SetMultiMap
, but I'm not sure yet.
Okay, so it would be two different MultiMaps, SetMultiMap and
For the moment being we could just use the list-based implementation as BagMultiMap, and I'd rewrite it to a Map of Maps (probably next month). That what is MultiMap.Set in your branch would simply go as SetMultiMap. If we'd ever offer multiple implementations for SetMultiMaps or BagMultiMaps I'd be perfectly happy with adopting the naming scheme that you proposed, i.e. SetMultiMap.Set and SetMultiMap.Whatever, with SetMultiMap exporting the API of one of them two. For a 1.3-version it surely suffices to provide Data.SetMultiMap and Data.BagMultiMap .
@scravy I agree with almost everything you said. However concerning the titles there are a few things that still bug me:
On the other hand both are succinct, which is good, and I don't have any alternatives to suggest. I guess we can go with those titles for now, but I would like to leave the question of naming open in case we come up with some better titles later. That is, the abstract question, not the current issue.
The current "MultiMap/SetMap" approach feels kinda icky and makes
SetMap
sound like it has nothing to do with multimap whatsoever.I suggest to adapt the standard approach to naming data-structures having alternative implementations. Thus we may end up with the following modules structure:
Data.MultiMap.List
Data.MultiMap.Set
Data.MultiMap
- an alias to either of the implementations as a default oneInside the implementation modules they'll both have definitions of multimap named
MultiMap
and the end user will be able to simply switch the implementation by updating the import statement.