golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License
124.33k stars 17.7k forks source link

proposal: container: add tree-based ordered.Map/Set #60630

Open leaxoy opened 1 year ago

leaxoy commented 1 year ago

There are at least several thousand related implementations OrderedMap/OrderedSet, although we have generics, there are no related containers in the standard library, so people have to implement their own sorting containers.

So I propose add ordered.Map/Set into std, maybe container package is a suitable place.

The main API are:

type Map[K, V any] struct{ /* private fields */ }

// Constructor function
func NewMapFunc[K, V any](cmp func(K, K) int) *Map[K, V] {}
func NewMap[K cmp.Ordered, V any]() *Map[K, V] {}

// Method set
// Insert new pair and return previous value if present.
func (m *Map[K, V]) Insert(k K, v V) (V, bool) {}
// Get get value associated with k, else return empty and false.
func (m *Map[K, V]) Get(k K) (V, bool) {}
// Contains judge k in OrderedMap ?
func (m *Map[K, V]) Contains(k K) bool {}
// Delete entry by key and return stored value if present.
func (m *Map[K, V]) Delete(k K) (V, bool) {}
// Keys return all keys in ascending order.
func (m *Map[K, V]) Keys() iter.Seq[K] {}
// Values return all values in ascending order by key.
func (m *Map[K, V]) Values() iter.Seq[V] {}
func (m *Map[K, V]) Entries() iter.Seq2[K, V] {}
// Clear clear the whole map.
func (m *Map[K, V]) Clear() {}
// Retain will keep those entry which satisfy the given predicate function.
func (m *Map[K, V]) Retain(f func(K, V) bool) {}
// Len return number of entries of map
func (m *Map[K, V]) Len() int {}
// First return the minimum entry.
func (m *Map[K, V]) First() (Entry[K, V], bool) {}
// Last return the maximum entry.
func (m *Map[K, V]) Last() (Entry[K, V], bool) {}
// Reverse reverse order of all entries.
func (m *Map[K, V]) Reverse() iter.Seq2[K, V] {}
func (m *Map[K, V]) Reversed() *Map[K, V] {}
// Merge with another OrderedMap, if key exists, replace it
func (m *Map[K, V]) Merge(o *Map[K, V]) {}
// SubMap copy entries between start and end to a new Map.
func (m *Map[K, V]) SubMap(start K, end K) *Map[K, V] {}

// Wrapper of Map with unit value.
type Set[T any] struct{ inner *Map[T, struct{}] }

// Constrcutor function
func NewSetFunc[T any](cmp func(T, T) int) *Set[T] {}
func NewSet[T cmp.Ordered]() *Set[T] {}

// Method set
func (s *Set[T]) Contains(v T) bool {}
func (s *Set[T]) ContainsAll(seq iter.Seq[T]) bool {}
func (s *Set[T]) ContainsAny(seq iter.Seq[T]) bool {}
func (s *Set[T]) Add(v T) {}
func (s *Set[T]) AddSeq(seq iter.Seq[T]) {}
func (s *Set[T]) Delete(elem T) {}
func (s *Set[T]) DeleteSeq(seq iter.Seq[T]) {}
func (s *Set[T]) First() (T, bool) {} // first element or none
func (s *Set[T]) Last() (T, bool) {} // last element or none
func (s *Set[T]) All() iter.Seq[T] {}
// or
func (s *Set[T]) Values() iter.Seq[T] {} // all elements iterator
func (s *Set[T]) Reverse() iter.Seq[T] {} // reverse order iterator
func (s *Set[T]) Reversed() *Set[T] {} // reverse order then write into another set
func (s *Set[T]) Len() int {} 
func (s *Set[T]) Retain(f func(T) bool) {}
func (s *Set[T]) SupersetOf(o *Set[T]) bool {}
func (s *Set[T]) SubsetOf(o *Set[T]) bool {}
func (s *Set[T]) InterSection(o *Set[T]) *Set[T] {}
func (s *Set[T]) UnionInPlace(o *Set[T]) {}
func (s *Set[T]) Union(o *Set[T]) *Set[T] {}
func (s *Set[T]) Difference(o *Set[T]) *Set[T] {}
func (s *Set[T]) SymmetricDifference(o *Set[T]) *Set[T] {}
func (s *Set[T]) SubSet(start T, end T) *Set[T] {}

Appendix A

  1. In C++, there are ordered_set/ordered_set based on rb-tree
  2. In Java, there are SortedMap and SortedSet.
  3. In Rust, there are BTreeMap and BTreeSet

Appendix B

If we land sum type in go, those return a bool flag would be instead by Option or Maybe, it's intuitive and more suitable. Sum type is more suitable than Product Type here.

Appendix C

Since Iter should be a separate issue, so this proposal would not include Iter api. So move Iter API to end of section.

ianlancetaylor commented 1 year ago
  1. We aren't going to add any general purpose containers until we have an idea of how to implement iterators, aka how to find all the elements in the container. Iterators are pending resolution of #56413, which itself is pending the forward/backward compatibility work that will be part of Go 1.21.
  2. This should perhaps be a discussion first.
earthboundkid commented 1 year ago

There's already a set proposal that is snagged on iteration. This seems related to that. This could be containers/ordered.Set and containers/ordered.Map and have overlapping methods with containers/set.

leaxoy commented 1 year ago
  1. We aren't going to add any general purpose containers until we have an idea of how to implement iterators, aka how to find all the elements in the container. Iterators are pending resolution of user-defined iteration using range over func values #56413, which itself is pending the forward/backward compatibility work that will be part of Go 1.21.
  2. This should perhaps be a discussion first.

@ianlancetaylor I'm agree that Iter is a important feature to iterate over all element in container as you explained, but there also many features does't relay on Iter, such as: map.Get/Insert/Delete/Contains/Clear/IsEmpty/First/Last and set.Insert(Many)/Contains(All/Any)/Delete/Clear/IsEmpty. They are useful in many scenarios.

A benefit is map's Key and set's Elem can any type, not limited to cmp.Ordered.

For those internal maybe relay on Iter, like set.SuperSetOf/SubSetOf/Difference/Union/InterSection and map.Merge/Keys/Values, not exposing implementation details can leave room for the introduction of Iter later.

The rest are Iter related, like: Iter can be ecognized by the compiler and can be used in for-loop, and iterate operations can based on this, including but not limited to ForEach/Map/Filter/Reduce/Fold/Max/Min/AllOf/AnyOf/Take/Skip/Cmp/Zip/Unzip/Chunk/GroupBy ...

At least for CPP, operations on containers can be split into several parts, Iterators/Element Access/Capacity/Modifiers/List Operations/Lookup.. at Containers, Iterators works with std::ranges.

If I understand correctly, what you care about above is the Iterators part

petkostas commented 3 months ago

I am curious why an ordered map cannot be introduced or considered. As stated, there are many different implementations of it, which are increasingly being adopted. Still, they are painful to implement when using third-party libraries (especially when those involve range iterations). Because many languages have already adopted a similar functionality (some examples):

To provide a more concrete example, I am using libopenapi, which uses a custom orderedmap, rendering specifications requires the ordering to be in place, especially when visualizing the results. With the current lack of an ordered map solution, a developer must introduce another solution compatible with the native supporting packages (e.g., templates and range functionality).

petkostas commented 3 months ago

Coming back, with the use of yield things become a bit easier. I gave yield spin today over an iteration of an ordered map that previously required me to convert to a slice and loop over the slice and then over the map. I still believe that adding support for an ordered map/set makes sense.

adonovan commented 2 months ago

The Keys and Values methods (and a new All method) should definitely use go1.23 iterators rather than materializing a slice.

func (*Map[K, V]) Keys() iter.Seq[K]
func (*Map[K, V]) Values() iter.Seq[V]
func (*Map[K, V]) All() iter.Seq2[K, V]

Reverse too could return an iterator rather than a clone:

func (*Map[K, V]) Reverse() iter.Seq2[K, V]

Similarly, SubMap (renamed Range) could be an iterator for a portion of the key space. If you need to construct a new map from a Range (or All), combine the iterator with a InsertAll method that consumes the iterator.

func (*Map[K, V]) Range(from, to K) iter.Seq2[K, V]
func (*Map[K, V]) InsertAll(iter.Seq2[K, V]) bool

What is Entry? Just a pair? Or a reference to a part of the representation? (Java's Map.Entry took the latter choice and IIRC it was a major barrier to optimized special-purpose maps, because they still had to allocate one object per entry.) Is it necessary?

I like that all the operations return the maximum amount of information in a single access (e.g. Insert reports whether there was a previous entry), as these can really improve efficiency.

Retain should be DeleteFunc, with sense inverted.

We should try to harmonize the APIs of related collections types such as sets, lists, maps, queues, and so on. Java 1.5's collections did a good job of establishing conventions across a range of data types so that, for example, they all have an add(T) bool method that reports whether the size of the collection changed (though the API has grown to the point where I can no longer hold in memory).

Naming suggestion: package ordered; type Map; func NewMap{,Func}? Or package maps; type Ordered; func NewOrdered{,Func}?

jimmyfrasche commented 2 months ago

I don't like Range since it clashes with range in a confusing way though I'm not sure what a better name is.

InsertAll should just be Insert because of maps.Insert.

:+1: to container/ordered. maps is for native maps. If there's a type in maps, it should be for wrapping a native map in something with all the usual methods for other container types.

Also, SuperSetOf should be SupersetOf.

DeedleFake commented 2 months ago

I feel like container/ordered is ever so mildly confusing because there are lot ordered container types that make no sense in it, such as just a regular linked list, though I see the appeal of ordered.Map[K, V].

jimmyfrasche commented 2 months ago

@DeedleFake I would expect inserting into an ordered.List to perform the insertion at an index that maintains sort order. So an ordered.Set that allows duplicates.

leaxoy commented 2 months ago

The Keys and Values methods (and a new All method) should definitely use go1.23 iterators rather than materializing a slice.

func (*Map[K, V]) Keys() iter.Seq[K]
func (*Map[K, V]) Values() iter.Seq[V]
func (*Map[K, V]) All() iter.Seq2[K, V]

Reverse too could return an iterator rather than a clone:

func (*Map[K, V]) Reverse() iter.Seq2[K, V]

Similarly, SubMap (renamed Range) could be an iterator for a portion of the key space. If you need to construct a new map from a Range (or All), combine the iterator with a InsertAll method that consumes the iterator.

func (*Map[K, V]) Range(from, to K) iter.Seq2[K, V]
func (*Map[K, V]) InsertAll(iter.Seq2[K, V]) bool

What is Entry? Just a pair? Or a reference to a part of the representation? (Java's Map.Entry took the latter choice and IIRC it was a major barrier to optimized special-purpose maps, because they still had to allocate one object per entry.) Is it necessary?

I like that all the operations return the maximum amount of information in a single access (e.g. Insert reports whether there was a previous entry), as these can really improve efficiency.

Retain should be DeleteFunc, with sense inverted.

We should try to harmonize the APIs of related collections types such as sets, lists, maps, queues, and so on. Java 1.5's collections did a good job of establishing conventions across a range of data types so that, for example, they all have an add(T) bool method that reports whether the size of the collection changed (though the API has grown to the point where I can no longer hold in memory).

Naming suggestion: package ordered; type Map; func NewMap{,Func}? Or package maps; type Ordered; func NewOrdered{,Func}?

@adonovan

Agreed, when this proposal was made, iterators had not yet been implemented. Now that they have been implemented, it makes sense to use iterator syntax.

The existence of Entry is a personal preference. Since seq2 has never been seen in other languages, tuple is used instead, but since go uses seq2, there is no better reason not to use it here.

Short names are indeed better than long names, changed in original proposal.

In my personal implementation it has indeed been the short name for some time

Although I still think seq2 is a less successful abstraction. ^_^

leaxoy commented 2 months ago

I don't like Range since it clashes with range in a confusing way though I'm not sure what a better name is.

InsertAll should just be Insert because of maps.Insert.

👍 to container/ordered. maps is for native maps. If there's a type in maps, it should be for wrapping a native map in something with all the usual methods for other container types.

Also, SuperSetOf should be SupersetOf.

@jimmyfrasche

Since the native map supports direct insertion of a key-value pair at the syntactic level, but this does not work here. If you insert a sequence using insert, then I cannot think of any other suitable naming for inserting the key-value.

How about use InsertSeq or InsertIter instead.

DeedleFake commented 2 months ago

InsertSeq() fits well with slices.AppendSeq().

jba commented 1 month ago

I've been looking into this topic a bit and I'd like to share some of that work. This comment is about the leading ordered map implementation, github.com/google/btree. I think it's relevant to talk about, because if it's great, or maybe even good enough, then it can satisfy the ecosystem's need for an ordered map and there's no compelling reason to add one to the standard library. Python seems to have taken a similar route: there is no ordered map in the Python standard library, but there is at least one high-quality third-party implementation.

The strengths of google/btree (as I'll call it throughout) are:

Its weaknesses are:

Last but not least, google/btree uses copy-on-write. One way to think about that is that cloning, instead of copying all at once, happens incrementally. Each clone copies a few of the shared tree nodes during each modification, until both are fully copied and no nodes are shared. Of course, if those modifications never happen, no copying happens either.

Copy-on-write provides a kind of iteration guarantee. Clone a btree, then iterate over the clone while confining all modifications to the original. The iteration will observe a snapshot of the original's items. Furthermore, the iteration is concurrency-safe (though not the individual keys and values), because the tree nodes are never written—they are always copied first.

Is copy-on-write a strength or a weakness? I would say both. It makes some operations, like iteration and keeping a snapshot of a tree, quite inexpensive. But its performance can be surprising if you are not aware of what's happening inside the implementation. For example, even after the iteration over a clone is finished and the clone becomes garbage, the original will still be copied each time a "shared" node is modified; the implementation has no good way to tell that the nodes are no longer shared.

google/btree is arguably good enough to serve as the ecosystem's ordered map of choice. On the other hand, its quirks and lack of ranged delete leave room to argue that there is a place for a more idiomatic implementation.

jba commented 1 month ago

In this second comment (the first is https://github.com/golang/go/issues/60630#issuecomment-2440277569), I'd like to suggest a modern, idiomatic API for ordered maps. The github.com/jba/omap package is based on Russ Cox's rsc.io/omap but adds a few common operations and an improved range API.

In comparison with github.com/google/btree, this implemention:

If we were to adopt an ordered map implementation, I think it would be a good starting point.

DeedleFake commented 1 month ago

@jba

It does support range-over-func, actually, just not explicitly: https://go.dev/play/p/pc574L-uHyf

leaxoy commented 1 month ago

@jba Most agree, But why is there a MapFunc type? Can it write to Map[K, V any], then use two different constructor variants

func NewMap[K, V any](cmp func(K, K) int) *Map[K, V]
func NewOrderedMap[K Ordered, V any]() *Map[K, V] {
    return NewMap[K, V](cmp.Compare[K])
}
jba commented 1 month ago

why is there a MapFunc type

Map may be much faster.

hgl commented 4 weeks ago

@jba, could Map be an interface?

If any method wants to take another map as an argument, the current two-struct approach is going to make it awkward to implement.

func (m *Map[K, V]) Merge(o *Map[K, V]) {}

Maybe an argument can be make that Map doesn't need such methods, but if Set and SetFunc are implemented on top of them, strong arguments can be made that they should have methods like Union, Join etc. It would ideal if Map and Set could use the same API style.

jba commented 3 weeks ago

@hgl Yes, we could define the interface and have two constructor functions, one for the cmp.Ordered case and one for the comparison function. That would halve the size of the API.

As for set operations, I think it's better to define them in terms of sorted iteration sequences, as in #70140.

fumin commented 1 week ago

@jba I wonder if you can confirm that github.com/jba/omap is strictly superior to github.com/google/btree? If not, what are the things that jba/omap need to improve over google/btree?

P.S. I am a Go user needing an OrderedMap for backing sparse matrices. I am not trying to troll by pitting libraries against each other, just trying to figure out what is the best library at the moment in the absence of a standard library implementation. I also suggest if we decide to leave OrderedMap outside of the standard library, perhaps we could have a Wiki page or blog post to list the recommended/canonical third-party implementations for common data structures. Currently, information on gonuts is too old, and google search results give very confusing picture, with the top search results containing extremely buggy poorly written code. It was only until I stumbled upon this issue did I find a correct path forward for OrderedMaps.

jba commented 1 week ago

I am not trying to troll by pitting libraries against each other

On the contrary, I think it's important to compare these packages.

I don't think there is a clear winner. In https://github.com/golang/go/issues/60630#issuecomment-2440277569 I talk in detail about the strengths and weaknesses of google/btree. The main weaknesses are the clumsy API, the lack of range delete, and the lack of iteration guarantees without doing a possibly expensive clone.

Google/btree is probably faster because jba/omap is unoptimized. For example, jba/omap allocates all new nodes on the heap instead of using a sync.Pool. There are other implementation tricks too that I haven't bothered with because the main purpose of jba/omap is to serve as a reference implementation for this proposal.

So I think it depends on your particular use case. From my point of view, I'd prefer you use jba/omap to help shake out bugs and serious performance issues, and perhaps to contribute improvements. I'd say if you're working on a personal or student project, it would be a more forward-looking choice. If you need something for serious production work, then certainly google/btree has a lot more mileage on it.

perhaps we could have a Wiki page or blog post to list the recommended/canonical third-party implementations for common data structures

The Go team talks about this from time to time, but the problems are that it's a lot of work, and also we want to remain impartial.

fumin commented 1 week ago

@jba Thanks for your helpful comparison! It looks like omap is preferred except for performance critical cases. I will be using it, and would report bugs or new ideas that may be useful.

The Go team talks about this from time to time, but the problems are that it's a lot of work, and also we want to remain impartial.

Understood, and thanks to the entire Go team for this great work all these years!