golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License
124.38k stars 17.71k forks source link

proposal: container/unordered: a generic hash table with custom hash function and equivalence relation #69559

Open adonovan opened 2 months ago

adonovan commented 2 months ago

Background: In https://github.com/golang/go/issues/69420 I proposed to promote the golang.org/x/tools/go/types/typeutil.Map data type to the go/types package, modernizing it with generics. @jimmyfrasche pointed out that really the only part of that proposal that needs to be in go/types is the types.Hash function, which should have semantics consistent with types.Identical: that is, identical types must have equal hashes. So, I reduced that proposal to just the hash function, and this proposal governs the table.

Proposal: the standard library should provide a generic hash table that allows the user to specify the hash function and equivalence relation. Here is a starting point for the API:

package unordered // "container/unordered"

import "iter"

// Map[K, V] is a mapping from keys of type K to values of type V.
// Map keys are considered equivalent if they are.
type Map[K, V any] struct { ... }

// NewMap returns a new mapping.
// Keys k1, k2 are considered equal if eq(k1, k2); in that case hash(k1) must equal hash(k2).
type NewMap[K, V any](hash func(K) uint64, eq func(K, K) bool) *Map[K, V]

// All returns an iterator over the key/value entries of the map in undefined order.
func (m *Map[K, V]) All() iter.Seq2[K, V]

// At returns the map entry for the given key. It returns zero if the entry is not present.
func (m *Map[K, V]) At(key K) V

// Delete removes th//e entry with the given key, if any. It returns true if the entry was found.
func (m *Map[K, V]) Delete(key K) bool

// Keys returns an iterator over the map keys in unspecified order.
func (m *Map[K, V]) Keys() iter.Seq[K]

// Values returns an iterator over the map values in unspecified order.
func (m *Map[K, V]) Values() iter.Seq[V]

// Len returns the number of map entries.
func (m *Map[K, V]) Len() int

// Set updates the map entry for key to value, and returns the previous entry, if any.
func (m *Map[K, V]) Set(key K, value V) (prev V)

// String returns a string representation of the map's entries in unspecified order.
// Values are printed as if by fmt.Sprint.
func (m *Map[K, V]) String() string

Open questions:

Related:

gabyhelp commented 2 months ago

Related Issues and Documentation

(Emoji vote if this was helpful or unhelpful; more detailed feedback welcome in this discussion.)

prattmic commented 2 months ago

Your proposal has several semantic differences from the builtin map type. Are these intentional or oversights?

Ones I noticed:

  1. No way to provide a size hint to NewMap.
  2. Modification during iteration semantics are unspecified. Is it OK to add/delete from the map during iteration?
  3. At doesn't provide a way to tell if the key was present (ok) (you have this one in open questions).
  4. Delete reports whether the key existed. Builtin delete does not.
  5. Set returns the previous value, builtin map assignment does not.
  6. String "returns a string representation of the map's entries in unspecified order"; fmt.Print of a builtin map sorts the keys if possible.

On the implementation detail side, if we keep semantics identical to builtin maps, I suspect we could share an implementation (except for iteration). If semantics differ slightly, I think sharing would be difficult, as the builtin maps are very performance sensitive, so adding special cases may hurt overall performance. I worry a bit about maintenance burden about having many map implementations (builtin, sync.Map, this one, #60630), especially if we want them all to be high performance.

adonovan commented 2 months ago

All good questions. My personal feelings are:

More important to me than any of these items is that the designs for ordered map, unordered map and set are considered together.

benhoyt commented 2 months ago

I would much prefer the name Get instead of At. Seems to parallel Set, and matches existing similar functions I've seen. I recommend Get returning the zero value if the key is not present (simpler to use most of the time) but then having a Lookup method which returns an ok bool as well (a pain if it's not available when you do need it). This is like reflect.StructTag.Get and .Lookup, and os.Getenv vs LookupEnv.

Stable String order is very useful IMO. I think "stable" is better than "sorted", eg "10" before "2" is okay as long as it's well defined. The main thing you want is stability for comparisons (normally in tests, but not always).

Size hint - I think a sensible default is normally fine. We can always add NewMapSize later on if really needed.

I think needing the previous value in Set and Delete would be extremely rare. I'd slightly prefer not having it for a simpler API.

benhoyt commented 2 months ago

Out of interest, would this work for starlark-go's maps, or is that too specific a use case?

adonovan commented 2 months ago

I think needing the previous value in Set and Delete would be extremely rare. I'd slightly prefer not having it for a simpler API.

It's quite common to want the previous value back from set, either for its value, or to distinguish insert from update at no cost. This can be done by calling Len() before and after Set, but I think all our collections types should provide at least a boolean result from Set. Delete too for similar reasons, though deletions are rare in practice.

I agree with your other thoughts.

Out of interest, would this work for starlark-go's maps, or is that too specific a use case?

Starlark maps are insertion-ordered, and we definitely aren't promising that here. (I think the other semantics about mutation, freeze, and iteration could be implemented as a wrapper.)

jimmyfrasche commented 2 months ago

Meta:

So there are proposals for

From that, it seems like there are missing entries

atdiar commented 2 months ago

Not sure that the ordered map/set are a good idea in the stdlib. They can be easily implemented outside and the key order was randomized for a reason.

Just so that there is some amount of visibility.

(just a note in passing)

jimmyfrasche commented 2 months ago

as far as the shared interface I think they should all have:

// v = m[k]
At(K) V 

 // v, ok = m[k]
Get(K) (V, bool)

// _, ok = m[k]
Contains(K) bool

(Maybe At and Get aren't the best names but my main point is that there should be separate methods)

jimmyfrasche commented 2 months ago

To get back to this specific proposal:

The All and Keys methods iterate over Type instead of K.

There is no Values iterator.

Should the hash function return uint64 instead of uint for parity with hash/maphash?

adonovan commented 2 months ago

The All and Keys methods iterate over Type instead of K. There is no Values iterator.

Thanks, both fixed.

Should the hash function return uint64 instead of uint for parity with hash/maphash?

Probably, yes. Done.

aaronbee commented 2 months ago

We have implemented something similar at our organization: gomap.Map. It has the goal of providing the exact same semantics of Go's builtin map, but with user-provided hash and equals functions.

One difference in API from the current proposal is that the hash function takes in a maphash.Seed. The seed is maintained by the map and passed to the user's function. This makes it possible to write a hash function that makes use of maphash that doesn't need to allocate a closure to capture a maphash.Seed. It also allows the map to reset the seed when the map is emptied, similar to how the builtin map does it. And a small benefit is that the hash func has the same signature as maphash.Bytes, so you can very easily make a map that is keyed by a []byte. The downside is that its extra bloat on the map in case you don't want to use maphash.

mateusz834 commented 2 months ago

The downside is that its extra bloat on the map in case you don't want to use maphash.

I believe that by default we should use maphash or provide some easy way to use it, we should provide a primitive that is secure by default to avoid potential hash DoS attacks.

We can provide a maphash.Seed or *maphash.Hash into the hash function.

func NewMap[K, V any](hash func(maphash.Seed, K) uint64, eq func(K, K) bool) *Map[K, V]
earthboundkid commented 2 months ago

A stable string representation is useful for tests. There isn't actually a very good reason to call .String() on a Map besides for a test or debugging anyway, so the cost of sorting it is not a big deal. You're not going to display an unformatted Map to an end user and expect them to understand it.

On the method names bikeshed, I like Get(Key) Value and Lookup(Key) (Value, bool) as a pair.

aaronbee commented 2 months ago

We can provide a maphash.Seed or *maphash.Hash into the hash function.

Providing a *maphash.Hash to the hash function means that each map would have to hold one, and they are not small. A maphash.Hash is 152 bytes, compared to 8 bytes for a Seed.

mateusz834 commented 2 months ago

@aaronbee That is true, i would also prefer maphash.Seed, not doing so might cause allocation each time the hash function in executed or the Map would need to cache the maphash.Hash struct and use Reset before each call to hash.

Also that would make even more sense, with the newly introduced global functions: #54670. maphash.Comparable has the same signature, so it can be easily used for comparable types (if needed, people should prefer a builtin map type).

NewMap[int, int](maphash.Comparable[int], /*...*/)
jimmyfrasche commented 2 months ago

To work with maphash it needs 2 funcs (hash, eq) and a seed but that assumes everyone is using maphash.

A more general solution would be to take a seed type as a generic param but then you need that and a third func to handle resetting the seed on Clear (which does not appear to be part of the API yet).

That is getting unwieldy and overfits to maphash.

To avoid this, it could take an interface

type HashProvider[T any] interface {
    Hash(T) uint64
    Equal(T, T) bool
    Reset()
}
mateusz834 commented 2 months ago

that assumes everyone is using maphash. A more general solution would be to take a seed type as a generic param but then you need that and a third func to handle resetting the seed on Clear (which does not appear to be part of the API yet).

If this becomes a problem we can always add a Uint64() uint64 method to maphash.Seed, so that people can then use the seed with different hash implementations. Also it is still possible to create a seed outside of the hash closure and capture it, so it might not be a huge problem.

jimmyfrasche commented 2 months ago

The interface version doesn't limit the state to a uint64 or tie usage to maphash when it isn't needed while still allowing one to use maphash quite easily.

mateusz834 commented 2 months ago

The interface version doesn't limit the state to a uint64 or tie usage to maphash when it isn't needed while still allowing one to use maphash quite easily.

I agree with that, but that will complicate the API easily. I guess most users will never use anything else than maphash.

jimmyfrasche commented 2 months ago

it would only matter when creating a new map and you could still have

type NewMap[K, V any](hp HashProvider[K]) *Map[K, V] {/*...*/}
type NewMapFuncs[K, V any](hash func(K) uint64, eq func(K, K) bool) *Map[K, V] {
  return NewMap[K, V](&basicHasher[K]{hash, eq})
}
mateusz834 commented 2 months ago

But then the NewMapFuncs does not Seed the Hash, right?

type NewMapFuncs[K, V any](hash func(K) uint64, eq func(K, K) bool) *Map[K, V] {
  return NewMap[K, V](&basicHasher[K]{hash, eq})
}
jimmyfrasche commented 2 months ago

Here's an example of using maphash with the interface I posted

type ByteHasher struct {
    hash *maphash.Hash
}
func (b *ByteHasher) Hash(v []byte) uint64 {
    defer b.hash.Reset()
    b.hash.Write(v)
    return b.hash.Sum64()
}
func (b *ByteHasher) Equal(v0, v1 []byte) bool {
    return bytes.Equal(v0, v1)
}
// called on first use and Clear
func (b *ByteHasher) Reset() {
    if b.hash == nil {
        b.hash = &maphash.Hash{}
    }
    b.hash.SetSeed(maphash.MakeSeed())
}

var m = unordered.NewMap[[]byte, someV](&ByteHasher{})

and without using maphash at all

var m = unordered.NewMapFuncs[type.Type, someV](types.Hash, types.Identical)
Merovius commented 2 weeks ago

In my opinion, the fact that maphash.Seed is opaque makes it a poor choice for a "hash map with customizable hash function". By design, there is no guarantee of what a maphash.Seed actually is and different hash functions might have different needs. And statements like

If this becomes a problem we can always add a Uint64() uint64 method to maphash.Seed

are really just saying "seed should be a uint64", I would argue. Which I would definitely prefer to eroding the maphash API.

I agree with @jimmyfrasche that making the seed type transparent is an inherent advantage of the interface approach. I would much prefer it over "passing multiple functions". Also, because that actually makes it clearer that the different operations need to co-operate. This is something that is special about this proposal: As far as I can tell, this is the first time we would have a standard library package that needs more than one user-defined "operator" that are connected in how they behave. eq(x, y) must imply hash(x) == hash(y).

That being said, I am a bit concerned because making the seed transparent means that to support it, you intrinsically have to allocate. No matter whether we use functions or an interface. One way to solve that would be to make it a type-parameter:

type HashProvider[T, Seed any] struct {
    Hash(T, Seed) uint64
    Equal(T, T) bool
    Reset(Seed)
}

That way, you could have a HashProvider[T, maphash.Seed] implementation. And you could push Seed one level up and make it a type-parameter on Map as well and provide a Seed in the constructor.

I have also argued in various places, that the "take functions to customize operators" approach has an intrinsic downside for containers specifically: It means the container can no longer have a useful zero value. Arguably, we are used to that with map, but I would argue, experience has also shown that to be a problem. It's probably the most common reason why types don't have a useful zero value in practice.

In my opinion, for containers specifically, using methods on the element type is better. It does require additional boilerplate in some cases, yes. But it enables useful zero values and it makes it easier for the compiler to generate efficient code. Function-value fields can theoretically change, but a method can't. In my opinion, if every container-operation has to incur an indirect call because the compiler can't properly devirtualize it, it almost defeats the purpose of user-defined containers in the first place. Which is that I care deeply about specific performance characteristics.

In any case, I think my largest point with this comment is: The design space for library container types is huge and IMO still mostly unexplored - at least when it comes to exploration by the Go team itself (who I trust most when it comes to Go API design). I'm not aware of any generic containers in the standard library or x/ yet. So, at the very least, I'd like us to put anything into x/exp first. To have something discoverable to experiment with.

I'd point to x/exp/rand as an example of that being very helpful. There are several design decisions made for rand/v2 that where non-obvious before having x/exp/rand to experiment with. That is, x/exp/rand seemed like a good API for rand/v2, until we actually started using it. Mostly around exactly the problem of seeding, coincidentally.