Open adonovan opened 2 months ago
Related Issues and Documentation
(Emoji vote if this was helpful or unhelpful; more detailed feedback welcome in this discussion.)
Your proposal has several semantic differences from the builtin map type. Are these intentional or oversights?
Ones I noticed:
NewMap
.At
doesn't provide a way to tell if the key was present (ok
) (you have this one in open questions).Delete
reports whether the key existed. Builtin delete
does not.Set
returns the previous value, builtin map assignment does not.String
"returns a string representation of the map's entries in unspecified order"; fmt.Print
of a builtin map sorts the keys if possible.On the implementation detail side, if we keep semantics identical to builtin maps, I suspect we could share an implementation (except for iteration). If semantics differ slightly, I think sharing would be difficult, as the builtin maps are very performance sensitive, so adding special cases may hurt overall performance. I worry a bit about maintenance burden about having many map implementations (builtin, sync.Map, this one, #60630), especially if we want them all to be high performance.
All good questions. My personal feelings are:
More important to me than any of these items is that the designs for ordered map, unordered map and set are considered together.
I would much prefer the name Get instead of At. Seems to parallel Set, and matches existing similar functions I've seen. I recommend Get returning the zero value if the key is not present (simpler to use most of the time) but then having a Lookup method which returns an ok bool as well (a pain if it's not available when you do need it). This is like reflect.StructTag.Get and .Lookup, and os.Getenv vs LookupEnv.
Stable String order is very useful IMO. I think "stable" is better than "sorted", eg "10" before "2" is okay as long as it's well defined. The main thing you want is stability for comparisons (normally in tests, but not always).
Size hint - I think a sensible default is normally fine. We can always add NewMapSize later on if really needed.
I think needing the previous value in Set and Delete would be extremely rare. I'd slightly prefer not having it for a simpler API.
Out of interest, would this work for starlark-go's maps, or is that too specific a use case?
I think needing the previous value in Set and Delete would be extremely rare. I'd slightly prefer not having it for a simpler API.
It's quite common to want the previous value back from set, either for its value, or to distinguish insert from update at no cost. This can be done by calling Len() before and after Set, but I think all our collections types should provide at least a boolean result from Set. Delete too for similar reasons, though deletions are rare in practice.
I agree with your other thoughts.
Out of interest, would this work for starlark-go's maps, or is that too specific a use case?
Starlark maps are insertion-ordered, and we definitely aren't promising that here. (I think the other semantics about mutation, freeze, and iteration could be implemented as a wrapper.)
Meta:
So there are proposals for
From that, it seems like there are missing entries
whateverPackage.Map[K, V]
and not have to worry about which map/set is usedNot sure that the ordered map/set are a good idea in the stdlib. They can be easily implemented outside and the key order was randomized for a reason.
Just so that there is some amount of visibility.
(just a note in passing)
as far as the shared interface I think they should all have:
// v = m[k]
At(K) V
// v, ok = m[k]
Get(K) (V, bool)
// _, ok = m[k]
Contains(K) bool
(Maybe At
and Get
aren't the best names but my main point is that there should be separate methods)
To get back to this specific proposal:
The All and Keys methods iterate over Type instead of K.
There is no Values iterator.
Should the hash function return uint64
instead of uint
for parity with hash/maphash
?
The All and Keys methods iterate over Type instead of K. There is no Values iterator.
Thanks, both fixed.
Should the hash function return
uint64
instead ofuint
for parity withhash/maphash
?
Probably, yes. Done.
We have implemented something similar at our organization: gomap.Map. It has the goal of providing the exact same semantics of Go's builtin map, but with user-provided hash and equals functions.
One difference in API from the current proposal is that the hash function takes in a maphash.Seed
. The seed is maintained by the map and passed to the user's function.
This makes it possible to write a hash function that makes use of maphash
that doesn't need to allocate a closure to capture a maphash.Seed
. It also allows the map to reset the seed when the map is emptied, similar to how the builtin map does it. And a small benefit is that the hash func has the same signature as maphash.Bytes
, so you can very easily make a map that is keyed by a []byte
. The downside is that its extra bloat on the map in case you don't want to use maphash
.
The downside is that its extra bloat on the map in case you don't want to use maphash.
I believe that by default we should use maphash
or provide some easy way to use it, we should provide a primitive that is secure by default to avoid potential hash DoS attacks.
We can provide a maphash.Seed
or *maphash.Hash
into the hash function.
func NewMap[K, V any](hash func(maphash.Seed, K) uint64, eq func(K, K) bool) *Map[K, V]
A stable string representation is useful for tests. There isn't actually a very good reason to call .String() on a Map besides for a test or debugging anyway, so the cost of sorting it is not a big deal. You're not going to display an unformatted Map to an end user and expect them to understand it.
On the method names bikeshed, I like Get(Key) Value
and Lookup(Key) (Value, bool)
as a pair.
We can provide a
maphash.Seed
or*maphash.Hash
into the hash function.
Providing a *maphash.Hash
to the hash function means that each map would have to hold one, and they are not small. A maphash.Hash
is 152 bytes, compared to 8 bytes for a Seed
.
@aaronbee That is true, i would also prefer maphash.Seed
, not doing so might cause allocation each time the hash function in executed or the Map would need to cache the maphash.Hash
struct and use Reset
before each call to hash.
Also that would make even more sense, with the newly introduced global functions: #54670. maphash.Comparable
has the same signature, so it can be easily used for comparable types (if needed, people should prefer a builtin map type).
NewMap[int, int](maphash.Comparable[int], /*...*/)
To work with maphash
it needs 2 funcs (hash, eq) and a seed but that assumes everyone is using maphash
.
A more general solution would be to take a seed type as a generic param but then you need that and a third func to handle resetting the seed on Clear
(which does not appear to be part of the API yet).
That is getting unwieldy and overfits to maphash
.
To avoid this, it could take an interface
type HashProvider[T any] interface {
Hash(T) uint64
Equal(T, T) bool
Reset()
}
that assumes everyone is using maphash. A more general solution would be to take a seed type as a generic param but then you need that and a third func to handle resetting the seed on Clear (which does not appear to be part of the API yet).
If this becomes a problem we can always add a Uint64() uint64
method to maphash.Seed
, so that people can then use the seed with different hash implementations. Also it is still possible to create a seed outside of the hash closure and capture it, so it might not be a huge problem.
The interface version doesn't limit the state to a uint64 or tie usage to maphash when it isn't needed while still allowing one to use maphash quite easily.
The interface version doesn't limit the state to a uint64 or tie usage to maphash when it isn't needed while still allowing one to use maphash quite easily.
I agree with that, but that will complicate the API easily. I guess most users will never use anything else than maphash.
it would only matter when creating a new map and you could still have
type NewMap[K, V any](hp HashProvider[K]) *Map[K, V] {/*...*/}
type NewMapFuncs[K, V any](hash func(K) uint64, eq func(K, K) bool) *Map[K, V] {
return NewMap[K, V](&basicHasher[K]{hash, eq})
}
But then the NewMapFuncs
does not Seed the Hash, right?
type NewMapFuncs[K, V any](hash func(K) uint64, eq func(K, K) bool) *Map[K, V] {
return NewMap[K, V](&basicHasher[K]{hash, eq})
}
Here's an example of using maphash
with the interface I posted
type ByteHasher struct {
hash *maphash.Hash
}
func (b *ByteHasher) Hash(v []byte) uint64 {
defer b.hash.Reset()
b.hash.Write(v)
return b.hash.Sum64()
}
func (b *ByteHasher) Equal(v0, v1 []byte) bool {
return bytes.Equal(v0, v1)
}
// called on first use and Clear
func (b *ByteHasher) Reset() {
if b.hash == nil {
b.hash = &maphash.Hash{}
}
b.hash.SetSeed(maphash.MakeSeed())
}
var m = unordered.NewMap[[]byte, someV](&ByteHasher{})
and without using maphash
at all
var m = unordered.NewMapFuncs[type.Type, someV](types.Hash, types.Identical)
In my opinion, the fact that maphash.Seed
is opaque makes it a poor choice for a "hash map with customizable hash function". By design, there is no guarantee of what a maphash.Seed
actually is and different hash functions might have different needs. And statements like
If this becomes a problem we can always add a
Uint64() uint64
method tomaphash.Seed
are really just saying "seed should be a uint64
", I would argue. Which I would definitely prefer to eroding the maphash
API.
I agree with @jimmyfrasche that making the seed type transparent is an inherent advantage of the interface approach. I would much prefer it over "passing multiple functions". Also, because that actually makes it clearer that the different operations need to co-operate. This is something that is special about this proposal: As far as I can tell, this is the first time we would have a standard library package that needs more than one user-defined "operator" that are connected in how they behave. eq(x, y)
must imply hash(x) == hash(y)
.
That being said, I am a bit concerned because making the seed transparent means that to support it, you intrinsically have to allocate. No matter whether we use functions or an interface. One way to solve that would be to make it a type-parameter:
type HashProvider[T, Seed any] struct {
Hash(T, Seed) uint64
Equal(T, T) bool
Reset(Seed)
}
That way, you could have a HashProvider[T, maphash.Seed]
implementation. And you could push Seed
one level up and make it a type-parameter on Map
as well and provide a Seed
in the constructor.
I have also argued in various places, that the "take functions to customize operators" approach has an intrinsic downside for containers specifically: It means the container can no longer have a useful zero value. Arguably, we are used to that with map
, but I would argue, experience has also shown that to be a problem. It's probably the most common reason why types don't have a useful zero value in practice.
In my opinion, for containers specifically, using methods on the element type is better. It does require additional boilerplate in some cases, yes. But it enables useful zero values and it makes it easier for the compiler to generate efficient code. Function-value fields can theoretically change, but a method can't. In my opinion, if every container-operation has to incur an indirect call because the compiler can't properly devirtualize it, it almost defeats the purpose of user-defined containers in the first place. Which is that I care deeply about specific performance characteristics.
In any case, I think my largest point with this comment is: The design space for library container types is huge and IMO still mostly unexplored - at least when it comes to exploration by the Go team itself (who I trust most when it comes to Go API design). I'm not aware of any generic containers in the standard library or x/
yet. So, at the very least, I'd like us to put anything into x/exp
first. To have something discoverable to experiment with.
I'd point to x/exp/rand
as an example of that being very helpful. There are several design decisions made for rand/v2
that where non-obvious before having x/exp/rand
to experiment with. That is, x/exp/rand
seemed like a good API for rand/v2
, until we actually started using it. Mostly around exactly the problem of seeding, coincidentally.
Background: In https://github.com/golang/go/issues/69420 I proposed to promote the golang.org/x/tools/go/types/typeutil.Map data type to the go/types package, modernizing it with generics. @jimmyfrasche pointed out that really the only part of that proposal that needs to be in go/types is the
types.Hash
function, which should have semantics consistent withtypes.Identical
: that is, identical types must have equal hashes. So, I reduced that proposal to just the hash function, and this proposal governs the table.Proposal: the standard library should provide a generic hash table that allows the user to specify the hash function and equivalence relation. Here is a starting point for the API:
Open questions:
Related: