One approach is to define
```go
type Hasher[T any] interface {
	Hash(T) uint64
	Equal(T, T) bool
}
```
A generic hash table with keys of type T will take a value of type Hasher[T], and invoke the methods of that value to compute hash values and to compare elements for equality.
Then we can have
```go
// MakeHasher returns a Hasher[T] that invokes the hash and equal functions.
// Note that this does not use an explicit seed.
func MakeHasher[T any](hash func(T) uint64, equal func(T, T) bool) Hasher[T] { ... }

// MakeSeededHasher returns a Hasher[T] that uses a random seed.
func MakeSeededHasher[T any](hash func(T, uint64) uint64, equal func(T, T) bool) Hasher[T] {
	return &seededHasher[T]{hash: hash, equal: equal, seed: randUint64()}
}

type seededHasher[T any] struct {
	hash  func(T, uint64) uint64
	equal func(T, T) bool
	seed  uint64
}

func (sh *seededHasher[T]) Hash(v T) uint64 {
	return sh.hash(v, sh.seed)
}

func (sh *seededHasher[T]) Equal(a, b T) bool {
	return sh.equal(a, b)
}

func (sh *seededHasher[T]) SetSeed(seed uint64) {
	sh.seed = seed
}

// MakeMapHashHasher returns a Hasher[T] that uses [maphash.Comparable].
func MakeMapHashHasher[T comparable](seed maphash.Seed) Hasher[T] {
	return mapHashHasher[T]{seed}
}

type mapHashHasher[T comparable] struct {
	seed maphash.Seed
}

func (mhh mapHashHasher[T]) Hash(v T) uint64 {
	return maphash.Comparable(mhh.seed, v)
}

func (mhh mapHashHasher[T]) Equal(a, b T) bool {
	return a == b
}
```
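For concreteness, here is a rough sketch, not part of the proposal, of how a generic hash table might consume such a Hasher[T] value; the HashTable, bucket, and New names are made up for illustration:

```go
// HashTable is a hypothetical open-addressing table keyed by T.
// Capacity must be a power of two; there is no resizing in this sketch.
type HashTable[T, V any] struct {
	hasher  Hasher[T]
	buckets []bucket[T, V]
}

type bucket[T, V any] struct {
	used bool
	key  T
	val  V
}

// New returns a table that hashes and compares keys with the given Hasher.
func New[T, V any](h Hasher[T], capacity int) *HashTable[T, V] {
	return &HashTable[T, V]{hasher: h, buckets: make([]bucket[T, V], capacity)}
}

// Get shows where the Hasher's two methods are invoked.
func (t *HashTable[T, V]) Get(key T) (V, bool) {
	mask := uint64(len(t.buckets) - 1)
	for i := t.hasher.Hash(key) & mask; t.buckets[i].used; i = (i + 1) & mask {
		if t.hasher.Equal(t.buckets[i].key, key) {
			return t.buckets[i].val, true
		}
	}
	var zero V
	return zero, false
}
```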
> (The minimal answer without Seeds is definitely the wrong one, as you explained before giving it.)
Indeed, I was hoping to confirm Cunningham's law.
> `func (sh *seededHasher[T]) Hash(v T) uint64 { return sh.hash(v, sh.seed) }`
This approach doesn't support a per-table seed.
That said, I still don't really understand how even a per-table seed provides a defense against flooding unless it is passed through the hash function itself. If the attacker knows the hash function and can provide all the keys in the table, then it's not enough to transform the hashes at the end, since they will all be transformed in the same way, whatever that is, preserving collisions. It seems to me that the seed would need to be threaded through the hash function so that whether two keys' hashes collide is impossible to predict from the sequence of elements that are incorporated into the hash.
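To illustrate that argument with a minimal sketch (the function names are made up, and FNV-1a merely stands in for any predictable hash): a seed that only perturbs the finished hash value is a function of that value, so colliding keys keep colliding.

```go
// fixedHash stands in for a predictable, unseeded hash (FNV-1a here); an
// attacker who knows the algorithm can construct keys that collide under it.
func fixedHash(k string) uint64 {
	h := uint64(14695981039346656037)
	for i := 0; i < len(k); i++ {
		h ^= uint64(k[i])
		h *= 1099511628211
	}
	return h
}

// badSeededHash mixes the seed in only after the key has been hashed.
// If fixedHash(k1) == fixedHash(k2), then badSeededHash(k1, s) ==
// badSeededHash(k2, s) for every seed s: the post-processing is applied to an
// already-colliding value, so the attacker's collisions survive. A defense
// needs the seed to influence how each element of the key is mixed in.
func badSeededHash(k string, seed uint64) uint64 {
	return fixedHash(k) ^ (seed * 0x9e3779b97f4a7c15)
}
```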
> This approach doesn't support a per-table seed.

I don't think that's super important. A per-process seed is probably good enough.

> That said, I still don't really understand how even a per-table seed provides a defense against flooding unless it is passed through the hash function itself.
Yes, it must be passed through the hash function itself. Ian's proposed code does that.
I believe that MakeSeededHasher does support a per-table seed, because you can pass a call to MakeSeededHasher to the hash table's Make function. Each hash table will then be using a different seed.

That is, I was thinking in terms of passing a Hasher[T] as a value when making a hash table. But there is another approach, which is to make Hasher[T] a type parameter of the hash table. That approach has the advantage that it permits inlining of the hash and equality methods. It still requires that the hash table include a value of type Hasher[T], and we can arrange for that value to initialize itself with a seed when first called.
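For instance, a hedged usage sketch reusing the hypothetical New constructor sketched earlier: because each call to MakeSeededHasher draws its own random seed, every table constructed this way hashes with a different seed.

```go
// hashString stands in for any string hash that feeds the seed into its
// per-byte mixing; its body is elided because the name is illustrative only.
func hashString(s string, seed uint64) uint64 { ... }

func stringsEqual(a, b string) bool { return a == b }

func example() {
	// t1 and t2 get distinct random seeds even though they share hashString.
	t1 := New[string, int](MakeSeededHasher(hashString, stringsEqual), 64)
	t2 := New[string, int](MakeSeededHasher(hashString, stringsEqual), 64)
	_, _ = t1, t2
}
```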
TL;DR: Of the options I present below, option 3 feels like the best balance to me.
Note that a lot of this discussion is.. uh.. rehashing #69559.
I worry that @ianlancetaylor's API proposed in https://github.com/golang/go/issues/70471#issuecomment-2489681099 (or one where a hash table is parameterized by the hasher type) makes it easier to write a non-seeded hash function than a seeded one. I'd much rather have an API that makes it an obvious mistake not to seed the hash function.
Also, I think it's important to compare what impact different API options would have on a custom hash table type. Otherwise we're operating in a vacuum.
Setting aside for a moment the type of the seed, option 1 is an obvious extension of @adonovan's API:
```go
type HashFunc[T any] func(Seed, T) uint64
```
This leads to a hash table API like:
```go
type EqFunc[T any] func(T, T) bool

type HashTable[T any] struct {
	seed Seed
	hash HashFunc[T]
	eq   EqFunc[T]
	...
}

func NewHashTable[T any](hash HashFunc[T], eq EqFunc[T]) *HashTable[T]
```
@mateusz834 proposed this here.
Option 1 requires the hash table to store three words for its hasher, and calls to the hasher and equal are indirect.
Option 2 is to do this same transformation, but to @ianlancetaylor's Hasher interface:
```go
type Hasher[T any] interface {
	Hash(Seed, T) uint64
	Equal(T, T) bool
}

// Example hash table
type HashTable[T any, Hash Hasher[T]] struct {
	hash Hash // Probably zero-sized
	seed Seed
	...
}
```
Option 2 most likely stores just the seed in the hash table, which is minimal, and calls to the hasher and equal functions will be direct. However, the type implementing Hasher will almost always be an empty struct, so we're in the land of pure type metaprogramming here, which isn't wrong but sure feels weird. This definitely suffers from the lack of type-type inference.
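To make the "empty struct" point concrete, here is a sketch of what an option-2 hasher might look like, assuming purely for illustration that Seed is maphash.Seed (the option itself leaves the seed type open):

```go
// stringHasher carries no state of its own; it exists only to select the
// hash and equality behavior at the type level.
type stringHasher struct{}

func (stringHasher) Hash(seed maphash.Seed, s string) uint64 {
	return maphash.String(seed, s)
}

func (stringHasher) Equal(a, b string) bool { return a == b }

// Without type-type inference, both type arguments must be spelled out:
//	var t HashTable[string, stringHasher]
```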
Option 3 is to make the Hasher store the seed:
```go
type Hasher[T any] interface {
	Hash(T) uint64
	Equal(T, T) bool
	Reset(Seed)
}

// Example hash table
type HashTable[T any, Hash Hasher[T]] struct {
	hash Hash // Probably just struct { Seed }
	...
}

func NewHashTable[T any, Hash Hasher[T]]() *HashTable[T, Hash] {
	ht := &HashTable[T, Hash]{}
	seed := MakeSeed()
	ht.hash.Reset(seed)
	...
}
```
This is similar to a proposal by @Merovius.
Option 3, like option 2, stores just the seed in the hash table, and calls to the hasher are direct, but suffers from the lack of type-type inference. However, it avoids pure metaprogramming. It feels slightly weird that the type implementing Hasher is mutable, but that's not too unusual.
Option 4 avoids mutability by introducing a hash constructor function:
```go
type Hasher[T any] interface {
	Hash(T) uint64
	Equal(T, T) bool
}

type NewHasher[T any, Hash Hasher[T]] func(Seed) Hash

// Example hash table
type HashTable[T any, Hash Hasher[T]] struct {
	h Hash
	...
}

func NewHashTable[T any, Hash Hasher[T]](newHasher NewHasher[T, Hash]) *HashTable[T, Hash] {
	seed := MakeSeed()
	h := newHasher(seed)
	return &HashTable[T, Hash]{h: h, ...}
}
```
This is pretty similar to option 3, but it feels like it's getting heavy on the mechanism.
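A sketch of how option 4 might be used, again assuming for illustration that Seed is maphash.Seed; the seededStringHasher and newSeededStringHasher names are made up:

```go
// seededStringHasher is immutable: it is handed its seed at construction.
type seededStringHasher struct{ seed maphash.Seed }

func (h seededStringHasher) Hash(s string) uint64   { return maphash.String(h.seed, s) }
func (h seededStringHasher) Equal(a, b string) bool { return a == b }

// newSeededStringHasher satisfies NewHasher[string, seededStringHasher].
func newSeededStringHasher(seed maphash.Seed) seededStringHasher {
	return seededStringHasher{seed}
}

// Usage:
//	t := NewHashTable[string, seededStringHasher](newSeededStringHasher)
```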
> And perhaps Seed should also provide methods for hashing integers.
As for the Seed type, this is an interesting idea, though in practice I think a lot of hash functions would either want to pass the Seed on to maphash unchanged (so it's just opaque) or will want to transform the seed into a uint64 or some bytes or something they can use directly. You could do that with the integer-hashing methods by passing, e.g., 0, 1, etc., but that doesn't feel very obvious.
Alternatively, the Seed could provide a few different ways to get a concrete seed value. E.g., it could have a Uint64 method to get it as a uint64, and a Bytes method to populate an arbitrarily large []byte. These methods would be idempotent: pure functions of the Seed's hidden value. And there would be no way to construct a Seed from any of these representations.
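A sketch of what such a Seed might look like; everything here describes a hypothetical new hash.Seed, not maphash.Seed, and the expansion scheme is just one way to make Bytes deterministic:

```go
import (
	"crypto/rand"
	"encoding/binary"
)

// Seed hides a random value chosen at construction time; there is no way to
// build a Seed from its derived representations.
type Seed struct{ v uint64 }

func MakeSeed() Seed {
	var b [8]byte
	if _, err := rand.Read(b[:]); err != nil {
		panic(err)
	}
	return Seed{binary.LittleEndian.Uint64(b[:])}
}

// Uint64 returns the seed as a uint64. Idempotent: same Seed, same value.
func (s Seed) Uint64() uint64 { return s.v }

// Bytes fills b with a deterministic expansion of the hidden value (a
// splitmix64-style generator), so a hasher that wants a wide seed can ask
// for as many bytes as it needs and always get the same ones.
func (s Seed) Bytes(b []byte) {
	state, out := s.v, uint64(0)
	for i := range b {
		if i%8 == 0 {
			state += 0x9e3779b97f4a7c15
			z := state
			z ^= z >> 30
			z *= 0xbf58476d1ce4e5b9
			z ^= z >> 27
			z *= 0x94d049bb133111eb
			out = z ^ z>>31
		}
		b[i] = byte(out >> (8 * (i % 8)))
	}
}
```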
If we want to support hash functions that derive something from Seed and use that instead of using Seed opaquely, I think we need to have something like Reset from option 3 or NewHasher from option 4 so it has a chance to derive what it needs to derive.
I don't want to touch maphash.Seed. I believe maphash.Seed is fine as it is. If we want to enable a hash function to work with different seed types, I think the straightforward way to do that would be to make Seed a type parameter on the concrete implementation, not to change maphash.Seed to be transformable into a bunch of different types. Then maphash.Seed can be one possible type argument, uint64 can be another, and if a hash function needs more bits, it can use []byte or struct { Lo, Hi uint64 } or whatever.
I think the advantage of @ianlancetaylor's design is that it makes the type of seed opaque to the hash table. I think that is a good thing. It gives maximum flexibility, while staying simple in usage. If we want to make it maximally simple to use safely, we could do something like
```go
type Hasher[T any] interface {
	Hash(T) uint64
	Equal(T, T) bool
}

// exported type, to optionally avoid indirection.
type MapHashHasher[T comparable] struct {
	Seed maphash.Seed
}

func (h *MapHashHasher[T]) Reset()           { h.Seed = maphash.MakeSeed() }
func (h MapHashHasher[T]) Hash(v T) uint64   { return maphash.Comparable(h.Seed, v) }
func (h MapHashHasher[T]) Equal(a, b T) bool { return a == b }

func MakeMapHashHasher[T comparable]() MapHashHasher[T] {
	return MapHashHasher[T]{maphash.MakeSeed()}
}

type FuncHasher[T, Seed any] struct {
	hash  func(Seed, T) uint64
	equal func(T, T) bool
	seed  Seed
}

func MakeFuncHasher[T, Seed any](hash func(Seed, T) uint64, equal func(T, T) bool, seed Seed) FuncHasher[T, Seed] {
	return FuncHasher[T, Seed]{hash, equal, seed}
}

func (h FuncHasher[T, Seed]) Hash(v T) uint64   { return h.hash(h.seed, v) }
func (h FuncHasher[T, Seed]) Equal(a, b T) bool { return h.equal(a, b) }
```
To be fair, this still makes it easy to construct a badly seeded hasher for composite types. There are two ways to deal with that:

1. Don't pass a Seed into MakeFuncHasher at all and instead just make it a fixed uint64. That way, we can choose it at random and leave it up to third parties to provide implementations with other seed types. After all, the fact that the seed type is opaque is exactly the point. (A sketch of this follows below.)
2. Have Seed provide a Reset(*rand.Rand) method. But… blegh.

With this, a hashtable API can decide itself whether it wants to parameterize over Hasher[T] (avoiding an indirection) or just take a Hasher[T] value in the constructor (avoiding extra type parameters).
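A sketch of the first option above, with randUint64 assumed to be a helper backed by crypto/rand: the seed type is fixed to uint64 and chosen by the constructor itself, so callers cannot forget to seed.

```go
// UintFuncHasher is a variant of FuncHasher above with the seed type fixed.
type UintFuncHasher[T any] struct {
	hash  func(uint64, T) uint64
	equal func(T, T) bool
	seed  uint64
}

// MakeUintFuncHasher draws the seed itself; there is no way to pass a bad one.
func MakeUintFuncHasher[T any](hash func(uint64, T) uint64, equal func(T, T) bool) UintFuncHasher[T] {
	return UintFuncHasher[T]{hash, equal, randUint64()}
}

func (h UintFuncHasher[T]) Hash(v T) uint64   { return h.hash(h.seed, v) }
func (h UintFuncHasher[T]) Equal(a, b T) bool { return h.equal(a, b) }
```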
I think ultimately, the Hasher interface not knowing about Seed is the maximally general interface. The type of the seed is, ultimately, up to the implementation, so it shouldn't appear in the interface. And I think it's easy to see that we can add whatever set of concrete implementations we want. Or perhaps don't provide any - in the beginning - putting that up to the community. Or put the base interface into the standard library (so we can use it in our APIs) but then put a bunch of concrete, experimental implementations into x/exp and see which of them get adopted to ultimately get promoted into the stdlib.
Background: Issue https://github.com/golang/go/issues/69420 proposes a types.Hash function that provides a hash operator for types.Type values that is consistent with the equivalence relation for types defined by types.Identical. The only real question in that proposal is: what is to be Go's future convention for the signature of a custom hash function?

The naive answer is func(T) uint64, where T stands for types.Type. However, hash tables with only a single hash function are vulnerable to a "hash flooding" denial-of-service attack, in which an adversary provides many inputs that all hash to the same value, causing the table to degenerate into a linked list. The defense is for each process (or particular hash table) to select a different hash function at random so that the values cannot be predicted. The hash/maphash package embodies this approach: a random Seed value determines the hash functions (maphash.String and maphash.Bytes), which depend on the seed's opaque random value. (maphash does not provide a Seed-dependent way to hash integers.)

So, perhaps custom hash functions should accept a maphash.Seed parameter. And perhaps Seed should also provide methods for hashing integers.

Proposal: In the hash package, define function types HashFunc and EqFunc that document the conventions for Go hash tables.
Here is the minimal answer, without Seeds:
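For reference, a hedged sketch of what that minimal answer might look like; the exact declarations are an assumption based on the proposal text above:

```go
package hash

// HashFunc documents the convention for an (unseeded) custom hash function:
// if the corresponding EqFunc reports x and y equal, HashFunc must return
// the same value for both.
type HashFunc[T any] func(T) uint64

// EqFunc documents the convention for a custom equivalence relation.
type EqFunc[T any] func(T, T) bool
```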
@ianlancetaylor @griesemer