golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License

proposal: hash: export a built-in hash function for comparable values #21195

Closed: bcmills closed this issue 7 years ago

bcmills commented 7 years ago

Background

The Go runtime contains an efficient hash function that it uses to implement built-in maps.

The runtime hash function is capable of hashing arbitrary comparable values, and usually cannot be pruned away during linking: for example, it would be very difficult for the linker to determine which hash functions are needed in a program that uses reflect.MapOf to construct maps at run-time, or a program that stores interface{} values in a map[interface{}]Anything.

Occasionally, we need an efficient hash function outside the standard library, too: for example, we might want to implement other hash-based data structures, such as hash array mapped tries, concurrent tries, or prototypes for potential additions to the Go standard library (e.g. sync.Map), or write application code to compute hashes for normally-incomparable types (such as slices or maps) to be able to store them in built-in maps. (For the latter case, a built-in hash function would be a building block for a larger application-level hash function.)

Typical workarounds include:

- linking to the runtime's unexported hash functions via //go:linkname (as discussed below),
- serializing keys to strings using encoding/binary, encoding/json, or a similar package so that they can be stored in built-in maps, and
- writing and maintaining a hash function by hand for each key type.

I believe that the toil involved in implementing these workarounds discourages experimentation with more efficient data structures in Go and needlessly complicates application code.

It would be reckless for us to expose a stable hash function in the standard library. However, it seems like it would be useful to expose a per-process hash function, randomly salted at each invocation of the process, in order to facilitate the use of custom data structures in user code.

Proposal

I propose that we add the following function to the hash package:

// Local returns a hash code for a comparable argument using a hash function
// that is local to the current invocation of the program.
//
// Two values that compare equal using the == operator (and values that are
// unequal but have the same type and representation in memory, such as
// floating-point NaNs with the same bitwise representation) will have the same
// hash code. Values that compare unequal will have a high probability of
// returning different hash codes.
//
// Local panics if the argument is not comparable.
//
// Each call to Local with the same value will return the same result within the
// lifetime of the program, but each run of the program may return different
// results from previous runs.
func Local(interface{}) uintptr
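
To make the intended use concrete, here is a minimal sketch (hypothetical, not part of the proposal) of a hash-bucketed set built on Local:

type intSet struct {
  buckets [][]int // fixed number of buckets, chained on collision
}

func (s *intSet) Add(x int) {
  i := hash.Local(x) % uintptr(len(s.buckets))
  for _, e := range s.buckets[i] {
    if e == x {
      return // already present
    }
  }
  s.buckets[i] = append(s.buckets[i], x)
}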

(@orcaman @OneOfOne @aclements @dsnet)

OneOfOne commented 7 years ago

Also this would allow us to use a very efficient version of sync.Map that allows sharding by a key's hash.
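
For illustration, a sketch of that sharding (assuming the proposed hash.Local existed): keys are routed to one of several independent sync.Maps, reducing contention between goroutines that touch different shards.

type shardedMap struct {
  shards [64]sync.Map
}

func (m *shardedMap) shard(key interface{}) *sync.Map {
  // Local is seeded per process, so shard assignment is stable
  // within a run but varies across runs.
  return &m.shards[hash.Local(key)%64]
}

func (m *shardedMap) Load(key interface{}) (interface{}, bool) {
  return m.shard(key).Load(key)
}

func (m *shardedMap) Store(key, value interface{}) {
  m.shard(key).Store(key, value)
}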

dsnet commented 7 years ago

It does feel weird to me though that this would go in the runtime package. Perhaps the reflect package instead?

OneOfOne commented 7 years ago

@dsnet the function(s) in question are a part of the runtime though, for example https://github.com/golang/go/blob/5a6c58099085a8156bc42b68a7cf51b5b9c72802/src/runtime/alg.go#L38

However, it's not that big of a deal where it is as long as it matches internally with map's hash.

dsnet commented 7 years ago

I'm aware of where the function is implemented, but that doesn't always correlate to where it should be categorically located. Arguably a lot of logic in reflect is in the "runtime", but we don't move it all to the runtime package either.

bcmills commented 7 years ago

@OneOfOne, sync.Map can technically already use the runtime's hash functions (using //go:linkname shenanigans), so by itself it's not a very compelling use case. However, an exported built-in hash would make it easier to experiment with changes or variations.
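
For reference, the shenanigans look roughly like this (an unsupported sketch: runtime.strhash is a private function whose signature can change without notice, and the declaring package needs an empty .s file so the compiler accepts the body-less declaration):

import "unsafe"

//go:linkname strhash runtime.strhash
func strhash(p unsafe.Pointer, h uintptr) uintptr

// hashString hashes s with the runtime's string hash, using seed h.
func hashString(s string, h uintptr) uintptr {
  return strhash(unsafe.Pointer(&s), h)
}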

bcmills commented 7 years ago

@dsnet I don't feel strongly about which package it should go in, but reflect seems like a bit of an awkward fit: the only type-introspection required in the proposed API is the knowledge that the value is comparable.

ianlancetaylor commented 7 years ago

Previously proposed, more or less, in this golang-dev thread: https://groups.google.com/d/msg/golang-dev/vd5Nnt3Wjj0/a5XFu6d9wsIJ

Any hash function can only be discussed in conjunction with an equality function, so the docs for your proposed function should make clear that values that are == in Go terms will get the same hash value, and that values that are comparable but != will get different hash values.

You suggest that this should only be used for comparable values, but of course that means that it can not be used for slices, which would seem to leave out many real world applications. (The current runtime hash will panic if used on a slice, as can be seen using a map[interface{}]bool.) Are you sure you want to restrict only to comparable types?

If we do this I definitely don't think it should be in the runtime package. That would suggest that it is specifically the hash function used by things like map, and even if that is in fact what is implemented, it should not be implied.

bcmills commented 7 years ago

Previously proposed, more or less, in this golang-dev thread:

Thanks for the link; an interesting read. It looks like @randall77's objection there was mainly that we would have to specify and freeze the algorithm, but here I'm suggesting the opposite: if we expose a hash function, we should randomize it so that folks won't accidentally rely on the particulars.

You suggest that this should only be used for comparable values, but of course that means that it can not be used for slices, which would seem to leave out many real world applications.

That's true, but having the built-in hash would make it much easier to define application hash functions for slices: it is likely that one could use the built-in hash function for the individual elements and write only a very small function to combine them.

// hashSlice folds the elements of s into a single hash code by
// repeatedly hashing an (accumulated hash, element hash) pair.
func hashSlice(s []T) uintptr {
  h := uintptr(0)
  for _, e := range s {
    h = Hash([2]uintptr{h, Hash(e)})
  }
  return h
}

Are you sure you want to restrict only to comparable types?

I think that's a good starting point since, as you note, a hash function must be paired with an equality function. The language spec already defines an equality function, so piggybacking on that minimizes the amount of new information one would need to learn in order to use the function.

If we do this I definitely don't think it should be in the runtime package.

That's fair. Any suggestions for alternatives? It could plausibly live in hash (or a subpackage thereof), reflect (as suggested by @dsnet), or perhaps a subpackage of container.

bcmills commented 7 years ago

Any hash function can only be discussed in conjunction with an equality function, so the docs for your proposed function should make clear that values that are == in Go terms will get the same hash value, and that values that are comparable but != will get different hash values.

Updated, with an explicit caveat for NaN values (to avoid the confusion of #20660).

ianlancetaylor commented 7 years ago

Given that the standard library already has a package named "hash", I would vote for that.

dsnet commented 7 years ago

You suggest that this should only be used for comparable values, but of course that means that it can not be used for slices, which would seem to leave out many real world applications.

Comparability is well defined in Go. If we expand this, would it still perform a shallow hash of pointers or would it hash the sub-value pointed to by the pointer?

dsnet commented 7 years ago

If we expand this Hash to operate on incomparable types, ignoring hash collisions, would Hash(x) == Hash(y) be equivalent to reflect.DeepEqual(x, y)? If the answer is no, then reflect would not be a good home since these two don't seem to be consistent with each other.

Currently, the == operator is not identical to reflect.DeepEqual in the cases of pointers.
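
Concretely:

package main

import (
  "fmt"
  "reflect"
)

func main() {
  a, b := new(int), new(int)            // two distinct pointers to zero-valued ints
  fmt.Println(a == b)                   // false: == compares the pointer values
  fmt.Println(reflect.DeepEqual(a, b))  // true: DeepEqual follows pointers and compares *a to *b
}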

martisch commented 7 years ago

The runtime hash function is free to use different algorithms on a per-architecture and even per-feature-set basis (AES vs. non-AES). I think it would be nice to take this into account for the proposed standard-library hash function too, so that we can optimize the hash function used for each architecture and feature set independently if the proposal is accepted. Therefore, apart from results only being reproducible within a process, we could clarify that there should be no expectation that the algorithm and "quality" of the hash are the same across releases, architectures, or CPUs.

Another discussion could be whether we want to extend the compiler to optimize this specific standard-library hash function, e.g. into a specialized call when the type is known at compile time.

bcmills commented 7 years ago

@dsnet

If [Hash is not consistent with reflect.DeepEqual], then reflect would not be a good home

Agreed. The Hash function I have in mind (admittedly, thinking about experimental variations on sync.Map) would be consistent with == except for irreflexive values (e.g. NaNs), which implies that it cannot also be consistent with reflect.DeepEqual.

A reflect.DeepHash might be useful independently (cf. http://godoc.org/github.com/davegardnerisme/deephash and http://godoc.org/k8s.io/kubernetes/pkg/util/hash), but that's not the use case I intended to address here.

bcmills commented 7 years ago

Updated the proposal to name the function hash.Local, since the consensus seems to be that neither reflect nor runtime is a good fit.

jimmyfrasche commented 7 years ago

The first (CS-related) search result for "local hash" is https://en.wikipedia.org/wiki/Locality-sensitive_hashing, so that may not be the most appropriate name.

josharian commented 7 years ago

Some observations. I'm still not sure where I stand on this, on balance.

Another discussion could be whether we want to extend the compiler to optimize this specific standard-library hash function, e.g. into a specialized call when the type is known at compile time.

Without this, the calls will allocate+copy, and as a result probably be prohibitively expensive in the contexts in which it is most likely to be used. With compiler optimization, this ends up seeming a bit like a language change.

Another alternative is to expose hash functions for basic types (ints of various widths, strings). These are the core building blocks of the rest, which can then be composed as desired, particularly if the hash function accepts a seed (more on that below).

An alternative is to expose these via the reflect package. For example, hash could be a method on reflect.Value. Or reflect.Type could return a hash function that can be invoked with values of its type.

A good concrete test case might be how e.g. [1000]uint64 gets hashed. Passed as an interface{}, this will allocate and copy 8KB. If we expose a suite of hash functions, the caller either needs to loop 1000 times (slow) or we need to provide some variable-length memhash, which is inherently unsafe. If we expose only via reflect, then we add the readability costs of reflection, we add the high general costs of generating a reflect.Value or the cost of an indirect function call (reflect.Type), and we tie the tool chain's hands a bit (since IIRC it can do more magic when reflect is not in use). The only reasonable choice appears to be magic compiler support, since the compiler can ~safely do unsafe things.


As an aside, an interesting test of a generics proposal is whether and how it would let you write a generic hash function (attn: @ianlancetaylor).


One other observation: The runtime hash functions accept a seed. They do this, to quote @randall77, because in the case of maps:

You have to take a seed to prevent adversaries from engineering a collision. Without the seed, the hash computation is deterministic and so an adversary can just run lots of keys through that hash and find a bunch that all hash to the same value.

I assume that a prime use case for exposing hash functions is to allow people to write their own maps. We should at least afford them the opportunity to write secure implementations. That is, any exposed hash function should accept a seed, just like the runtime's do.

And as observed above, accepting a seed allows the programmer to compose hash calls as necessary.
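
For example (a hypothetical sketch; hashString and hashUint64 are made-up FNV-style stand-ins, not the runtime's real algorithms), per-field hashes can be chained by feeding each result in as the next seed:

func hashUint64(x uint64, seed uintptr) uintptr {
  h := seed
  for i := uint(0); i < 8; i++ {
    h = (h ^ uintptr(x>>(8*i)&0xff)) * 16777619
  }
  return h
}

func hashString(s string, seed uintptr) uintptr {
  h := seed
  for i := 0; i < len(s); i++ {
    h = (h ^ uintptr(s[i])) * 16777619
  }
  return h
}

type key struct {
  name string
  id   uint64
}

// hashKey threads the seed through each field: each field's hash
// seeds the next, so the result depends on the seed and every field.
func hashKey(k key, seed uintptr) uintptr {
  h := hashString(k.name, seed)
  return hashUint64(k.id, h)
}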

bcmills commented 7 years ago

@josharian

Another discussion could be whether we want to extend the compiler to optimize this specific standard-library hash function, e.g. into a specialized call when the type is known at compile time.

Without this, the calls will allocate+copy, and as a result probably be prohibitively expensive

Why would the calls allocate? With sync.Map I found that the compiler's escape analysis is pretty reliable: the built-in hash function should not allow the values passed to it to escape, so any allocations for the interface{} value to pass to it should occur (inexpensively) on the stack.

Mid-stack inlining (#19348) should omit that allocation entirely if the value is not already an interface type, since the top-level hash function will dispatch based on a constant type tag.

A good concrete test case might be how e.g. [1000]uint64 gets hashed. Passed as an interface{}, this will allocate and copy 8KB.

Ahh, now I see what you're getting at. But that should still be a general optimization ("don't copy interface payloads when they are treated as const and do not escape"), not something specific to the hash function. (It's the same optimization as #18822, but made much easier because the call does not occur through an interface.)

aclements commented 7 years ago

Without this, the calls will allocate+copy, and as a result probably be prohibitively expensive in the contexts in which it is most likely to be used. With compiler optimization, this ends up seeming a bit like a language change.

I agree that the copy is a problem. The function could instead be defined to take a pointer to the value to be hashed so that it doesn't have to be copied.

If we do that, I don't think allocation is a problem either, since escape analysis should already realize that we're not escaping the pointer in the interface.

Arguably, the compiler could avoid the copy even if we didn't pass a pointer if it could prove that neither the hash function nor the caller mutates the value, but I'd be fine with passing a pointer. :)

bcmills commented 7 years ago

@josharian

That is, any exposed hash function should accept a seed, just like the runtime's do.

That's part of the proposal already, I think? I proposed that each invocation of the program should compute a process-local seed that the hash function uses implicitly; that way, users won't have to figure out how to pick an appropriate seeding algorithm, and programs will have at least a basic measure of collision-resistance by default.

aclements commented 7 years ago

I proposed that each invocation of the program should compute a process-local seed that the hash function uses implicitly

I think @josharian is talking about the per-invocation seed. For example, every map in a process has a different seed, so even if someone manages to create a collision in one map, it won't carry over to another. This is in addition to the per-process seed.

rsc commented 7 years ago

I'm super-skeptical about exposing any hash function. In the case of the current map implementation, if we did ever want to do things like moving pointers (which would influence the hashes) we know where the maps are and where the hash values are written, so we could rehash. Exposing a new hash function limits our ability to make such changes, with no clear benefit. Existing packages are using other hash functions just fine without our help.

josharian commented 7 years ago

that should still be a general optimization ("don't copy interface payloads when they are treated as const and do not escape")

The runtime hash functions currently make indirect function calls, including to routines written in assembly. I'm very skeptical that escape analysis is going to plumb those depths without explicitly teaching the compiler about these special cases. Mid-stack inlining doesn't help, because the value eventually escapes.

The function could instead be defined to take a pointer to the value to be hashed so that it doesn't have to be copied.

This seems more promising, although it's a bit of a footgun: people have to remember to pass a pointer to the value to be hashed, even when that value is itself already a pointer (that is, a *T instead of a T, or a **T instead of a *T). I see this going wrong on a regular basis. And there's no clear way that we can help catch that mistake at compile time or at run time, at least not without a language change.

I think @josharian is talking about the per-invocation seed.

I was indeed, although it now occurs to me that that can be simulated by just hashing a struct containing the value you care about and a seed. So then the question is whether it is important/useful enough to impose it in the API signature.
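
For instance (a hypothetical sketch): a map implementation could draw a random seed at construction time and hash a (seed, key) pair through the seedless API:

type seededKey struct {
  seed uintptr
  key  string
}

type stringMap struct {
  seed uintptr // drawn randomly at construction time
  // bucket storage omitted
}

func (m *stringMap) hash(k string) uintptr {
  // Hashing the (seed, key) pair gives each map instance its own
  // effective hash function without changing Local's signature.
  return hash.Local(seededKey{m.seed, k})
}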

if we did ever want to do things like moving pointers (which would influence the hashes) we know where the maps are and where the hash values are written, so we could rehash. Exposing a new hash function limits our ability to make such changes, with no clear benefit.

Agreed.

For what it's worth, the thing I have wanted over and over is access to runtime.strhash. (I tend to modify my tool chain to export it, but that's obviously not something we want people doing in general.) Exposing it appropriately would not inhibit moving pointers, nor would exporting the runtime's int- or float-hashing routines which I have also wanted, although less frequently. I think this wouldn't satisfy @bcmills's original goals, though.

bcmills commented 7 years ago

@rsc

Exposing a new hash function limits our ability to make such changes [as moving pointers], with no clear benefit.

Doesn't reflect.Value.Pointer already limit our ability to make changes that affect object identity?

Existing packages are using other hash functions just fine without our help.

Well, sort of. I think this proposal is a lot like #15292 (generic programming), in that the primary impact of the status quo is to reduce the diversity of packages that people are willing to provide, support, and/or experiment with. The difficulty of writing hash-based data structures in user code pushes people toward less efficient workarounds (such as serializing keys to strings using encoding/binary, encoding/json, or another such package) or one-off solutions (such as accepting only string keys or only int keys).
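
For example, the serialization workaround looks something like this sketch, which encodes an incomparable []int key into a string so that a built-in map can index it:

// keyOf flattens xs into a string via fixed-width encoding, paying
// an allocation and a copy on every map operation.
func keyOf(xs []int) string {
  var b bytes.Buffer
  for _, x := range xs {
    binary.Write(&b, binary.LittleEndian, int64(x))
  }
  return b.String()
}

func example() {
  counts := map[string]int{}
  counts[keyOf([]int{1, 2, 3})]++ // usable as a map key, at the cost of an encode per op
}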

ScottMansfield commented 7 years ago

I haven't seen anyone mention this, so I'll throw in this use case as well: disk-backed data structures. From what I've seen here, if you have a memory-mapped data structure relying on hashes, you'd be unable to take advantage of a hash function that doesn't allow a seed to be passed in. Even then, if the implementation is opaque, it would be unwise as an implementer to rely on it not changing between versions unless it's explicitly called out as stable. I saw @bcmills mention that @randall77 had rejected the idea in the golang-dev thread because of the specification that would be required, but in this case that specification is exactly what's needed.

To make things more concrete: suppose you're building a database system that is writing a log to a mmap'd file and generating hash values to verify data. If the hash function changes on different architectures and releases, it would not work at all.

tv42 commented 7 years ago

@bcmills

if we expose a hash function, we should randomize it so that folks won't accidentally rely on the particulars.

Please consider an API that lets callers rerandomize that. Something like func New() Hasher.

E.g. a good DoS-safe map implementation would randomize differently for different variables, and even rerandomize when growing.

If you let a long-running process run with a single randomization, attackers can probe it and discover the logic behind the randomization.

tv42 commented 7 years ago

@ScottMansfield Persistent data structures sound like a use case that should specify their own hashing, e.g. say "xxhash of a byte sequence consisting of ..."; the hash function proposed here is meant for very different use.

rsc commented 7 years ago

It seems like you would need to separate "create a hash function (pick a random seed)" from "start hashing a particular value with that function". If you do that, and if you are sensitive to allocation (i.e., want to avoid allocating), then you might have something like:

package somehash

// A HashFunction encapsulates a randomly chosen seed.
type HashFunction struct { ... }
func New() HashFunction

// A Hasher accumulates the hash of a sequence of values under one seed.
type Hasher struct { ... }
func (f HashFunction) Start() Hasher

func (h *Hasher) Int(x int)
func (h *Hasher) Bytes(x []byte)
func (h *Hasher) String(s string)
func (h *Hasher) Sum() uint64
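
Used, hypothetically, like:

f := somehash.New()  // pick a random seed once
h := f.Start()       // begin hashing one value
h.String("hello")
h.Int(42)
fmt.Println(h.Sum()) // stable for this f; different for another New()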

That seems fine, but it doesn't seem like it needs to be in the standard library. Why not experiment outside the standard library with something like this first?

rsc commented 7 years ago

I think the path forward here would be to experiment outside the standard library, as mentioned in the previous comment. That would help us gather some evidence about how well it works, how much it's needed and so on. Closing until we have that evidence.

larytet commented 6 years ago

You can add to the list of "typical workarounds" the need to compute the hash by hand, as in https://github.com/larytet/mcachego/blob/master/hashtable/hashtable.go#L141