google / guava

Google core libraries for Java
Apache License 2.0

Fast Hash-function, as for instance xxHash3 #6312

Open JohannesLichtenberger opened 1 year ago

JohannesLichtenberger commented 1 year ago

Do you plan on implementing more hash functions? My use case is a temporal data store that stores XML or JSON representations in a tree structure (and eventually in a custom binary format on disk). I optionally store a hash for each node, and the ancestor chain is always updated through a rolling hash whenever a node is inserted, deleted, or modified. So far I've been using SHA-256 truncated to the first 128 bits, with BigInteger for some of the rolling-hash computations. The whole approach is admittedly naive; I thought that to reduce hash collisions I'd probably need a hardware-accelerated SHA-256, since I use the hashes for two things. On the one hand, for a diff (which is probably not needed anymore, as I'm now doing change tracking between revisions). On the other hand, for a simple optimistic locking scheme: if a subtree changes between a read and a write, abort the transaction.

So I guess, for my use cases, something much faster would be great, for instance xxHash3. Do you have any plans, or should I use another library for this? In general, I've read that Guava's hashing doesn't have the fastest implementations, but I'm not sure if that's true. At least the @Beta annotation has been around for many years now.

kluever commented 1 year ago

Hey Johannes,

We have a small comparison of the hash functions available here.

tl;dr is that SHA256 is quite slow (it uses Java's MessageDigest under the hood), and not likely to be improved upon. Based on your notes, it sounds like you only need 128 bits, so perhaps murmur3_128() would work better for you? It should be about 3x faster than SHA256 according to our benchmarks.

Also, if you don't need stability across runs, consider just using goodFastHash(128) and we'll give you something "good and fast".
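The two suggestions above can be sketched as follows, assuming Guava is on the classpath (class name and sample input are made up for illustration):

```java
import com.google.common.hash.HashCode;
import com.google.common.hash.HashFunction;
import com.google.common.hash.Hashing;
import java.nio.charset.StandardCharsets;

public class GuavaHashDemo {
  public static void main(String[] args) {
    // Stable across JVM runs and Guava releases: murmur3_128 always yields
    // the same 128-bit hash for the same input, so it is safe to persist.
    HashFunction murmur = Hashing.murmur3_128();
    HashCode persistent = murmur.hashString("node-42", StandardCharsets.UTF_8);
    System.out.println(persistent.bits()); // 128

    // NOT stable across runs: goodFastHash may change its algorithm and
    // seed, so use it only for in-memory, single-process hashing.
    HashFunction ephemeral = Hashing.goodFastHash(128);
    HashCode inMemory = ephemeral.hashString("node-42", StandardCharsets.UTF_8);
    System.out.println(inMemory.bits() >= 128); // true
  }
}
```

The choice between the two comes down to whether the hashes ever leave the process: persisted or replicated hashes need the stability guarantee of a named function like `murmur3_128()`.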

As for plans to implement additional hashing algorithms: we don't have anything on our radar. I'm by no means a hashing expert, but I hadn't heard of xxHash3 before. I think we'd need some additional evidence that it would be broadly useful before we'd accept additional hash functions into the library.

HTH, -kak

ben-manes commented 1 year ago

fwiw, I believe the biggest performance caveat is that the API design results in allocations or similar waste. That's perfectly fine for most use cases an application programmer might have, but not ideal for performance-sensitive cases like java.util.HashMap. Generally, I start with Guava's as a first pass to work out all of the other complex logic. If performance is a concern, profiling will quickly highlight whether this is the bottleneck, and if so I switch to an inline hash function. Hopefully that rule of thumb answers your question of whether Guava's hashing is a good choice for your project for now.
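An "inline hash function" in the sense above is just a plain static method over bytes that allocates nothing per call, in contrast to Guava's per-use Hasher objects. A minimal sketch using FNV-1a, chosen here only as a familiar, easy-to-inline example (not a recommendation from the thread):

```java
import java.nio.charset.StandardCharsets;

public final class InlineHash {
  private static final long FNV_OFFSET = 0xCBF29CE484222325L; // FNV-1a 64-bit offset basis
  private static final long FNV_PRIME  = 0x100000001B3L;      // FNV-1a 64-bit prime

  // Allocation-free 64-bit hash over a byte array.
  static long fnv1a64(byte[] data) {
    long h = FNV_OFFSET;
    for (byte b : data) {
      h ^= (b & 0xFF); // mix in one byte
      h *= FNV_PRIME;
    }
    return h;
  }

  public static void main(String[] args) {
    byte[] input = "node-42".getBytes(StandardCharsets.UTF_8);
    System.out.println(Long.toHexString(fnv1a64(input)));
  }
}
```

Because the whole function is a handful of `long` operations with no object creation, the JIT can inline it into a hot loop, which is exactly what the Hasher-based API makes harder.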

JohannesLichtenberger commented 1 year ago

I'll probably even switch to 64-bit hashes, I guess:

So, I'm not really sure, as hash collisions might matter in both cases, even though in the diffing case I'll still check at least the unique nodeKeys and maybe also the names/values of the nodes. But still, hmm. I've used BigInteger for the rolling-hash computations that update ancestor nodes after the real hash for a new leaf node has been computed, but BigInteger computations in a data store are super slow, so that was a rather bad idea, probably due to the allocations as well as the slow SHA-256 hashing.
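A 64-bit rolling scheme can drop BigInteger entirely: if an ancestor's hash is defined as the wrapping sum of its own content hash and its children's hashes, a changed child patches each ancestor in O(1) with plain `long` arithmetic. A minimal sketch of that idea (the aggregate-by-addition design and all names here are illustrative assumptions, not the store's actual scheme):

```java
public class RollingNodeHash {
  // Patch an ancestor's aggregate hash when one child's hash changes.
  // long addition wraps mod 2^64 and is invertible, so the old
  // contribution can be subtracted out and the new one added in,
  // with no BigInteger involved.
  static long patchAncestor(long ancestorHash, long oldChildHash, long newChildHash) {
    return ancestorHash - oldChildHash + newChildHash;
  }

  public static void main(String[] args) {
    long own = 0xAAAA_0000L, child1 = 0x1111L, child2 = 0x2222L;
    long root = own + child1 + child2;   // aggregate built once

    long child2Updated = 0x9999L;        // a node under child2 changed
    long patched = patchAncestor(root, child2, child2Updated);

    // Same result as recomputing the aggregate from scratch:
    System.out.println(patched == own + child1 + child2Updated); // true
  }
}
```

On an insert or delete, the same update walks the ancestor chain once, applying `patchAncestor` at each level, so the cost is proportional to the tree depth rather than the subtree size.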

kevinb9n commented 1 year ago

From its birth I viewed common.hashing as overkill for just balancing some in-memory hash buckets. (Though it would be one way to prevent flooding attacks...)

(I also viewed it as something we could Maybe Make Faster Later and well that never happened.)


ghost commented 1 year ago

Hello sir, I am new to open source contribution. I already know Java; my tech stack and tools include C, C++, Python, Java, JavaScript, HTML, CSS, SQL, Bootstrap, ReactJS, ExpressJS, NodeJS, and Git. I need a little help from your side to contribute to these amazing projects.

kevinb9n commented 1 year ago

Please see https://github.com/google/guava/wiki/HowToContribute

A most excellent way to help would be to find a method that we aren't unit-testing very well yet, and write a better test for it.