kucci / guava-libraries

Automatically exported from code.google.com/p/guava-libraries
Apache License 2.0
0 stars 0 forks source link

A generalized API for hash functions #433

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
At present Equivalence<T>.hash(T) returns int (the same as Object.hashCode()) 
and is similarly limited to 32 bits of output.

If this limit were removed then this interface would be more useful for 
specialized data structures (e.g. the proposed BloomFilter<T>.)

One possibility is to change the signature from

  int hash(T t);

to

  void hash(T t, ByteBuffer dst);

which would write as many valid hash bytes as possible to dst.  The number of 
bytes written would be the minimum of dst.remaining() and the number of bytes 
the hash function is capable of generating.

e.g. equivalence.equals().hash(x, dst) would be similar to
dst.putInt(x.hashCode()) except that it would not throw 
BufferOverflowException, instead truncating to fit in dst (whether high-order 
or low-order bytes were dropped would depend on dst's current byte ordering.)

Original issue reported on code.google.com by fin...@gmail.com on 27 Sep 2010 at 1:36

GoogleCodeExporter commented 9 years ago
We've been puzzling over how to represent hash functions in a generalized way, 
and this is a very interesting idea for that.  I think it would take the form 
of a separate HashFunction interface, but perhaps Equivalence should connect to 
that... we'll consider.

Original comment by kevinb@google.com on 28 Sep 2010 at 1:37

GoogleCodeExporter commented 9 years ago

Original comment by fry@google.com on 26 Jan 2011 at 10:26

GoogleCodeExporter commented 9 years ago

Original comment by kevinb@google.com on 27 Jan 2011 at 6:16

GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago

Original comment by kevinb@google.com on 13 Jul 2011 at 6:19

GoogleCodeExporter commented 9 years ago

Original comment by kevinb@google.com on 18 Jul 2011 at 3:53

GoogleCodeExporter commented 9 years ago
Here's an API sketch.

HashCode {
  int bits(); // asBytes().length * 8
  byte[] asBytes();
  int asInt(); // first 4 bytes little-endian
  long asLong(); // similarly. throws if < 8 bytes
}

HashFunction {
  int bits();
  Hasher newHasher();

  // shortcuts; not sure which exact set we want
  HashCode hashBytes(byte[]);
  HashCode hashChars(CharSequence);
  HashCode hashLong(long);
}

Hashing {
  HashFunction murmur3_32();
  HashFunction murmur3_32(int seed);
  HashFunction murmur3_128();
  HashFunction murmur3_128(int seed);

  // we choose a hash function for you. arbitrarily wide.
  HashFunction goodFastHash(int minimumBits);

  HashFunction md5(); // adapters of MessageDigest
  HashFunction sha1();
  HashFunction crc32(); // adapters of Checksum
}

Hasher {
  Hasher putInt(int);
  Hasher putLong(long);
  // etc. etc.

  HashCode hash();
}

Ideally, I would like to change Equivalence.hash() to return HashCode, allowing 
it to take part in all this fun.

Original comment by kevinb@google.com on 1 Aug 2011 at 9:09

GoogleCodeExporter commented 9 years ago
On Hasher I left out

  putBytes(byte[])
  putBytes(byte[], int, int)
  putChars(CharSequence) // hashes bytes directly, no encoding
  putChars(CharSequence, Charset) // unsure if we need this

Incidentally, the multibyte methods like putInt() are specified to be 
equivalent to calling putBytes() with the little-endian representation of that 
value.

(Endianness shouldn't matter to most users; for those it does it will likely be 
for compatibility with native code in which case LE is more commonly desired; 
and otherwise you can simply use Ints.toByteArray() and so forth to get the BE 
representation.)

Original comment by kevinb@google.com on 1 Aug 2011 at 9:12

GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
Kevin left out this:

interface Funnel<T> {
  void funnel(T, Sink);
}

For a List<T>, and a Funnel<T>, you would iterate the elements and pass them to 
the funnel, or if T is a fixed type, you could just extract the relevant parts 
directly. 

Please note that this only works for lists. It's just broken for Set. So when 
you have a collection, the best (/the only) option is really to just extract 
the collection.hashCode() value.

Original comment by andr...@google.com on 13 Oct 2011 at 7:34

GoogleCodeExporter commented 9 years ago
It isn't exactly that it's "broken for Set"; it's simply that *you* have to 
ensure that the collection is in a deterministic order before you feed the data 
in. So it's fine for SortedSet or for a LHS that you always populate in a 
specific way, while it's broken for some Lists that people "use like" 
Sets/Multisets.

Original comment by kevinb@google.com on 16 Oct 2011 at 3:31

GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
Right, I always forget SortedSet. About LHS, hmm, not that confident with that. 
Its order doesn't matter for its equals() or its hashCode(), but it would 
matter for HashCode, which may lead to surprizes.

I think a safe rule to advertize for when one can hash collections like that, 
is "whenever equals()/hashCode() of the collection depend on the iterator's 
ordering of the elements". With this, if two collections are equal, they will 
_always_ have the same HashCode, which is nice.

Original comment by jim.andreou on 16 Oct 2011 at 6:23

GoogleCodeExporter commented 9 years ago
I don't know; I think we define the problem away by not *having* methods that 
accept a Collection. You're responsible for feeding things in, you're 
responsible for what order you feed them in.

Original comment by kevinb@google.com on 16 Oct 2011 at 8:38

GoogleCodeExporter commented 9 years ago
Agreed, not our problem per se, since we are explicit in that "ordering 
matters". All I'm talking about is a simple rule for guidance of users who do 
face this question (the other approach is simply collection.hashCode() and 
using that). Not the sort of thing that goes in a javadoc; more like the answer 
to a future stackoverflow question.

Original comment by jim.andreou on 17 Oct 2011 at 2:50

GoogleCodeExporter commented 9 years ago
I'm marking this issue as completed since the release of common.hash.

Original comment by wasserman.louis on 20 Jan 2012 at 4:47

GoogleCodeExporter commented 9 years ago
This issue has been migrated to GitHub.

It can be found at https://github.com/google/guava/issues/<id>

Original comment by cgdecker@google.com on 1 Nov 2014 at 4:15

GoogleCodeExporter commented 9 years ago

Original comment by cgdecker@google.com on 3 Nov 2014 at 9:09