typelevel / algebra

Experimental project to lay out basic algebra type classes
https://typelevel.org/algebra/

Add Hash, fix some minor issues #38

Closed by johnynek 6 years ago

johnynek commented 9 years ago

Todo: add some hash laws.

Question: should we add LongHash (and LongLongHash)? It seems like those are the main things we use in algebird (LongLong in HyperLogLog).

avibryant commented 9 years ago

I think LongHash and StringHash are the two critical instances.

non commented 9 years ago

From my point of view, the one necessary law to encode is that eqv(a, b) implies hash(a) == hash(b).
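
As a property, it might look something like this (a sketch with assumed signatures, since neither trait is settled yet):

trait Eq[A] { def eqv(x: A, y: A): Boolean }
trait Hash[A] { def hash(a: A): Int }

// The coherence law: whenever two values are equal, their hashes must agree.
def hashLaw[A](a: A, b: A)(implicit E: Eq[A], H: Hash[A]): Boolean =
  !E.eqv(a, b) || H.hash(a) == H.hash(b)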

avibryant commented 9 years ago

This may be out of scope, but do we want to encode the idea of an independent hash family here? At least in algebird, a single hash function on its own is not that valuable.

avibryant commented 9 years ago

The laws there would be that, for a family of hashes given by hash(a, i), eqv(a, b) implies hash(a, i) == hash(b, i) for all i, and that hash(a, i) is uncorrelated with hash(a, j) for all i != j.

non commented 9 years ago

I like that idea (and in fact the idea of correlation is useful for testing a single hash function too), but defining it correctly is probably outside the scope of a single property; i.e. you'd probably need to build a histogram over a lot of calls and inspect it (and occasionally the test might fail).

Or is there a better strategy?

johnynek commented 9 years ago

how about something like:

trait HashFamily[A] extends Hash[A] {
  // the plain hash is just one fixed member of the family
  override def hash(a: A) = hash(a, 1)
  def hash(a: A, idx: Int): Int
}

I guess we want Prob(hash(a, x) == hash(a, y)) = 1.0/Int.MaxValue for x != y.

Testing that is possible (approximately) using Chebyshev inequalities.
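
A rough sketch of what that test could look like (everything here is hypothetical: a standalone two-argument hash, masked down to a small bucket count so collisions are frequent enough to measure, with a Chebyshev-style tolerance):

import scala.util.Random

// Hypothetical member-i hash: mix the index into the value's hash with a
// murmur-style finalizer.
def hash(a: Long, idx: Int): Int = {
  var h = a.## ^ (idx * 0x9e3779b9)
  h ^= h >>> 16; h *= 0x85ebca6b
  h ^= h >>> 13; h *= 0xc2b2ae35
  h ^ (h >>> 16)
}

// Mask to `buckets` outcomes (a power of two) so the target collision
// probability p = 1/buckets is observable with a feasible sample size.
def looksIndependent(trials: Int, buckets: Int, seed: Long = 42L): Boolean = {
  val rng  = new Random(seed)
  val mask = buckets - 1
  var hits = 0
  for (_ <- 1 to trials) {
    val a = rng.nextLong()
    if ((hash(a, 0) & mask) == (hash(a, 1) & mask)) hits += 1
  }
  val p      = 1.0 / buckets
  val pHat   = hits.toDouble / trials
  val stdErr = math.sqrt(p * (1 - p) / trials)
  // Chebyshev: being more than 10 standard errors out has probability <= 1%.
  math.abs(pHat - p) <= 10 * stdErr
}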

non commented 9 years ago

I like that design.

I agree that standard strategies for testing RNGs should be able to test hashing functions (and families of functions) as well.

non commented 9 years ago

(We have some tests I could port over from Spire. Relatedly, here's a chapter on testing RNGs by John Cook.)

avibryant commented 9 years ago

Feels like we might want to add size to HashFamily? But otherwise that feels right.

johnynek commented 9 years ago

@avibryant yes. I guess we really want a finite set type. Since we don't have dependent types, we'll just have the size and require that users pass something in the range.

non commented 9 years ago

I would have guessed that for many of these things any Int value would work. Is that not true? A lot of RNG functions I'm familiar with treat all the Int values as unsigned and are totally fine with any value.

avibryant commented 9 years ago

@non I guess in that case it can return Int.MaxValue as size? :) Just feels more comfortable to have a way to check rather than running into trouble by passing an arg that is out of bounds.

non commented 9 years ago

Sure, that seems fine.

non commented 9 years ago

So I presumptuously added to this PR in order to get a basic statistical test in place. What do you all think?

I'm not a statistician, so I based this test on code from here. I feel like the test is statistically valid, but I may be wrong.

Importantly, I noted in the commit message that since Hash[A] and Order[A] both extend Eq[A] (and both define an on method), we have created a problematic diamond inheritance situation. For now I've commented out Hash#on, but that's not necessarily a great solution. Here are some options that I see:

  1. Have Hash[A] stop extending Eq[A]
  2. Have Hash[A] extend Order[A]
  3. Create a Hash[A] with Order[A] super trait that most things could use.
  4. Make my commenting permanent and delete the code.

I think complex numbers mean that (2) is not great (they have equality but no natural total order). Almost all uses of hash codes rely on Eq as well, so I could imagine resistance to (1). I find (3) ugly, but maybe it is OK? (4) seems unsatisfying.
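
For what it's worth, option (3) might look roughly like this (a sketch with simplified stand-in signatures, not the real algebra traits):

// Minimal stand-ins for the traits under discussion.
trait Eq[A] { def eqv(x: A, y: A): Boolean }
trait Order[A] extends Eq[A] {
  def compare(x: A, y: A): Int
  def eqv(x: A, y: A): Boolean = compare(x, y) == 0
  def on[B](f: B => A): Order[B]
}
trait Hash[A] extends Eq[A] {
  def hash(a: A): Int
  def on[B](f: B => A): Hash[B]
}

// Option (3): a combined trait resolves the diamond, because one override
// whose result type is a subtype of both inherited result types satisfies
// both `on` signatures.
trait HashOrder[A] extends Hash[A] with Order[A] { self =>
  override def on[B](f: B => A): HashOrder[B] = new HashOrder[B] {
    def hash(b: B): Int = self.hash(f(b))
    def compare(x: B, y: B): Int = self.compare(f(x), f(y))
  }
}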

What do you all think? This is a place where Algebird has done a lot more work than Spire so I'm happy to defer to @johnynek and @avibryant here.

johnynek commented 9 years ago

Actually, we never found a great solution for hashing in algebird and tried many different ideas. Count-min sketch uses the hash family approach Avi mentioned (though without an explicit size). Other algorithms use a fixed hash and require a serialization, which is used to generate a hash on the item in question.

HyperLogLog, to count N uniques, needs something like log N + 1/sqrt(error) bits. So to count a few billion, something like 50 bits are needed. A Long would probably be enough, but since you are going to use a byte later in the algorithm, it costs nothing to use a hash of up to 256 bits, which can clearly count astronomical sets. We actually use murmur128, so again, Hash or LongHash would both be insufficient.

Hash is really something like an equivalence, so I wondered if we wanted:

/**
 * Law: eqv(a, b) implies resEq.eqv(hash(a), hash(b))
 */
trait Hash[K, R] extends Eq[K] {
  def resEq: Eq[R]
  def hash(k: K): R
}

But then, that felt like a false generality, because really hashing is more than that: it is related to a random projection of the item onto the interval [0, 1), and without that you've lost the ability to do much with it (like build hash tables, or all the nice sketch algorithms).

So that left us with either something like K => Real, or practically: K => Int, K => Long, K => (Long, Long). Since Scala specializes Tuple2 on Long, I figured the latter would be fast enough. You could build a Hash from a LongHash and a LongHash from a LongLongHash.
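
Something like this, say (hypothetical signatures, just to show the derivations):

trait Hash[A]         { def hash(a: A): Int }
trait LongHash[A]     { def hash(a: A): Long }
trait LongLongHash[A] { def hash(a: A): (Long, Long) }

// Each narrower hash folds the wider one's bits down.
def toLongHash[A](h: LongLongHash[A]): LongHash[A] = new LongHash[A] {
  def hash(a: A): Long = { val (hi, lo) = h.hash(a); hi ^ lo }
}
def toHash[A](h: LongHash[A]): Hash[A] = new Hash[A] {
  def hash(a: A): Int = { val x = h.hash(a); (x ^ (x >>> 32)).toInt }
}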

That said, I could never get much momentum behind it, and I find designs without buy-in are often bad designs, so it is still an open PR in algebird, bitrotting.

For this minimal case, I tend to think the easiest way forward would be to move the on method into the companion objects and live with it. That, or, as you suggest, make an OrderHash or something. Interestingly, we are adding just such a typeclass to scalding now, with an additional constraint: that we can serialize (and that is an equivalence), and that we can compare the serialized values and get the same result. It seems like a big typeclass, but it has several nice laws, and it is exactly what Hadoop needs (or Spark could use) for external (on-disk) partitioning of a stream by keys (for which you have an instance of this typeclass).

That's my brain dump on this.

avibryant commented 9 years ago

A further brain dump:

non commented 9 years ago

It may be too far-afield but later tonight I'm going to post some demo code involving Spire's BitString[A] type class, a new Long128 type I just wrote, and a possible GenHash[A, O] and GenHashFamily[A, O] abstraction.

avibryant commented 9 years ago

I guess Hash seems plausibly more part of spire than algebra...

johnynek commented 9 years ago

It would be nice, in my opinion, if we could get this issue fixed here, not in spire and algebird.

That said, it is not very algebraic, and only touches the classes here in that it relates to Eq.

Maybe I should replace this PR with just the bug fixes and remove hash, and then we can add a separate Hash PR.

avibryant commented 9 years ago

:thought_balloon: Maybe we should just model HashFamily[A] as Hash[(Int,A)]; that way we don't need a specific typeclass for it.
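
i.e. something like (a sketch, assuming a minimal Hash trait):

trait Hash[A] { def hash(a: A): Int }

// Member i of the family is just the pair hash with the index fixed.
def member[A](family: Hash[(Int, A)], i: Int): Hash[A] = new Hash[A] {
  def hash(a: A): Int = family.hash((i, a))
}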

johnynek commented 9 years ago

@avibryant very interesting idea

non commented 9 years ago

I wasn't trying to suggest Hash[A] should go in Spire instead -- I was just mentioning that BitString[A] ends up being a useful abstraction for trying to model generic hashes (at least in the code I was playing with). The nice thing here is that you can abstract the hashing algorithm (or family) from the data type used, as long as you have a few constants available, with things like rotation, shifting, masking, and so on.

(Also, Spire has a notion of how to generate random values for a generic type via the Dist[A] type class, which would also be useful for the kind of hash families it looks like you use in count-min sketch.)

non commented 9 years ago

Hey, so I am still playing around with a possible design, but do you think this is something we need for an initial algebra milestone? Maybe we should wait a bit and add it later? Do you think I should merge the current design?

avibryant commented 9 years ago

I'd rather we get this right, and there's no rush to get it into the first milestone (algebird wouldn't use it right away anyway).

non commented 9 years ago

Yeah, that's my thinking as well.

johnynek commented 9 years ago

+1 to not rushing.

But we should get the specialization fixes and the implicit Equiv from Eq.

non commented 9 years ago

Oh, right! I'll make another PR with those now.

non commented 8 years ago

Picking this back up, I have a concrete proposal:

1. Create a simple Hash[A] type class

The main consumers of this would be folks that want to implement hash tables, hash sets, and other structures which benefit from a single hashing function. It also enables us to provide a hash function with a good default distribution (unlike .hashCode).

For example, something like this (sans derived methods like .contramap, .hashMask, and so on):

trait Hash[@sp A] { // @sp is shorthand for @specialized
  def hash(a: A): Int
}

We could probably provide a default Hash[A] that re-hashes the result of .hashCode.
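
For instance (a sketch using the trait above, dropping @sp for brevity; the mixing constants are the usual murmur-style finalizer, not a settled choice):

// Spread out clustered inputs (e.g. consecutive Ints, whose hashCode is the
// identity) by running .hashCode through an avalanche step.
def fromHashCode[A]: Hash[A] = new Hash[A] {
  def hash(a: A): Int = {
    var h = a.##
    h ^= h >>> 16; h *= 0x85ebca6b
    h ^= h >>> 13; h *= 0xc2b2ae35
    h ^ (h >>> 16)
  }
}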

We might also choose to support re-hashing, since many structures (e.g. those in Debox) benefit from being able to re-hash on collisions. That might end up being too complex, so we can also leave it out.

Hash[A] would have laws written in terms of a related Eq[A] instance (although I don't think it should extend Eq) and would also have its own probabilistic laws in terms of the distribution of its outputs.

2. Plan to support a HashFamily[A] type class

I'd like to use a separate type class to support hash families. I'm happy to support size/cardinality, or to assume that you can provide 0 to Int.MaxValue hash functions, whichever seems better. I also imagine that you could generate a Hash[A] instance from HashFamily[A] given an Int argument.

Again, we could provide a default HashFamily[A] implementation in terms of .hashCode with re-hashing.
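
In sketch form (hypothetical shape, reusing the Hash[A] trait above; the per-index mixing is illustrative only, not vetted for quality):

trait HashFamily[A] { self =>
  def size: Int
  def hash(a: A, idx: Int): Int
  // Fixing an index projects out a single Hash[A].
  def toHash(idx: Int): Hash[A] = new Hash[A] {
    def hash(a: A): Int = self.hash(a, idx)
  }
}

// A default in terms of .hashCode, re-mixed with a distinct odd multiplier
// per index.
def defaultFamily[A]: HashFamily[A] = new HashFamily[A] {
  def size: Int = Int.MaxValue
  def hash(a: A, idx: Int): Int = {
    val h = a.## * (2 * idx + 1) * 0x9e3779b9
    h ^ (h >>> 16)
  }
}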

Considerations

I'd like to avoid making Hash[A] return anything other than a primitive (to avoid boxing). However, if we decided we wanted more bits, we could use Long instead of Int. If we really needed more bits we could also support a "larger" hashing type class, such as Hash128 or Hash256, or something similar. My sense is that HashFamily[A] might cover this case but I'm open to something that returns Array[Int] or Array[Byte] if that seems better.

What do you all think? I feel like (1) might not be controversial, in which case we can move forward with it now.

rklaehn commented 8 years ago

One remark: I do not want this to automatically fall back on scala.util.hashing.Hashing. That causes a lot of problems in cases where the default hashing makes no sense (e.g. arrays, functions). If you have an automatic fallback, you get instances where you would rather not have them!
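
(Arrays are the canonical trap: their hashCode is identity-based, so a derived instance would break the Eq law for any content-based equality:)

val a = Array(1, 2, 3)
val b = Array(1, 2, 3)
a.sameElements(b)         // true: same contents
a.hashCode == b.hashCode  // almost certainly false: identity hash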

denisrosset commented 6 years ago

Now that cats has a Hash type class, this is mostly redundant, apart from the nice statistical test for the distribution of hashes!

What should we do?

johnynek commented 6 years ago

Let's ditch it. We can upstream the statistical test if we want.