twitter / algebird

Abstract Algebra for Scala
https://twitter.github.io/algebird
Apache License 2.0
2.29k stars 345 forks source link

CountMinSketch[K] assumes working equals method on K. #497

Open johnynek opened 8 years ago

johnynek commented 8 years ago

For small count-min sketches, you create CMSItem:

https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/CountMinSketch.scala#L467

But to get the frequency of K we use ==: https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/CountMinSketch.scala#L479

if you have CMSItem[Array[Byte]] as happens in scalding for sketches of tiny key spaces (which might come up in unit tests or toy examples), this means we always reurn a frequency of 0 for CMSItem[Array[Byte]].frequency unless you happen to pass the same exact array in.

trying to bite my tongue about the design flaw in Java that equals is eq for Arrays.

johnynek commented 8 years ago

A possible solution here is to add an def equiv(a: K, b: K): Boolean method to CMSHasher[K] and then implement this for Array[Byte].

PS: @non this is why I think any Hashing typeclass should have equivalence: you often need it, and it seems impossible to state any law on the typeclass without it (well, you could make probabilisitic statements about collisions).

johnynek commented 8 years ago

Also, the SparseCMS does not seem fixable at all with Array[T] keys. Since we are actively using it for Array[Byte] in scalding, and I guess CMSHasherByteArray: is broken if we allow the SparseCMS. https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/CountMinSketch.scala#L1353

We could fix this by making the table have the key which is the hash, and then check collisions using the equivalence in the CMSHasher, or we could remove the sparse business.

ianoc commented 8 years ago

We internally use structures often where having a graceful degradation from exact down to approximate is nice. (TSAR uses these heavily), so it sort of would be nice to keep it if we could. Requiring the Hasher type class to provide the equality also i think would be most preferable if not too painful. (We'd need to be careful of course about using any built in maps or other scala stuff... which is annoying)

isnotinvain commented 8 years ago

This is old, but, here is the ugly wrapper class I created in scalding to make arrays have efficient orderings and hash codes:

https://github.com/twitter/scalding/blob/develop/scalding-core/src/main/scala/com/twitter/scalding/typed/HashEqualsArrayWrapper.scala

The problem is you have to make sure to wrap your arrays in these wrappers.

johnynek commented 8 years ago

that may be worth adding. The core issue is unsolved: we are using equality without communicating in the types that we are doing so. We need something like https://github.com/non/algebra/blob/master/core/src/main/scala/algebra/Eq.scala#L12

ianoc commented 8 years ago

When/if should we take the algebra dep into algebird I wonder, that was the original intent right? Or should we start going for more Equiv typeclass usage and less =='s anywhere?

johnynek commented 8 years ago

Good question. I hoped soon, then this cats/algebra conflict came up:

https://github.com/non/algebra/issues/142

The problem with Equiv is that there is a default for all types falling back to ==. So it is not much safer.

isnotinvain commented 8 years ago

Yeah unless we get away from use of == then users (and library implementors) just have to take care to wrap types when they have a bad ==, no way around that really.

Using an Equiv makes sense to me, maybe the migration path could be to require an Equiv wherever we use ==, and if one isn't provided, fall back to ==, then deprecate the fall back path... etc

johnynek commented 8 years ago

Scala provides a fallback Equiv (==) in the companion object already so that approach won't be much safer.

On Thu, Jan 21, 2016 at 3:38 PM, Alex Levenson notifications@github.com wrote:

Yeah unless we get away from use of == then users (and library implementors) just have to take care to wrap types when they have a bad ==, no way around that really.

Using an Equiv makes sense to me, maybe the migration path could be to require an Equiv wherever we use ==, and if one isn't provided, fall back to ==, then deprecate the fall back path... etc

— Reply to this email directly or view it on GitHub https://github.com/twitter/algebird/issues/497#issuecomment-173772130.

P. Oscar Boykin, Ph.D. | http://twitter.com/posco | http://pobox.com/~boykin

isnotinvain commented 8 years ago

Ok, well, what I said but replace scala's Equiv with something similar that has no fallback on == then?

sritchie commented 7 years ago

@isnotinvain we can definitely pick this up now that we have the algebra dep!