sangupta / murmur

Pure Java implementations of Murmur hash algorithms
Apache License 2.0

Why do Murmur3 hashes differ from Guava hashes? #3

Closed jcornaz closed 5 years ago

jcornaz commented 6 years ago

Hello,

My understanding is that Murmur3.hash_x64_128 (from this project) should return the same result, byte for byte, as Hashing.murmur3_128().hashBytes (from the Guava library).

But it doesn't. Is the difference expected, and if so, why?

Here is my code, in case I made an obvious mistake you could point out. (The code is written in Kotlin, but it should be easy to follow.)

import com.google.common.hash.Hashing
import com.sangupta.murmur.Murmur3
import org.junit.Test
import java.nio.ByteBuffer
import java.util.*
import kotlin.test.assertTrue

private const val SEED = 42

class Murmur3Test {

  @Test
  fun murmur32shouldCorrespondToGuavaHashes() {
    val guava = Hashing.murmur3_32(SEED)
    repeat(1000) {
      val data = UUID.randomUUID().toByteArray()

      val guavaResult = guava.hashBytes(data).asBytes()
      val murmurResult = Murmur3.hash_x86_32(data, 16, SEED.toLong()).asBytes()

      assertTrue(Arrays.equals(guavaResult, murmurResult))
    }
  }

  @Test
  fun murmur128shouldCorrespondToGuavaHashes() {
    val guava = Hashing.murmur3_128(SEED)
    repeat(1000) {
      val data = UUID.randomUUID().toByteArray()

      val guavaResult = guava.hashBytes(data).asBytes()
      val murmurResult = Murmur3.hash_x64_128(data, 16, SEED.toLong()).asBytes()

      assertTrue(Arrays.equals(guavaResult, murmurResult))
    }
  }
}

fun UUID.toByteArray(): ByteArray {
  val buffer = ByteBuffer.allocate(16)

  buffer.putLong(mostSignificantBits)
  buffer.putLong(leastSignificantBits)

  return buffer.array()
}

fun LongArray.asBytes(): ByteArray {
  val buffer = ByteBuffer.allocate(size * 8)

  forEach { buffer.putLong(it) }

  return buffer.array()
}

fun Long.asBytes(): ByteArray {
  val buffer = ByteBuffer.allocate(8)

  buffer.putLong(this)

  return buffer.array()
}

sangupta commented 5 years ago

Hi @jcornaz - I wrote this library against hashes generated by the C++ code and confirmed that they were the same. That was quite a long time ago, so I will need some time to debug this issue. My apologies for noticing it this late.

sangupta commented 5 years ago

@jcornaz

I just added MurmurGuavaTest to test this. The generated hashes are the same; it's the endianness of the result that makes them look different.

I will probably add a converter to make the output equivalent to Guava's. I also checked, and the C and Java code use different endianness.
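To illustrate the point about endianness, here is a minimal sketch using only the JDK. It serialises one long value in both byte orders. The class name EndianDemo and the sample value are placeholders chosen for illustration; neither comes from this library or from Guava.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

// Hypothetical helper, not part of murmur or Guava.
final class EndianDemo {

    // Most-significant byte first (this is ByteBuffer's default order).
    static byte[] toBigEndian(long value) {
        return ByteBuffer.allocate(Long.BYTES)
                .order(ByteOrder.BIG_ENDIAN)
                .putLong(value)
                .array();
    }

    // Least-significant byte first.
    static byte[] toLittleEndian(long value) {
        return ByteBuffer.allocate(Long.BYTES)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putLong(value)
                .array();
    }

    public static void main(String[] args) {
        long sample = 0x0102030405060708L; // placeholder value, not a real hash
        System.out.println(Arrays.toString(toBigEndian(sample)));    // [1, 2, 3, 4, 5, 6, 7, 8]
        System.out.println(Arrays.toString(toLittleEndian(sample))); // [8, 7, 6, 5, 4, 3, 2, 1]
    }
}
```

The two arrays hold the same value with the bytes in opposite order, so a byte-wise comparison such as Arrays.equals fails even though the underlying long is identical.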

jcornaz commented 5 years ago

The generated hashes are the same; it's the endianness of the result that makes them look different.

OK, that makes sense.

Thanks for your investigation ;-)

I'll let you decide whether to close this issue or rename it.

sangupta commented 5 years ago

@jcornaz

Compared as long values, the hashes from Guava and Murmur are identical. I have added documentation on how to convert a long to byte[] in both big-endian and little-endian format (see 56545af0a465aa0e57a39175f541150d1ef40d28).
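For readers who want byte arrays rather than long values, the conversion might look like the sketch below. It rests on two assumptions. First, that Guava's asBytes() emits the hash least-significant byte first, as the discussion above suggests. Second, that the 32-bit hash sits in the low 32 bits of the returned long, as the test code in the first comment implies. The class and method names are hypothetical and are not the converter documented in the referenced commit.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper, not part of murmur or Guava.
final class HashBytes {

    // Writes each 64-bit word least-significant byte first, keeping the word order.
    static byte[] toLittleEndianBytes(long[] words) {
        ByteBuffer buffer = ByteBuffer.allocate(words.length * Long.BYTES)
                .order(ByteOrder.LITTLE_ENDIAN);
        for (long word : words) {
            buffer.putLong(word);
        }
        return buffer.array();
    }

    // Keeps only the low 32 bits of a hash carried in a long, as 4 little-endian bytes.
    static byte[] lowIntToLittleEndianBytes(long hash) {
        return ByteBuffer.allocate(Integer.BYTES)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putInt((int) hash)
                .array();
    }
}
```

Note that the Kotlin test above serialises the 32-bit result through Long.asBytes(), which yields 8 bytes. Under the assumptions stated here, a byte-wise comparison with a 4-byte result would fail on length alone, regardless of byte order.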