jwerle / murmurhash.c

MurmurHash3 general hash bashed lookup function implementation
MIT License
80 stars 28 forks source link

Murmur hash function test results are not as expected ? #4

Open BraveLandLin opened 1 year ago

BraveLandLin commented 1 year ago

What OS are you using (uname -a, or Windows version)?

Linux hzscn008 5.10.27-051027-generic #202103310028 SMP Thu Apr 1 02:16:48 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

What programming language are you using (C/C++/Go/Rust)?

C++

g++ --version g++ (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 Copyright (C) 2019 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

What did you expect to see and what you saw instead?

I've tested your murmurhash.c, which implements the Murmur Hash algorithm. My testing approach involves converting IPv4 addresses to 32-bit integers. After applying the hash function and modulo operator, I place them into separate buckets. Subsequently, I perform statistical analysis on the bucket counts and check for deviations. Here's my test code,save it as test_murmur.cpp,

#include <cstdio>
#include <cstdlib>
#include <cstring>
#include "murmurhash.h"
#include <sys/socket.h>
#include <arpa/inet.h>
#include <map>
#include <iostream>
#include <numeric>
#include <algorithm>

int main (int argc, char **argv) 
{
  if (argc != 2)
  {
    printf("Usage: %s buckets_num\n", argv[0]);
    return 1;
  }
  int port_num = atoi(argv[1]);
  struct in_addr netAddr;
  const char *ip_address_str = "192.168.8.1" ;
  if (inet_pton(AF_INET, ip_address_str, &netAddr) <= 0) {
      perror("inet_pton");
      return 1;
  }
  uint32_t start = ntohl(netAddr.s_addr);
  uint32_t hash_value = 0;
  char *ip = NULL;
  std::map<uint32_t, uint32_t> myMap;
  uint32_t port = 0;

  for (uint32_t i = start; i< start+100000;i++)
  {
    hash_value = murmurhash((const char *)&i, sizeof(i), 0x8edd);
    port = hash_value % port_num;
    myMap[port]++;
  }

  //get max
  auto it_max = std::max_element(myMap.begin(), myMap.end(), [](const auto& lhs, const auto& rhs) {
      return lhs.second < rhs.second;
  });

  //get min
  auto it_min = std::min_element(myMap.begin(), myMap.end(), [](const auto& lhs, const auto& rhs) {
      return lhs.second < rhs.second;
  });
  // get total
  int sum = std::accumulate(myMap.begin(), myMap.end(), 0, [](int acc, const auto& p) {
      return acc + p.second;
  });

  size_t size = myMap.size();

  // get average
  double average = sum / size;

  std::cout <<"positive deviation: " << (it_max->second- average)/average*100 << "%" <<std::endl;
  std::cout <<"negative deviation: " << (it_min->second- average)/average*100 << "%" <<std::endl;

  return 0;
}

To compile:

g++ -o test_murmur test_murmur.cpp  murmurhash.c  -std=c++17

To run:

./test_murmur 512   # (buckets number)

Since the Murmur algorithm has passed chi-squared and avalanche tests, I initially assumed that the counts of each bucket would be almost the same, resulting in a deviation close to zero. However, as the total number of buckets increases, the deviation becomes significantly larger. For instance, when the number of buckets is set to 512, the deviation is as follows:

Positive deviation: 18.9744% Negative deviation: -21.0256%

Do you have any insights on my test results? I don't believe the deviation is acceptable. What do you think? Thanks in advance .

jwerle commented 1 year ago

hi @BraveLandLin have you run this test against other implementations? I am curious to know the results.

hikerlin commented 1 year ago

I certainly did. For example, I tested the implementation on 'https://en.wikipedia.org/wiki/MurmurHash,' and they all produced similar results. Therefore, the deviation may not be due to your implementation; it could be inherent to the algorithm itself. I'm just curious as to why the algorithm exhibits such a significant deviation