jermp / pthash

Fast and compact minimal perfect hash functions in C++.
MIT License
182 stars 26 forks source link

Use pthash with unordered_map #9

Closed zsnjuts closed 1 year ago

zsnjuts commented 1 year ago

This issue says that pthash has not map implementation, but It seems that we can use pthash as the hash function of std::unordered_map.

std::unordered_map<uint64_t, uint64_t, pthash_type> pt_umap(keys.size(), f);

To my surprise, pt_umap even performs worse than the default hash function. Is there a problem with the use of the above code? Here is my benchmark code:

#include <iostream>
#include <string>
#include <set>
#include <vector>
#include <algorithm>
#include <unordered_set>
#include <unordered_map>
#include <map>
#include <type_traits>
#include <cstdlib>
#include <unistd.h>
#include <thread>
#include <typeindex>
#include <functional>
#include <chrono>

#include "pthash/include/pthash.hpp"
#include "pthash/src/util.hpp"

using namespace std::chrono;

int main()
{
    /* Generate 10M random 64-bit keys as input data. */
    static const uint64_t num_keys = 10000000;
    static const uint64_t seed = 1234567890;
    std::cout << "generating input data..." << std::endl;
    std::vector<uint64_t> keys = pthash::distinct_keys<uint64_t>(num_keys, seed);
    assert(keys.size() == num_keys);

    /* Set up a build configuration. */
    pthash::build_configuration config;
    config.c = 7.0;
    config.alpha = 0.99;
    config.minimal_output = true;  // mphf
    config.verbose_output = false;

    /* Declare the PTHash function. */
    typedef pthash::single_phf<pthash::murmurhash2_64,         // base hasher
            pthash::compact_compact,  // encoder type
            true                    // minimal
    > pthash_type;
    pthash_type f;

    /* Build the function in internal memory. */
    std::cout << "building the function..." << std::endl;
    f.build_in_internal_memory(keys.begin(), keys.size(), config);

    // normal, use the default hash function
    std::unordered_map<uint64_t, uint64_t> normal;
    for (uint64_t x : keys) {
        normal[x] = x;
    }

    int collision = 0;
    for (size_t bucket = 0; bucket < normal.bucket_count(); bucket++) {
        if (normal.bucket_size(bucket) > 1) {
            collision ++;
        }
    }
    std::cout << "normal collision=" << collision << ", total=" << normal.bucket_count() << std::endl;

    // vector with pthash
    std::vector<uint64_t> pt_map(keys.size());
    for (uint64_t x : keys) {
        pt_map[f(x)] = x;
    }

    // unordered_map with pthash
    std::unordered_map<uint64_t, uint64_t, pthash_type> pt_umap(keys.size(), f);
    for(uint64_t x : keys) {
        pt_umap[x] = x;
    }

    collision = 0;
    for (size_t bucket = 0; bucket < pt_umap.bucket_count(); bucket++) {
        if (pt_umap.bucket_size(bucket) > 1) {
            collision ++;
        }
    }
    std::cout << "ptu collision=" << collision << ", total=" << pt_umap.bucket_count() << std::endl;

    // correctness
    for (int i = 0; i < keys.size(); i += 100) {
        uint64_t key = keys[i];
        assert(pt_map[f(key)] == normal[key]);
        assert(pt_umap[key] == normal[key]);
    }

    // performance
    auto start = system_clock::now();
    for (int i = 0; i < keys.size(); i += 100) {
        auto tmp = normal[keys[i]];
    }
    std::cout << "normal cost=" << duration_cast<microseconds>(system_clock::now() - start).count() << "ms" << std::endl;

    start = system_clock::now();
    for (int i = 0; i < keys.size(); i += 100) {
        auto tmp = pt_map[f(keys[i])];
    }
    std::cout << "pt cost=" << duration_cast<microseconds>(system_clock::now() - start).count() << "ms" << std::endl;

    start = system_clock::now();
    for (int i = 0; i < keys.size(); i += 100) {
        auto tmp = pt_umap[keys[i]];
    }
    std::cout << "ptu cost=" << duration_cast<microseconds>(system_clock::now() - start).count() << "ms" << std::endl;
}

The results are:

generating input data...
building the function...
normal collision=2327031, total=13169977
ptu collision=0, total=10000019
normal cost=59043ms
pt cost=28636ms
ptu cost=60522ms
jermp commented 1 year ago

You should not use a minimal perfect hash function inside a std::unordered_map. Your pt_map vector is where your data should be stored, as indexed by pthash. Please, read the issue you referenced: it is also explained there how to proceed.

You see, in fact, that pthash is 2X than a std::unordered_map with (essentially) optimal space usage: you only pay the cost for your data. You also have no collisions. The only downside is that a minimal perfect hash function is static by design.

Hope my explanation helps.

jermp commented 1 year ago

Can I close this issue or do you need further help/explanation?

jermp commented 1 year ago

I'm closing this but feel free to text me if needed.

peekxc commented 1 year ago

I think one of the issues is that pthash, like several other PMHF's, is not order-preserving.

Most PMHF's map to the range {0, 1, ..., m - 1}, but without the order-preserving property you cant really create e.g. a flat array map implementation without using an offset array, which needs at least O(mlog2(m)) bits to store