vigna / sux

Succinct data structures in C/C++
Other
82 stars 17 forks source link

RecSplit MPHF mapping #2

Closed weaversa closed 5 years ago

weaversa commented 5 years ago

I am trying to get the mapping from keys to unique indices out of a RecSplit MPHF. I created a file with 4 strings and passed it to recsplit_dump_8, creating an MPHF. I modified recsplit_load.cpp (shown below) to display the mapping. However, the mapping is not a bijection. I also tried with a million keys and could not get an MPHF.

I've only been working with the tool for a few hours today, but I can't see how I'm using the interface incorrectly. Any help would be appreciated.

Here is the output of my run:

$ git status
On branch master
Your branch is up-to-date with 'origin/master'.
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

    modified:   benchmark/function/recsplit_load.cpp

Untracked files:
  (use "git add <file>..." to include in what will be committed)

    test.keys
    test.mphf

no changes added to commit (use "git add" and/or "git commit -a")
$ make recsplit
g++ -std=c++17 -I./ -O3 -DSTATS -march=native -DLEAF=8 -DALLOC_TYPE=MALLOC benchmark/function/recsplit_dump.cpp -o bin/recsplit_dump_8
g++ -std=c++17 -I./ -O3 -DSTATS -march=native -DLEAF=8 -DALLOC_TYPE=MALLOC benchmark/function/recsplit_dump128.cpp -o bin/recsplit_dump128_8
g++ -std=c++17 -I./ -O3 -DSTATS -march=native -DLEAF=8 -DALLOC_TYPE=MALLOC benchmark/function/recsplit_load.cpp -o bin/recsplit_load_8
g++ -std=c++17 -I./ -O3 -DSTATS -march=native -DLEAF=8 -DALLOC_TYPE=MALLOC benchmark/function/recsplit_load128.cpp -o bin/recsplit_load128_8
$ cat test.keys
fish
cat
dog
horse
$ bin/recsplit_dump_8 test.keys 2 test.mphf
Building...
Elias-Fano cumul sizes:  3.000000 bits/bucket
Elias-Fano cumul bits:   5.000000 bits/bucket
Elias-Fano cumul sizes:  1.500000 bits/key
Elias-Fano cumul bits:   2.500000 bits/key
Rice-Golomb descriptors: 1.500000 bits/key
Total bits:              5.500000 bits/key
Construction time: 0.000 s, 15366 ns/key
$ bin/recsplit_load_8 test.keys test.mphf
fish -> 1
cat -> 1
1 Duplicate!!!
dog -> 1
1 Duplicate!!!
horse -> 2
0 Missing!!!
3 Missing!!!

Here is my modification torecsplit_load.cpp to print the mapping.

$ cat benchmark/function/recsplit_load.cpp 
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <fstream>
#include <iostream>
#include <random>
#include <sux/function/RecSplit.hpp>

#define SAMPLES (11)

using namespace std;
using namespace sux::function;

int main(int argc, char **argv) {
    if (argc < 3) {
        fprintf(stderr, "Usage: %s <keys> <mphf>\n", argv[0]);
        return 1;
    }

    ifstream fin(argv[1]);
    string str;
    vector<string> keys;
    while (getline(fin, str)) keys.push_back(str);
    fin.close();

    fstream fs;
    RecSplit<LEAF, ALLOC_TYPE> rs;

    fs.exceptions(fstream::failbit | fstream::badbit);
    fs.open(argv[2], std::fstream::in | std::fstream::binary);
    fs >> rs;
    fs.close();

    uint8_t *seen = (uint8_t *)calloc(keys.size(), sizeof(uint8_t));

    for (int k = 0; k < keys.size(); k++) {
      uint64_t h = rs(keys[k]);
      printf("%s -> %lu\n", keys[k].c_str(), h);
      if(seen[h] == 1) {
        printf("%lu Duplicate!!!\n", h);
      } else {
        seen[h] = 1;
      }
    }

    for (int k = 0; k < keys.size(); k++) {
      if(seen[k] == 0) {
        printf("%u Missing!!!\n", k);
      }
    }

    return 0;
}
vigna commented 5 years ago

You're right. We have working tests, so my guess is that serializing such a small map triggers some bug in the serialization process: if you build the map in the code, it works, and if you try, with, like, 200 strings it works, too. The problem is with serializing a very small number of keys. I suspect some off-by-one.

Thanks for the bug report. The workaround for the time being is just using more keys :).

vigna commented 5 years ago

Sorry—it's bullshit. Much easier. The RecSplit constructor which takes a file pointer (and which the dump tool uses) was implemented erroneously using the C getline(), which leaves the delimiter (e.g., \n) in the string. You are correctly reading the strings without the ending newline—hence the problem.

I'll fix this ASAP. In all our experiments we use dump128 and load128, so nobody every noticed this.

vigna commented 5 years ago

Fixed in 0dc722298fe1809f9693cfb9b541dd9c02c8d762.

weaversa commented 5 years ago

Thank you