gvinciguerra / PGM-index

🏅State-of-the-art learned data structure that enables fast lookup, predecessor, range searches and updates in arrays of billions of items using orders of magnitude less space than traditional indexes
https://pgm.di.unipi.it/
Apache License 2.0
784 stars 92 forks source link

multi-level PGM-index and its size #26

Closed authwork closed 3 years ago

authwork commented 3 years ago

@gvinciguerra Hello, I have done a experiment to build multi-level PGM-index. Why? I use it to support very large integer, like a 256-bit long integer. Assume I have a uint256_t num, I set num[0] is a uint128_t with high 128 bits of num, num[1] is also a uint128_t with low 128 bits of num. To index with PGM-index

typedef PGM-index<uint128_t, value> type1;
typedef PGM-index<uint128_t, type1 *> type2;

for insert. I will first insert a type1 variable into type2, then insert the real value to the corresponding type1, such as:

type2 index2nd;
auto pp = index2nd.find(num[0]);
if(pp == index2nd.end()){
    type1 *pp2 = new type1();
    index2nd.insert_or_assign(num[0], pp2);
    pp2->insert_or_assign(num[1], value);
}
else{
    pp.second->insert_or_assign(num[1], value);
}

To calculate the size of the whole index:

uint64_t size = index2nd.size_in_bytes();
for (auto &e : index2nd) {
      size += e.second->size_in_bytes();
}

Does size correctly calculate the memory size of index2nd?

gvinciguerra commented 3 years ago

Yep, the computation of overall memory (data + index) seems correct.

authwork commented 3 years ago

@gvinciguerra I found more surprising things: I test the same program on different machines, but get totally different results by using PGM-index. Using c++ 9, the index size is 205532009 bytes. Using c++ 7, the index size is 309379385 bytes.

#include <iostream>
#include <sys/random.h>
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <string>
#include <time.h>
#include <vector>
#include <iostream>
#include <algorithm>
#include <unordered_map>
#include "pgm/pgm_index.hpp"
#include "pgm/pgm_index_dynamic.hpp"
#include "pgm/pgm_index_variants.hpp"

struct T{
    uint64_t a, b, c;
};
typedef pgm::DynamicPGMIndex<uint64_t, T> u64L;

int main(){
    int Max_J = 0x8F, Max_i = 0xFFFF;
    uint64_t key = 0;
    struct timeval start, end;
    int checkRan = 1;
    long time_len;
    gettimeofday(&start, NULL);
    pgm::DynamicPGMIndex<uint64_t, u64L*> d_index;
    for (uint64_t j = 0; j < Max_J; j++) {
        auto iter = d_index.find(j);
        for (uint64_t i = 0; i < Max_i; i++) {
            T a;
            a.a = i;
            a.b = 2 * i;
            a.c = 3 * i;
            if (iter == d_index.end()) {
                u64L *p = new u64L();
                d_index.insert_or_assign(j, p);
            }
            iter = d_index.find(j);
            key = std::rand();
            if(checkRan){
                key = i;
            }
            iter->second->insert_or_assign(key, a);
        }
    }
    gettimeofday(&end, NULL);
    time_len = 1000000 * (end.tv_sec - start.tv_sec)
            + (end.tv_usec - start.tv_usec);
    std::cout << "It costs " << time_len << std::endl;
    start = end;
    if (checkRan) {
        for (uint64_t j = 0; j < Max_J; j++) {
            auto iter = d_index.find(j);
            for (uint64_t i = 0; i < Max_i; i++) {
                T b = iter->second->find(i)->second;
                if (b.a != i || b.b != 2 * i || b.c != 3 * i) {
                    printf("failed\n");
                    return -1;
                }
            }
        }
    }
    uint64_t table_size = d_index.size_in_bytes();
    for (auto &e : d_index) {
        table_size += e.second->size_in_bytes();
    }
    printf("FP table second index is %lld\n", table_size);
    return 0;
}
gvinciguerra commented 3 years ago

Could you check if you are using the master branch version of the PGM on both machines?

authwork commented 3 years ago

Yes, they are under the same version @ e2e82d77c50130c11941479301b9f835d2db1bb9.

Oh, it seems there are new updates.:)

@gvinciguerra After update to the last version, both of them become 309379385 bytes.

gvinciguerra commented 3 years ago

Thanks for checking.

It's strange, under gcc version 9.3.1 and commit e2e82d7 I get 309 MB as well. Indeed, the overall space of the u64L instances should be at least Max_J * Max_i * (sizeof(uint64_t) + sizeof(T)) = 0x8F * 0xFFFF * (8 + 24) = 300 MB.

Could you check the output of gcc -v?

authwork commented 3 years ago

It is gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)

I should backup previous code, when I reset the head to commit e2e82d7, I found I cannot reproduce the result of 205M, although I used to get this result for multiple times in the past.