gvinciguerra / PGM-index

🏅State-of-the-art learned data structure that enables fast lookup, predecessor, range searches and updates in arrays of billions of items using orders of magnitude less space than traditional indexes
https://pgm.di.unipi.it/
Apache License 2.0
775 stars 91 forks source link

Add support for a vector of int as a key #27

Closed authwork closed 3 years ago

authwork commented 3 years ago

Hello! @gvinciguerra With the growing big data, sometimes we may need a vector of integers as the key to index some values. Currectly support is to use a hash function to combine such a vector of integer into a uint128_t number.

Is there any possible to extend the build-in ML model (f(x) = kx + b). For example, [x1, x2, x3, ..., xn], build a ML model f(x1, x2, x3, ..., xn) = c0 + c1x1 + c2x2 + ... + cnxn?

In this way, it could become more general. Currently, I use a multiple-level PGM index (#26) to implement this, but seems much more overhead (size, time, etc) than the single-level.

Many thanks.

gvinciguerra commented 3 years ago

What kind of queries do you need to support on the vector of integers? Orthogonal range queries? If so take a look at the MultidimensionalPGMIndex. If you only need to support lookups, then a hash table may be more appropriate.

authwork commented 3 years ago

Currently, I have some key-value pairs (uint160_t (32*5), value), how to use it as the key in dynamic PGM-index?

gvinciguerra commented 3 years ago

I will push some code to support the operation that you ask later this year 😉

authwork commented 3 years ago

@gvinciguerra

If I want to use MultidimensionalPGMIndex, how can I set it as KV pair as you show me at PGMIndex? And how to dynamically change it (add/delete).

struct KKKVVV{
    std::tuple<uint32_t, uint32_t, uint32_t, uint32_t, uint32_t> pk;
    VVV pv;
    operator std::tuple<uint32_t, uint32_t, uint32_t, uint32_t, uint32_t>() const {return pk; }
};
typedef pgm::MultidimensionalPGMIndex<5, uint32_t, 32> u32MPGM;
void multi_pgm_test(std::vector<KKK> key, std::vector<VVV> value){
    int number = key.size();
    std::vector<KKKVVV> data(number);
    struct timeval start, end;
    gettimeofday(&start, NULL);
    for(int i=0; i<number; i++){
        std::get<0>(data[i].pk) = key[i].a >> 32;
        std::get<1>(data[i].pk) = key[i].a & 0xFFFFFFFF;
        std::get<2>(data[i].pk) = key[i].b >> 32;
        std::get<3>(data[i].pk) = key[i].c >> 32;
        std::get<4>(data[i].pk) = key[i].d >> 32;
        data[i].pv = value[i];
    }
    u32MPGM pgm_index(data.begin(), data.end());
    for(int i=0; i<number; i++){
        for (auto it = pgm_index.range(data[i].pk, data[i].pk); it != pgm_index.end(); ++it){
            VVV b = it->pv;
            if (b.a != value[i].a || b.b != value[i].b || b.c != value[i].c) {
                printf("failed\n");
                return;
            }
        }
    }
    gettimeofday(&end, NULL);
    int time_len = 1000000 * (end.tv_sec - start.tv_sec)
            + (end.tv_usec - start.tv_usec);
    std::cout << "It costs " << time_len << std::endl;
    uint64_t size = pgm_index.size_in_bytes();
    printf("pgm size is %lld\n", size);
}
gvinciguerra commented 3 years ago

There is no support for this yet. Ideally, one could combine the design/implementation of DynamicPGMIndex with the one of MultidimensionalPGMIndex, with the proper extensions.

authwork commented 3 years ago

There is no support for this yet. Ideally, one could combine the design/implementation of DynamicPGMIndex with the one of MultidimensionalPGMIndex, with the proper extensions.

@gvinciguerra How about the first problem? It seems MultidimensionalPGMIndex only accepts for std::tuple data. What should I do to assign a structure data (KKKVVV)? The code I show before will give following error log:

/usr/include/c++/9/tuple:1277:29: error: incomplete type ‘std::tuple_size<const KKKVVV>’ used in nested name specifier
 1277 |     inline constexpr size_t tuple_size_v = tuple_size<_Tp>::value;

PGM-index/include/pgm/pgm_index_variants.hpp:917:37: error: no matching function for call to ‘pgm::MultidimensionalPGMIndex<5, unsigned int, 32>::encode(const KKKVVV&)’
  917 |             data.emplace_back(encode(x));
gvinciguerra commented 3 years ago

Does replacing this line

https://github.com/gvinciguerra/PGM-index/blob/4cdd5de6a941de4653da04e8cecd299e62b47788/include/pgm/pgm_index_variants.hpp#L910

with

        std::for_each(first, last, [&](const value_type &x) {

work for you?

authwork commented 3 years ago

@gvinciguerra After this change, it shows:

pgm.cpp:169:16: error: ‘const value_type’ {aka ‘const class std::tuple<unsigned int, unsigned int, unsigned int, unsigned int, unsigned int>’} has no member named ‘pv’
  169 |    VVV b = it->pv;

This happens at query the value from the structure KKKVVV.

for (auto it = pgm_index.range(data[i].pk, data[i].pk); it != pgm_index.end(); ++it){
    VVV b = it->pv;
    if (b.a != value[i].a || b.b != value[i].b || b.c != value[i].c) {
        printf("failed\n");
        return;
    }
}
authwork commented 3 years ago

@gvinciguerra

Another try is to convert the five integer into one large double value. But double value is very hard to get equal when searching, can the user define this equal condition, said abs(a-b) < 1e-30?

gvinciguerra commented 3 years ago

Another try is to convert the five integer into one large double value.

I do not recommend this: with a double you cannot express all the configurations of values of the five uint32_ts.

Anyway, as I said, I'll try to add the support for longer keys later this year ;)

authwork commented 3 years ago

@gvinciguerra Many thanks for your effort. How about this MultidimensionalPGMIndex issue?

@gvinciguerra After this change, it shows:

pgm.cpp:169:16: error: ‘const value_type’ {aka ‘const class std::tuple<unsigned int, unsigned int, unsigned int, unsigned int, unsigned int>’} has no member named ‘pv’
  169 |    VVV b = it->pv;

This happens at query the value from the structure KKKVVV.

for (auto it = pgm_index.range(data[i].pk, data[i].pk); it != pgm_index.end(); ++it){
  VVV b = it->pv;
  if (b.a != value[i].a || b.b != value[i].b || b.c != value[i].c) {
      printf("failed\n");
      return;
  }
}
gvinciguerra commented 3 years ago

Indeed the MultidimensionalPGMIndex act as a container for tuples, and in your example it stores only the pk member of struct KKKVVV. You should be able to modify the MultidimensionalPGMIndex class to store also pv alongside the tuple.