gvinciguerra / PGM-index

🏅State-of-the-art learned data structure that enables fast lookup, predecessor, range searches and updates in arrays of billions of items using orders of magnitude less space than traditional indexes
https://pgm.di.unipi.it/
Apache License 2.0
784 stars 92 forks source link

Scalability issue #24

Closed authwork closed 3 years ago

authwork commented 3 years ago

Hello,@gvinciguerra. I have done a comparison betweem PGM-index with unordered_map (hash map) by using

pgm::DynamicPGMIndex<__uint128_t, A> d_index; // A is a structure with multiple uint64_t
for(__uint128_t i=0; i<0xFFFFFFFFFFFFFFFF; i++){
    A a;
    a.a = (uint64_t)i;
    a.b = (uint64_t)2*i;
    a.c = (uint64_t)3*i;
    d_index.insert_or_assign(i, a);
}
A b;
A *c = &b;
for(__uint128_t i=0; i<0xFFFFFFFFFFFFFFFF; i++){
    auto iter = d_index.find(i);
    *c = iter->second;
    std::cout << b.a << " " << b.b << " " << b.c << std::endl;
}

The memory consumption of DynamicPGMIndex can achieve 98% and it will cause:

terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted

while the case using unordered_map only has very low memory consumption. Theoretical speaking, Hash map should use much more memory (pointers in linked list, RB tree etc) Why this experiment show DynamicPGMIndex uses more memory? And how we solve it?

Another issue is mentioned in #17 How to achieve dynamic insertable set by using PGM index?

Many thanks

gvinciguerra commented 3 years ago

Hi @authwork. Your example requires at least sizeof(__uint128_t) 0xFFFFFFFFFFFFFFFF = 128 (2^64-1) bits, i.e. 295.1 exabytes of main memory. Are you sure it's not a hardware problem?

authwork commented 3 years ago

@gvinciguerra But hash map works fine tile now, it is running. DynamicPGMIndex crashes much faster than hash map. (i.e., the increase of memory consumption)

In this experiment, I just want to evaluate the run time memory consumption of DynamicPGMIndex and the c++ hash map. So I set a very large data (to evaluate the scalability). I understand it needs a very huge memory space. Does it suggests unordered_map could swap the memory data into disk?

gvinciguerra commented 3 years ago

If it crashes much faster it might be because it processes the inserts faster, and thus it reaches the memory limit of your machine in a shorter period of time.

The swapping of memory depends on (the settings of) the operating system you are using. The crash is likely to happen when the program tries to malloc a space that is greater than the remaining space of the virtual memory (i.e. main memory + swap space).

authwork commented 3 years ago

It sounds. I will print the key to see when it crashes and do some comparison.

authwork commented 3 years ago

@gvinciguerra Does PGM-index support GC? Is there any suggestions for memory management of using PGM-index. Hash map indeed is very slow for inserting many data.

=========================================== For example, VIRT become 2.5T, while RES is only 250G and cause allocate issue. under the same case with unordered_map, both VIRT and RSE are around 200G and finish the execution.

gvinciguerra commented 3 years ago

This is a C++ question, have a look for example here. I don't think you need this form of memory management. The structure takes responsibility for allocating and deallocating the memory it needs.