gvinciguerra / PGM-index

🏅State-of-the-art learned data structure that enables fast lookup, predecessor, range searches and updates in arrays of billions of items using orders of magnitude less space than traditional indexes
https://pgm.di.unipi.it/
Apache License 2.0

Still, a few keys cannot be found if pgm-index is built in single-thread #45

Closed · min-guk closed this issue 1 year ago

min-guk commented 1 year ago

Thank you for resolving the issue I reported previously. Most of the problems I raised last time appear to have been fixed by the latest commit. (https://github.com/gvinciguerra/PGM-index/issues/44)

As a result, missed keys now occur much less frequently. However, there are still a few cases where the index cannot find a key.

Could you check the test code below? It is mostly the same as the code I posted in the previous issue, except that I've modified the 4th test to search for every key in the dataset.

Thanks!

  1. Download PGM-Index and eth dataset.

    git clone https://github.com/gvinciguerra/PGM-index.git
    cd PGM-index
    wget --no-check-certificate -nc https://www.cse.cuhk.edu.hk/mlsys/gre/eth
  2. Modify the code to build the PGM-Index in a single thread.

    // file: PGM-index/include/pgm/pgm_index.hpp
    
    template<typename RandomIt>
    static void build(RandomIt first, RandomIt last,
                      size_t epsilon, size_t epsilon_recursive,
                      std::vector<Segment> &segments,
                      std::vector<size_t> &levels_offsets) {
        ...
        auto build_level = [&](auto epsilon, auto in_fun, auto out_fun) {
            // auto n_segments = internal::make_segmentation_par(last_n, epsilon, in_fun, out_fun);
            auto n_segments = internal::make_segmentation(last_n, epsilon, in_fun, out_fun);
       ...
  3. Change the test code

    
    #include "catch.hpp"
    #include "pgm/morton_nd.hpp"
    #include "pgm/pgm_index.hpp"
    #include "pgm/pgm_index_dynamic.hpp"
    #include "pgm/pgm_index_variants.hpp"
    #include "pgm/piecewise_linear_model.hpp"
    #include "utils.hpp"

    // Standard-library includes needed by the test code below (the original
    // #include lines were garbled by the issue formatting):
    #include <algorithm>
    #include <cstdint>
    #include <cstdlib>
    #include <fstream>
    #include <functional>
    #include <iostream>
    #include <limits>
    #include <random>
    #include <string>
    #include <vector>
    // Loads values from binary file into vector.
    static std::vector<uint64_t> load_data(const std::string &filename, bool print = true) {
        std::vector<uint64_t> data;
        std::ifstream in(filename, std::ios::binary);
        if (!in.is_open()) {
            std::cerr << "unable to open " << filename << std::endl;
            exit(EXIT_FAILURE);
        }
        // Read size.
        uint64_t size;
        in.read(reinterpret_cast<char*>(&size), sizeof(uint64_t));
        data.resize(size);
        // Read values.
        in.read(reinterpret_cast<char*>(data.data()), size * sizeof(uint64_t));
        in.close();
        return data;
    }

    template <typename Index, typename Data>
    void test_index(const Index &index, const Data &data) {
        auto rand = std::bind(std::uniform_int_distribution<size_t>(0, data.size() - 1), std::mt19937{42});

        for (auto i = 1; i < data.size(); ++i) {
            auto q = data[rand()];
            auto range = index.search(q);
            auto lo = data.begin() + range.lo;
            auto hi = data.begin() + range.hi;
            auto k = std::lower_bound(lo, hi, q);
            REQUIRE(*k == q);
        }

        // Test elements outside range
        auto q = data.back() + 42;
        auto range = index.search(q);
        auto lo = data.begin() + range.lo;
        auto hi = data.begin() + range.hi;
        REQUIRE(std::lower_bound(lo, hi, q) == data.end());

        q = std::numeric_limits<typename Data::value_type>::min();
        range = index.search(q);
        lo = data.begin() + range.lo;
        hi = data.begin() + range.hi;
        REQUIRE(std::lower_bound(lo, hi, q) == data.begin());
    }

    TEMPLATE_TEST_CASE_SIG("PGM-index", "", ((typename T, size_t E1, size_t E2), T, E1, E2),
                           (uint64_t, 512, 4)) {
        auto data = load_data("./eth");
        pgm::PGMIndex<T, E1, E2> index(data.begin(), data.end());
        test_index(index, data);
    }

  4. Build & Run

    cmake . -DCMAKE_BUILD_TYPE=Release
    make -j8
    ./test/tests


  5. Result

    tests is a Catch v2.13.10 host application.
    Run with -? for options

    -------------------------------------------------------------------------------
    PGM-index - uint64_t, 512, 4
    -------------------------------------------------------------------------------
    /PGM-index/test/tests.cpp:70
    ...............................................................................

    /PGM-index/test/tests.cpp:53: FAILED:
      REQUIRE( *k == q )
    with expansion:
      119995231 (0x726fb5f)
      ==
      119995229 (0x726fb5d)

    ===============================================================================
    test cases:       1 |       0 passed | 1 failed
    assertions: 1618731 | 1618730 passed | 1 failed
min-guk commented 1 year ago

In my experience running multiple benchmarks, issues seem to arise when the dataset is too easy or when the maximum error is extremely large. The problems appear to stem from floating-point rounding when a single segment covers a large number of keys and its slope value becomes small.

I also referred to https://github.com/gvinciguerra/PGM-index/issues/30, and it appears that even when using fesetround(FE_DOWNWARD); there are still cases where keys cannot be found for some epsilon values.

Is there a way to prevent this issue? Thank you.

gvinciguerra commented 1 year ago

There is just one segment out of 145 in the pgm::PGMIndex<uint64_t, 512, 4, float> for that eth dataset whose slope requires more precision than float offers. A quick fix here would be to change the 4th template argument to double, or simply to keep track of the increased error (just +2).

min-guk commented 1 year ago

In this case, it would be good to solve the problem by using the double data type. I really appreciate your kind responses to my questions every time. 😄

gvinciguerra commented 1 year ago

> In this case, it would be good to solve the problem by using the double data type.

Yes, it's the simplest option since it seems to happen just in this particular dataset + epsilon + segment. (A more involved solution would be to split/stop the segment when it needs more precision than the given floating point type.)

> I really appreciate your kind responses to my questions every time. 😄

Sure!