gvinciguerra / PGM-index

šŸ…State-of-the-art learned data structure that enables fast lookup, predecessor, range searches and updates in arrays of billions of items using orders of magnitude less space than traditional indexes
https://pgm.di.unipi.it/
Apache License 2.0
781 stars 92 forks source link

Some keys cannot be found if pgm-index is built in single-thread #44

Closed min-guk closed 1 year ago

min-guk commented 1 year ago

While benchmarking the PGM-Index, I found that the key location could not be found when building with the make_segmentation function instead of the make_segmentation_par function. This issue also occurs in other datasets, but it is most frequent in the eth dataset provided by the GRE benchmark (https://github.com/gre4index/GRE). I've uploaded the test code below, so please check it out.

  1. Download PGM-Index and eth dataset.

    git clone https://github.com/gvinciguerra/PGM-index.git
    cd PGM-index
    wget --no-check-certificate -nc https://www.cse.cuhk.edu.hk/mlsys/gre/eth
  2. Modify the code to build the PGM-Index in a single thread.

    // file: PGM-index/include/pgm/pgm_index.hpp
    
    template<typename RandomIt>
    static void build(RandomIt first, RandomIt last,
                      size_t epsilon, size_t epsilon_recursive,
                      std::vector<Segment> &segments,
                      std::vector<size_t> &levels_offsets) {
        ...
        auto build_level = [&](auto epsilon, auto in_fun, auto out_fun) {
            // auto n_segments = internal::make_segmentation_par(last_n, epsilon, in_fun, out_fun);
            auto n_segments = internal::make_segmentation(last_n, epsilon, in_fun, out_fun);
       ...
  3. Change the test code

    
    #include "catch.hpp"
    #include "pgm/morton_nd.hpp"
    #include "pgm/pgm_index.hpp"
    #include "pgm/pgm_index_dynamic.hpp"
    #include "pgm/pgm_index_variants.hpp"
    #include "pgm/piecewise_linear_model.hpp"
    #include "utils.hpp"

include

include

include

include

include

include

include

include

include

include

include

include

include

// Loads values from binary file into vector. static std::vector load_data(const std::string& filename, bool print = true) { std::vector data;

std::ifstream in(filename, std::ios::binary);
if (!in.is_open()) {
    std::cerr << "unable to open " << filename << std::endl;
    exit(EXIT_FAILURE);
}
// Read size.
uint64_t size;
in.read(reinterpret_cast<char*>(&size), sizeof(uint64_t));
data.resize(size);
// Read values.
in.read(reinterpret_cast<char*>(data.data()), size * sizeof(uint64_t));
in.close();
return data;

}

template <typename Index, typename Data> void test_index(const Index &index, const Data &data) { auto rand = std::bind(std::uniform_int_distribution(0, data.size() - 1), std::mt19937{42});

for (auto i = 1; i <= 10000; ++i) {
    auto q = data[rand()];
    auto range = index.search(q);
    auto lo = data.begin() + range.lo;
    auto hi = data.begin() + range.hi;
    auto k = std::lower_bound(lo, hi, q);
    REQUIRE(*k == q);
}

// Test elements outside range
auto q = data.back() + 42;
auto range = index.search(q);
auto lo = data.begin() + range.lo;
auto hi = data.begin() + range.hi;
REQUIRE(std::lower_bound(lo, hi, q) == data.end());

q = std::numeric_limits<typename Data::value_type>::min();
range = index.search(q);
lo = data.begin() + range.lo;
hi = data.begin() + range.hi;
REQUIRE(std::lower_bound(lo, hi, q) == data.begin());

}

TEMPLATE_TEST_CASE_SIG("PGM-index", "", ((typename T, size_t E1, size_t E2), T, E1, E2), (uint64_t, 8, 4), (uint64_t, 32, 4), (uint64_t, 128, 4), (uint64_t, 256, 256), (uint64_t, 512, 512)) { auto data = load_data("./eth"); pgm::PGMIndex<T, E1, E2> index(data.begin(), data.end()); test_index(index, data); }

4. Build & Run

cmake . -DCMAKE_BUILD_TYPE=Release make -j8 ./test/tests


5. Test result

tests is a Catch v2.13.10 host application.
Run with -? for options

-------------------------------------------------------------------------------
PGM-index - uint64_t, 8, 4
-------------------------------------------------------------------------------
/home/PGM-index/test/tests.cpp:85
...............................................................................

/home/PGM-index/test/tests.cpp:68: FAILED:
  REQUIRE( *k == q )
with expansion:
  99249299 (0x5ea6c93)
  ==
  99249300 (0x5ea6c94)

-------------------------------------------------------------------------------
PGM-index - uint64_t, 32, 4
-------------------------------------------------------------------------------
/home/PGM-index/test/tests.cpp:85
...............................................................................

/home/PGM-index/test/tests.cpp:68: FAILED:
  REQUIRE( *k == q )
with expansion:
  99825789 (0x5f3387d)
  ==
  99825790 (0x5f3387e)

-------------------------------------------------------------------------------
PGM-index - uint64_t, 128, 4
-------------------------------------------------------------------------------
/home/PGM-index/test/tests.cpp:85
...............................................................................

/home/PGM-index/test/tests.cpp:68: FAILED:
  REQUIRE( *k == q )
with expansion:
  100297412 (0x5fa6ac4)
  ==
  100297414 (0x5fa6ac6)

===============================================================================
test cases:     5 |     2 passed | 3 failed
assertions: 21480 | 21477 passed | 3 failed
```
Thank you.
gvinciguerra commented 1 year ago

The last commit 1b062ec should solve this. Thanks