š State-of-the-art learned data structure that enables fast lookup, predecessor, range searches and updates in arrays of billions of items using orders of magnitude less space than traditional indexes
While benchmarking the PGM-Index, I found that the key location could not be found when building with the make_segmentation function instead of the make_segmentation_par function. This issue also occurs in other datasets, but it is most frequent in the eth dataset provided by the GRE benchmark (https://github.com/gre4index/GRE). I've uploaded the test code below, so please check it out.
Download PGM-Index and eth dataset.
git clone https://github.com/gvinciguerra/PGM-index.git
cd PGM-index
wget --no-check-certificate -nc https://www.cse.cuhk.edu.hk/mlsys/gre/eth
Modify the code to build the PGM-Index in a single thread.
// file: PGM-index/include/pgm/pgm_index.hpp
template<typename RandomIt>
static void build(RandomIt first, RandomIt last,
size_t epsilon, size_t epsilon_recursive,
std::vector<Segment> &segments,
std::vector<size_t> &levels_offsets) {
...
auto build_level = [&](auto epsilon, auto in_fun, auto out_fun) {
// auto n_segments = internal::make_segmentation_par(last_n, epsilon, in_fun, out_fun);
auto n_segments = internal::make_segmentation(last_n, epsilon, in_fun, out_fun);
...
template <typename Index, typename Data>
void test_index(const Index &index, const Data &data) {
auto rand = std::bind(std::uniform_int_distribution(0, data.size() - 1), std::mt19937{42});
for (auto i = 1; i <= 10000; ++i) {
auto q = data[rand()];
auto range = index.search(q);
auto lo = data.begin() + range.lo;
auto hi = data.begin() + range.hi;
auto k = std::lower_bound(lo, hi, q);
REQUIRE(*k == q);
}
// Test elements outside range
auto q = data.back() + 42;
auto range = index.search(q);
auto lo = data.begin() + range.lo;
auto hi = data.begin() + range.hi;
REQUIRE(std::lower_bound(lo, hi, q) == data.end());
q = std::numeric_limits<typename Data::value_type>::min();
range = index.search(q);
lo = data.begin() + range.lo;
hi = data.begin() + range.hi;
REQUIRE(std::lower_bound(lo, hi, q) == data.begin());
While benchmarking the PGM-Index, I found that the key location could not be found when building with the make_segmentation function instead of the make_segmentation_par function. This issue also occurs in other datasets, but it is most frequent in the eth dataset provided by the GRE benchmark (https://github.com/gre4index/GRE). I've uploaded the test code below, so please check it out.
Download PGM-Index and eth dataset.
Modify the code to build the PGM-Index in a single thread.
Change the test code
include
include
include
include
include
include
include
include
include
include
include
include
include
// Loads values from binary file into vector. static std::vector load_data(const std::string& filename,
bool print = true) {
std::vector data;
}
template <typename Index, typename Data> void test_index(const Index &index, const Data &data) { auto rand = std::bind(std::uniform_int_distribution(0, data.size() - 1), std::mt19937{42});
}
TEMPLATE_TEST_CASE_SIG("PGM-index", "", ((typename T, size_t E1, size_t E2), T, E1, E2), (uint64_t, 8, 4), (uint64_t, 32, 4), (uint64_t, 128, 4), (uint64_t, 256, 256), (uint64_t, 512, 512)) { auto data = load_data("./eth"); pgm::PGMIndex<T, E1, E2> index(data.begin(), data.end()); test_index(index, data); }
cmake . -DCMAKE_BUILD_TYPE=Release make -j8 ./test/tests