gvinciguerra / PGM-index

🏅State-of-the-art learned data structure that enables fast lookup, predecessor, range searches and updates in arrays of billions of items using orders of magnitude less space than traditional indexes
https://pgm.di.unipi.it/
Apache License 2.0
784 stars 92 forks source link

BUG: Wrong search result range for a special test case #50

Closed yangzq50 closed 6 months ago

yangzq50 commented 6 months ago

The following GoogleTest code will not pass:

class TestPGM : public BaseTest {};

using u32 = uint32_t;
using i32 = int32_t;

constexpr u32 TestNum = 100;

TEST_F(TestPGM, TestPGMI32) {
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_int_distribution<i32> distrib(std::numeric_limits<i32>::lowest(), std::numeric_limits<i32>::max());

    std::vector<i32> data(TestNum);
    std::generate(data.begin(), data.end(), [&] { return distrib(gen); });
    std::sort(data.begin(), data.end());
    PGMIndex<i32> pgm(data);
    for (u32 i = 0; i < TestNum; ++i) {
        auto [pos, lo, hi] = pgm.search(data[i]);
        EXPECT_LE(lo, i);
        EXPECT_GE(hi, i);
    }
}

This bug may be related to issue #49 ?

If I convert data to int64_t type, the search result will all be correct. For example, following GoogleTest code will pass:

class TestPGM : public BaseTest {};

using u32 = uint32_t;
using i32 = int32_t;
using i64 = int64_t;

constexpr u32 TestNum = 100;

TEST_F(TestPGM, TestPGMI32) {
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_int_distribution<i32> distrib(std::numeric_limits<i32>::lowest(), std::numeric_limits<i32>::max());

    std::vector<i64> data(TestNum);
    std::generate(data.begin(), data.end(), [&] { return distrib(gen); });
    std::sort(data.begin(), data.end());
    PGMIndex<i64> pgm(data);
    for (u32 i = 0; i < TestNum; ++i) {
        auto [pos, lo, hi] = pgm.search(data[i]);
        EXPECT_LE(lo, i);
        EXPECT_GE(hi, i);
    }
}
gvinciguerra commented 6 months ago

Thanks, it should be fixed now