lanterndata / lantern

PostgreSQL vector database extension for building AI applications
https://lantern.dev
GNU Affero General Public License v3.0

Dynamically scale usearch index on construction #122

Closed by ezra-varady 1 year ago

ezra-varady commented 1 year ago

This replaces the fixed-size allocation in BuildIndex with scaling during tuple insertion. It alters cost estimates slightly, but not significantly.
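
For context, a minimal sketch of what the growth check on the insert path could look like, assuming usearch's C API (usearch_size, usearch_capacity, usearch_reserve); the helper name, header path, and doubling policy are illustrative, not the exact code in this PR:

    #include <usearch.h>

    // Illustrative helper (not the PR's actual code): grow the usearch
    // reservation whenever the index has filled its current capacity.
    // Called once per tuple insertion, this replaces the single
    // up-front allocation.
    static void
    ensure_index_capacity(usearch_index_t index)
    {
        usearch_error_t error = NULL;
        size_t          size = usearch_size(index, &error);
        size_t          capacity = usearch_capacity(index, &error);

        // Doubling keeps the number of reserve calls logarithmic in
        // the final row count; the factor 2 is an assumption here.
        if (size >= capacity)
            usearch_reserve(index, capacity * 2, &error);
    }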

var77 commented 1 year ago

Hi @ezra-varady, thanks for the improvement! I have an idea about the initial index size reservation: maybe we can estimate the row count and set the initial index size based on that? I wrote a simple PoC and tested it on 1-6k rows; it seems to work okay, but I am not sure about larger data sizes or the cost of these operations.

    BlockNumber numBlocks = RelationGetNumberOfBlocks(heap);
    uint32_t    estimated_row_count = 0;
    if (numBlocks > 0) {
        // Read the first block of the heap
        Buffer buffer = ReadBufferExtended(heap, MAIN_FORKNUM, 0, RBM_NORMAL, NULL);
        // Lock the buffer so there won't be any new writes during this operation
        LockBuffer(buffer, BUFFER_LOCK_SHARE);
        // Interpret the raw block buffer as a Page
        Page         page = BufferGetPage(buffer);
        // Get the maximum tuple offset on the page, i.e. the tuple count of the first page
        OffsetNumber offset = PageGetMaxOffsetNumber(page);
        // Estimate the row count of the table. There are 3 cases:
        // 1 - new data is added and numBlocks increases. The estimate will be lower
        //     than the actual number (we need to check and grow the index)
        // 2 - the last page is not completely full (the most likely case). The
        //     estimate will be slightly higher than the actual number
        // 3 - the last page is completely full and the estimate matches the table's
        //     row count exactly (a very rare case, I think)
        estimated_row_count = offset * numBlocks;
        // Unlock and release the buffer
        LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
        ReleaseBuffer(buffer);
    }
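
To make the arithmetic concrete (numbers purely illustrative): if the first page holds 120 tuples and the table has 50 blocks, the estimate is 120 * 50 = 6000 rows; that is exact in case 3 and too high by less than one page's worth of tuples in case 2.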

cc: @Ngalstyan4

ezra-varady commented 1 year ago

Good idea! Let me incorporate this.

ezra-varady commented 1 year ago

Fixed the issues mentioned above. @var77 I'm not sure what you mean, can you say more? Also, I think Varik's code is accurate enough that we never have to resize after the first guess in the current tests, so we may want to find a way to force that case.
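
One way to exercise the growth path in tests (a hedged, test-only sketch; the LANTERN_FORCE_RESIZE_TEST guard and the halving are illustrative, not existing code):

    #ifdef LANTERN_FORCE_RESIZE_TEST
        // Deliberately under-reserve so the insert path is forced to
        // grow the index even when the estimate is accurate.
        estimated_row_count = Max(1, estimated_row_count / 2);
    #endif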

var77 commented 1 year ago

> Fixed the issues mentioned above. @var77 I'm not sure what you mean, can you say more? Also, I think Varik's code is accurate enough that we never have to resize after the first guess in the current tests, so we may want to find a way to force that case.

I was thinking that if the estimates are accurate we may be off by at most one page, but I just talked with Narek and it seems TOAST-ed ~rows~ columns may work differently, so we should check how the estimates behave there. I will try to read about TOAST columns and do some tests during the day.
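
One cheap cross-check worth considering (a sketch, not tested against TOAST-heavy tables): pg_class.reltuples counts heap tuples directly, so it is not skewed by variable tuple widths such as TOAST pointers versus inline values, and it is already available on the Relation:

    // Sketch: prefer the analyzed row count when available.
    // heap->rd_rel->reltuples is -1 on PostgreSQL 14+ (0 on older
    // versions) if the table has never been analyzed; keep the
    // page-based estimate in that case.
    if (heap->rd_rel->reltuples > 0)
        estimated_row_count = (uint32_t) heap->rd_rel->reltuples;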