lanterndata / lantern

PostgreSQL vector database extension for building AI applications
https://lantern.dev
GNU Affero General Public License v3.0
790 stars 57 forks source link

Postpone index build for empty table to first insert #209

Closed therealdarkknight closed 10 months ago

therealdarkknight commented 1 year ago

Fixes #198

In the event of an index being built on an empty table where no dimension is specified in the options (using CREATE INDEX... WITH (dim=k)), we postpone the creation of this index to the first insertion, where we can get the dimension of the inserted vector and then build the index. This improves user UX as outlined in the issue.

I updated tests, and expanded on the hnsw_create test to test the above functionality.

Notes/Observations:

  1. One thing to consider is what might happen when we have concurrent inserts/calls to aminsert when the index build is postponed. We don't want to trigger the index build more than once accidentally. Perhaps we should pass a Boolean state that marks the progress of a postponed index build and check against it? I have yet to look into Postgres' concurrency model. Feel free to share ideas.

  2. The BuildCallback immediately returns after the first tuple is read (explained in the comment there). This was done for simplicity's sake. But this also means that postgres is pointlessly calling this callback for every other tuple in the heap that exists at that time, since each callback after the first one will immediately return. So a future optimization may be, on postponed builds, to only read the first row and write it to the index without performing the full heap scan with the callback. However, note that this is only relevant in a specific scenario: when we construct an index on an empty table, specify no dimension, and do a bulk insert of some sort as part of the first insert (like \COPY on a csv file), which is when the heap would have more than one tuple at the time of building the postponed index. Since this scenario is pretty rare, we already have a simple implementation, and we are not guaranteed significant performance boosts if we opt for the alternative, I think that the current approach is well worth it.

  3. I noticed that the original codebase, when inferring a dimension (via the GetArrayLengthFromHeap method), we ditch inference entirely upon encountering a NULL vector. This means that we're not able to inference when the first encountered tuple is NULL but there are other non-NULL tuples. Wanted to point this out-- I don't think it would be a hard change: we can just keep going until we find a non-NULL vector or reach the end.

  4. I believe some of the code in GetHnswIndexDimensions can be refactored. It basically tries to do the same thing as InferDimension, and both use GetArrayLength. Idea is to have GetHnswIndexDimensions just call InferDimension and then refactor the InitBuildState as well, since it is currently calling both of these methods under certain circumstances which leads to the same operation effectively being done twice. But the way that GetHnswIndexDimensions and InferDimension use GetArraylength is also slightly different. Is the relevant part of GetHnswIndexDimensions and InferDimension trying to accomplish the same thing? Curious to hear thoughts on this.

var77 commented 1 year ago

Thank you @therealdarkknight ! I was curious, what if we build the index on empty table with let's say 1 dimension and then on the first insert if table is empty update the index header (with exclusive lock) by inferring the dimensions from the current tuple being inserted?

Will there be any problems with that approach or does postponing the index build have more advantages, what do you think?

cc: @Ngalstyan4

Ngalstyan4 commented 1 year ago

I agree with @var77's suggestion. That makes things cleaner.

Let's postpone this for now tho and concentrate on other things first.

Ngalstyan4 commented 10 months ago

Superseded by #251