Closed therealdarkknight closed 10 months ago
Thank you @therealdarkknight ! I was curious, what if we build the index on empty table with let's say 1 dimension and then on the first insert if table is empty update the index header (with exclusive lock) by inferring the dimensions from the current tuple being inserted?
Will there be any problems with that approach or does postponing the index build have more advantages, what do you think?
cc: @Ngalstyan4
I agree with @var77's suggestion. That makes things cleaner.
Let's postpone this for now tho and concentrate on other things first.
Superseded by #251
Fixes #198
In the event of an index being built on an empty table where no dimension is specified in the options (using
CREATE INDEX... WITH (dim=k)
), we postpone the creation of this index to the first insertion, where we can get the dimension of the inserted vector and then build the index. This improves user UX as outlined in the issue.I updated tests, and expanded on the
hnsw_create
test to test the above functionality.Notes/Observations:
One thing to consider is what might happen when we have concurrent inserts/calls to aminsert when the index build is postponed. We don't want to trigger the index build more than once accidentally. Perhaps we should pass a Boolean state that marks the progress of a postponed index build and check against it? I have yet to look into Postgres' concurrency model. Feel free to share ideas.
The
BuildCallback
immediately returns after the first tuple is read (explained in the comment there). This was done for simplicity's sake. But this also means that postgres is pointlessly calling this callback for every other tuple in the heap that exists at that time, since each callback after the first one will immediately return. So a future optimization may be, on postponed builds, to only read the first row and write it to the index without performing the full heap scan with the callback. However, note that this is only relevant in a specific scenario: when we construct an index on an empty table, specify no dimension, and do a bulk insert of some sort as part of the first insert (like\COPY
on acsv
file), which is when the heap would have more than one tuple at the time of building the postponed index. Since this scenario is pretty rare, we already have a simple implementation, and we are not guaranteed significant performance boosts if we opt for the alternative, I think that the current approach is well worth it.I noticed that the original codebase, when inferring a dimension (via the
GetArrayLengthFromHeap
method), we ditch inference entirely upon encountering aNULL
vector. This means that we're not able to inference when the first encountered tuple isNULL
but there are other non-NULL tuples. Wanted to point this out-- I don't think it would be a hard change: we can just keep going until we find a non-NULL vector or reach the end.I believe some of the code in
GetHnswIndexDimensions
can be refactored. It basically tries to do the same thing asInferDimension
, and both useGetArrayLength
. Idea is to haveGetHnswIndexDimensions
just callInferDimension
and then refactor theInitBuildState
as well, since it is currently calling both of these methods under certain circumstances which leads to the same operation effectively being done twice. But the way thatGetHnswIndexDimensions
andInferDimension
useGetArraylength
is also slightly different. Is the relevant part ofGetHnswIndexDimensions
andInferDimension
trying to accomplish the same thing? Curious to hear thoughts on this.