tensorchord / pgvecto.rs

Scalable, Low-latency and Hybrid-enabled Vector Search in Postgres. Revolutionize Vector Search, not Database.
https://docs.pgvecto.rs/getting-started/overview.html
Apache License 2.0
1.53k stars 60 forks source link

Question Re. Image Search Query With Distances Annotated As A New Column #501

Closed rsomani95 closed 1 week ago

rsomani95 commented 2 weeks ago

Originally started this discussion in Discord (I think with @VoVAllen) but posting here to keep the discussion more focused: https://discord.com/channels/974584200327991326/1220203706553860227/1245115744224542770

In spirit, what I'm trying to do is a standard vector query with an additional condition where the results are filtered by the cosine distance to the query vector.


Dataset Context

My dataset has 3 tables `collection`, `asset`, and `segment`, setup like so: ```sql CREATE TABLE collection ( id UUID PRIMARY KEY DEFAULT gen_random_uuid(), name TEXT NOT NULL ); CREATE TABLE asset ( id UUID PRIMARY KEY DEFAULT gen_random_uuid(), collection_id UUID REFERENCES collection(id) ON DELETE CASCADE, title TEXT NOT NULL ); CREATE TABLE segment ( id UUID PRIMARY KEY DEFAULT gen_random_uuid(), asset_id UUID REFERENCES asset(id) ON DELETE CASCADE, embedding vector(640) NOT NULL, -- 640 dimensions category TEXT NOT NULL ); ``` I'm using the `0.3.0-alpha2` docker container for all my tests. I then build some standard indexes on theese tables, followed by the HNSW index. Here is a fully reproducible script to reproduce the dataset: https://gist.github.com/rsomani95/56194e33364208019a09846cb9bacabb#file-build_dataset_with_indexes-sql-L1-L113

Queries

Before running any queries, I've set vectors.hnsw_ef_search=100 globally.

1. Correct results, but no index usage

This query does a brute force search. However, the results are useful because they serve as a ground truth for checking future results for other queries. In my case, I got 96 results (but this will vary because the embeddings are randomised)

WITH distances AS (
    SELECT
        segment.*,
        embedding <=> :embedding AS distance
    FROM
        segment
    INNER JOIN
        asset ON segment.asset_id = asset.id
    WHERE
        segment.category = 'visual'
        AND asset.collection_id = :collection_id
    ORDER BY
        distance
)
SELECT distances.id, distances.distance, distances.category
FROM distances
WHERE distance < 0.99
;

2. Adding LIMIT to use index, but incorrect results

As suggested in the thread here, I added a LIMIT under the ORDER BY clause to activate the index.

Here's the query with the LIMIT added:

WITH distances AS (
    SELECT
        segment.*,
        embedding <=> :embedding AS distance
    FROM
        segment
    INNER JOIN
        asset ON segment.asset_id = asset.id
    WHERE
        segment.category = 'visual'
        AND asset.collection_id = :collection_id
    ORDER BY
        distance
    LIMIT
        100
)
SELECT distances.id, distances.distance, distances.category
FROM distances
WHERE distance < 0.99
;

This does activate the index, but now only returns 1 row as opposed to 96 earlier. I suspect this is because the query is formulated incorrectly; in its current formulation, we must be doing post filtering (I am assuming that the vbase implementation is correct and not the root cause of the issue here)

I tried another approach, adding LIMIT to the bottom of the query:

WITH distances AS (
    SELECT
        segment.*,
        embedding <=> :embedding AS distance
    FROM
        segment
    INNER JOIN
        asset ON segment.asset_id = asset.id
    WHERE
        segment.category = 'visual'
        AND asset.collection_id = :collection_id
    ORDER BY
        distance
)
SELECT distances.id, distances.distance, distances.category
FROM distances
WHERE distance < 0.99
LIMIT 100
;

and while this uses the index, the query runs a lot slower, and still produces incorrect results.


I'd love some guidance on how to formulate the SQL above correctly to use the index and get correct results.

mertalev commented 2 weeks ago

You can get it to use the index and still do pre-filtering with a query like this:

SELECT
    segment.*,
    embedding <=> :embedding AS distance
FROM
    segment
INNER JOIN
    asset ON segment.asset_id = asset.id
WHERE
    distance < 0.99
    AND segment.category = 'visual'
    AND asset.collection_id = :collection_id
ORDER BY
    embedding <=> :embedding
LIMIT 100

This will still be slower than the post-filtering approach because it needs to evaluate the cosine distance for (potentially many) assets in the WHERE condition, but the recall should be correct.

I used embedding <=> :embedding instead of distance for ordering because I remember this being necessary for the index to get used. If the query plan is the same with distance, feel free to use that instead.

VoVAllen commented 2 weeks ago

The last SQL should be the right one. Does it still return only one result? Or multiple results but to be slow? And vector index is an approximate index which means the results might not be the exact same as the exact one (without using the index).'

For your cases, if you don't have too many categories, you can put category_id and vector in one table and create partial index on each category. This will accelerate the filter condition.

VoVAllen commented 2 weeks ago

Unfortunately it's hard to co-optimize the vector search and filter condition together. We have some internal experiments and only a very limited cases can have improvments.

rsomani95 commented 2 weeks ago

@mertalev Have you had success running that exact query? When I tried running it, I get this error saying "distance" is not defined:

ERROR:  column "distance" does not exist
LINE 10:     distance < 0.99

I also tried another variant of what you suggested but ran into the same error:

WITH distances AS (
    SELECT
        segment.*,
        embedding <=> :embedding
            AS distance
    FROM
        segment
    INNER JOIN
        asset ON segment.asset_id = asset.id
    WHERE
        distance < 0.99
        AND segment.category = 'visual'
        AND asset.collection_id = :collection_id
    ORDER BY
        embedding <=> :embedding
)
SELECT distances.id, distances.distance, distances.category
FROM distances
LIMIT 100
;
rsomani95 commented 2 weeks ago

@VoVAllen the last SQL I'd shared in the opening comment does give correct number of results, and yes I'm keeping in mind that the index gives approx matches so the results won't necessarily be identical to brute force. It takes ~16.2s to run

Here's the query plan visualised:

CleanShot 2024-06-14 at 11 28 54@2x

PS: You can click on "Raw" to see the exact query planner output and "Query" for the query that generated it. https://explain.dalibo.com/plan/345ga89gc5edefae

Unfortunately it's hard to co-optimize the vector search and filter condition together. We have some internal experiments and only a very limited cases can have improvments.

I see. I'm slightly confused by this, because I was under the impression that this was one of the promises of the VBASE algorithm? Are there ways you would adjust what my query is trying to achieve to leverage the indexes and vbase correctly that I'm missing out on?

rsomani95 commented 2 weeks ago

Just a thought: are compound indexes with a vector + non-vector column supported? If so, perhaps that could help?

if you don't have too many categories, you can put category_id and vector in one table and create partial index on each category. This will accelerate the filter condition.

Unfortunately, I don't think this will work for our use case

rsomani95 commented 2 weeks ago

Did some further experiments, and we found that returning the distance itself in a new column has negligible cost, and the part of the query responsible for the slowdown is the WHERE distance < 0.99

To be more speicifc, the commented out line is responsible for this query to go from 269ms -> 16200ms:

WITH distances AS (
    SELECT
        segment.id, segment.category,
        embedding <=> '[0.048492431640625,0.004207611083984375,0.020538330078125,-0.06622314453125,-0.009918212890625,0.0819091796875,-0.0247344970703125,-0.01090240478515625,0.0679931640625,-0.040924072265625,-0.0034122467041015625,0.072021484375,0.0008335113525390625,-0.006832122802734375,0.00824737548828125,-0.03314208984375,0.025909423828125,0.04852294921875,-0.0171661376953125,0.0048370361328125,-0.0230712890625,-0.01190185546875,-0.0043792724609375,0.01007080078125,-0.052276611328125,0.0180816650390625,-0.00858306884765625,0.0078582763671875,-0.050811767578125,0.007328033447265625,0.0712890625,0.033416748046875,-0.03240966796875,-0.0029735565185546875,-0.0399169921875,-0.07537841796875,-0.0631103515625,-0.02191162109375,-0.0292510986328125,0.063232421875,0.093017578125,0.022674560546875,-0.0186920166015625,0.0041046142578125,-0.0300140380859375,-0.0941162109375,0.038970947265625,0.036651611328125,-0.0223388671875,0.002079010009765625,0.00818634033203125,-0.0187225341796875,0.031982421875,-0.01050567626953125,-0.00460052490234375,-0.0517578125,-0.0269317626953125,-0.038055419921875,0.007781982421875,0.038482666015625,-0.05145263671875,0.00878143310546875,0.021575927734375,0.053375244140625,0.0005512237548828125,0.03448486328125,-0.01157379150390625,0.0237274169921875,-0.05712890625,0.099365234375,-0.0019664764404296875,0.056243896484375,-0.00019276142120361328,-0.027069091796875,0.1966552734375,-0.01096343994140625,0.00528717041015625,-0.0244140625,-0.0211944580078125,0.0116424560546875,-0.053863525390625,-0.042266845703125,-0.0241546630859375,0.016571044921875,0.01535797119140625,0.0203704833984375,-0.036376953125,0.029815673828125,-0.048858642578125,-0.027984619140625,0.0044097900390625,0.049072265625,-0.06146240234375,0.002292633056640625,-0.01023101806640625,-0.0404052734375,0.03216552734375,-0.0093536376953125,-0.01009368896484375,0.03692626953125,-0.0257568359375,0.0193939208984375,0.004894256591796875,0.005229949951171875,-0.007106781005859375,0.06427001953125,0.007354736328125,0.050048828125,0.05267333984375,0.044952392578125,-0.0908203125,-0.02679443359375,0.01000213623046875,0.0247955322265625,0.0106658935546875,-0.0261688232421875,0.0207672119140625,0.045806884765625,0.035491943359375,-0.014556884765625,0.0421142578125,-0.045654296875,-0.048065185546875,0.0234832763671875,-0.14111328125,-0.004230499267578125,0.024566650390625,-0.0244293212890625,0.00693511962890625,0.01145172119140625,0.040283203125,-0.06060791015625,0.00921630859375,0.029083251953125,-0.09417724609375,0.028656005859375,-0.00201416015625,0.09478759765625,-0.01023101806640625,0.0249786376953125,0.018310546875,-0.02783203125,-0.022979736328125,-0.032073974609375,-0.0009412765502929688,0.00971221923828125,0.0071563720703125,-0.06329345703125,-0.01146697998046875,0.0149078369140625,0.0007505416870117188,0.01476287841796875,-0.0672607421875,-0.05224609375,-0.004695892333984375,0.006256103515625,0.00579833984375,0.0203704833984375,-0.01812744140625,-0.003818511962890625,0.007167816162109375,0.041107177734375,-0.01318359375,0.01480865478515625,-0.05267333984375,-0.0017557144165039062,0.011199951171875,-0.0153656005859375,0.0550537109375,0.03387451171875,-0.0196990966796875,0.033782958984375,-0.0158233642578125,0.0,0.037567138671875,0.0020351409912109375,-0.0206451416015625,-0.0242767333984375,0.037933349609375,0.03271484375,-0.035003662109375,-0.0303802490234375,-0.0282440185546875,0.0411376953125,0.023040771484375,0.183349609375,0.034759521484375,-0.01788330078125,0.07275390625,-0.025360107421875,0.0416259765625,0.037567138671875,-0.04315185546875,-0.06610107421875,0.001766204833984375,0.06866455078125,0.0164642333984375,-0.014373779296875,-0.033905029296875,-0.01079559326171875,-0.03521728515625,0.07550048828125,-0.01229095458984375,0.0278778076171875,0.039764404296875,-0.046722412109375,-0.0186309814453125,0.033660888671875,0.00748443603515625,0.0007295608520507812,-0.0947265625,0.04132080078125,-0.0140380859375,0.01447296142578125,-0.051544189453125,-0.055419921875,0.02691650390625,0.0095062255859375,-0.01568603515625,0.004299163818359375,-0.041656494140625,-0.0280303955078125,0.07684326171875,0.00970458984375,0.04351806640625,-0.078125,-0.0384521484375,0.0222930908203125,-0.003314971923828125,-0.055999755859375,0.01373291015625,0.01485443115234375,-0.04168701171875,0.04571533203125,-0.04693603515625,-0.0166473388671875,-0.072998046875,-0.0240478515625,0.002040863037109375,0.0220794677734375,-0.0009369850158691406,-0.027069091796875,0.04248046875,-0.0750732421875,-0.0960693359375,-0.03765869140625,-0.0173187255859375,-0.0325927734375,0.050750732421875,-0.052520751953125,-0.038818359375,-0.177001953125,-0.0188751220703125,0.0130462646484375,0.04058837890625,0.042236328125,-0.03338623046875,-0.031982421875,0.04229736328125,-0.05474853515625,0.0287017822265625,0.007640838623046875,-0.033447265625,0.0033664703369140625,-0.055694580078125,-0.0714111328125,-0.036346435546875,0.00142669677734375,-0.03070068359375,0.0299530029296875,-0.03369140625,-0.00966644287109375,0.005718231201171875,0.0187225341796875,-0.0213775634765625,-0.0185546875,-0.04180908203125,-0.0450439453125,-0.0229034423828125,0.04693603515625,0.036163330078125,0.061004638671875,-0.0672607421875,0.0009226799011230469,-0.046600341796875,-0.0242767333984375,-0.058624267578125,-0.1256103515625,0.023223876953125,0.0295867919921875,-0.046234130859375,0.0163421630859375,0.050384521484375,-0.025115966796875,0.037994384765625,0.027984619140625,0.0261383056640625,-0.0117645263671875,-0.004352569580078125,0.00960540771484375,0.07293701171875,0.062225341796875,-0.01007080078125,-0.02685546875,-0.0005059242248535156,-0.0122833251953125,-0.024749755859375,0.036529541015625,-0.03192138671875,0.015869140625,0.0777587890625,0.0109405517578125,-0.050018310546875,0.06787109375,0.08477783203125,-0.035736083984375,0.01482391357421875,0.05218505859375,-0.004047393798828125,-0.0165252685546875,0.1019287109375,-0.04510498046875,0.022186279296875,0.005382537841796875,0.004451751708984375,0.01129913330078125,-0.048980712890625,0.047027587890625,-0.0146026611328125,-0.012939453125,-0.0210418701171875,-0.0308990478515625,-0.057220458984375,0.0005764961242675781,0.01168060302734375,-0.004848480224609375,0.01641845703125,-0.0226898193359375,0.0187835693359375,0.1494140625,0.055328369140625,-0.0178985595703125,-0.0169219970703125,-0.043792724609375,-0.0238800048828125,-0.013580322265625,0.0104522705078125,0.0526123046875,-0.033050537109375,0.00775909423828125,-0.025238037109375,0.00724029541015625,-0.025909423828125,-0.00861358642578125,0.040679931640625,0.1619873046875,0.0263519287109375,-0.0229034423828125,0.0101165771484375,-0.041107177734375,-0.0301361083984375,0.006916046142578125,-0.0175323486328125,0.01502227783203125,-0.0161895751953125,-0.01323699951171875,-0.004383087158203125,-0.0087432861328125,-0.0511474609375,0.00356292724609375,0.011383056640625,-0.0548095703125,-0.039947509765625,0.045318603515625,-0.01392364501953125,-0.01788330078125,0.025634765625,0.00868988037109375,-0.00244903564453125,0.0030956268310546875,0.0032634735107421875,0.06744384765625,0.0064697265625,0.005268096923828125,-0.014190673828125,-0.01392364501953125,0.01105499267578125,0.0307769775390625,-0.017425537109375,0.002033233642578125,0.034698486328125,-0.0250244140625,-0.00201416015625,-0.0233154296875,-0.019287109375,0.0181884765625,0.008575439453125,-0.03564453125,0.045379638671875,0.003162384033203125,0.0053863525390625,0.011322021484375,-0.01247406005859375,-0.04608154296875,0.02166748046875,0.0238189697265625,0.022125244140625,0.0035915374755859375,0.022064208984375,0.00432586669921875,-0.03546142578125,0.01241302490234375,0.02740478515625,0.004619598388671875,-0.09027099609375,-0.052764892578125,0.0325927734375,-0.02593994140625,0.0094146728515625,-0.01071929931640625,0.01453399658203125,0.00960540771484375,0.05364990234375,0.003635406494140625,-0.020416259765625,-0.0279083251953125,-0.004726409912109375,-0.007053375244140625,0.004291534423828125,-0.00629425048828125,-0.0200958251953125,-0.049163818359375,0.02899169921875,-0.06494140625,-0.0117034912109375,-0.04913330078125,0.0148468017578125,-0.070556640625,0.04052734375,0.0009059906005859375,0.021759033203125,-0.0352783203125,-0.0341796875,-0.050262451171875,0.00033593177795410156,0.043792724609375,-0.024658203125,0.0182952880859375,-0.01245880126953125,-0.01078033447265625,-0.0157012939453125,0.045257568359375,0.0224609375,-0.0308074951171875,0.06298828125,0.0051727294921875,0.00559234619140625,-0.025604248046875,-0.0242919921875,0.01508331298828125,-0.01375579833984375,0.0262298583984375,-0.043365478515625,-0.032958984375,-0.0101776123046875,-0.0250244140625,-0.0118408203125,0.0208282470703125,-0.006626129150390625,0.047698974609375,0.0027217864990234375,-0.023956298828125,-0.0016622543334960938,0.051513671875,-0.001529693603515625,-0.0250396728515625,-0.0016460418701171875,-0.0206451416015625,0.047576904296875,-0.06134033203125,0.03338623046875,-0.0251922607421875,0.17041015625,0.04351806640625,-0.07464599609375,0.03369140625,1.4543533325195312e-05,-0.0011463165283203125,-0.048004150390625,0.00016164779663085938,0.0247344970703125,-0.0215301513671875,0.03076171875,-0.016937255859375,-0.0943603515625,-0.026824951171875,0.01163482666015625,0.039276123046875,-0.01502227783203125,0.00400543212890625,0.01165771484375,0.0169677734375,-0.033050537109375,-0.03863525390625,-0.09918212890625,0.04339599609375,-0.04229736328125,-0.005825042724609375,-0.030731201171875,-0.0289306640625,0.0068206787109375,0.0158538818359375,0.043853759765625,0.005344390869140625,-0.0231170654296875,0.0287017822265625,0.01654052734375,0.041961669921875,0.005596160888671875,0.0307769775390625,0.0102081298828125,0.03472900390625,0.0164337158203125,0.00878143310546875,0.061492919921875,-0.0435791015625,-0.08380126953125,0.0120391845703125,-0.020660400390625,-0.049346923828125,0.03900146484375,-0.03497314453125,0.00788116455078125,-0.053192138671875,0.0191497802734375,0.02227783203125,0.0333251953125,-0.00919342041015625,-0.104248046875,-0.049774169921875,-0.0645751953125,0.028350830078125,-0.03839111328125,-0.01654052734375,-0.00263214111328125,-0.007259368896484375,0.017120361328125,-0.00391387939453125,-0.0848388671875,0.006214141845703125,-0.0175018310546875,-0.051971435546875,-0.048828125,-0.08807373046875,0.00550079345703125,0.030303955078125,0.00983428955078125,-0.0084686279296875,0.01213836669921875,0.01532745361328125,0.003978729248046875,0.03924560546875,-0.02044677734375,0.0391845703125,-0.051513671875,-0.02532958984375,-0.0233612060546875,0.03936767578125,-0.0286712646484375,-0.037811279296875,0.05712890625,0.0418701171875,0.0294036865234375,0.006137847900390625,-0.0252532958984375,0.003887176513671875,-0.0084991455078125,0.003604888916015625,-0.01824951171875,0.04150390625,-0.08380126953125,-0.00284576416015625,0.01535797119140625,0.0009307861328125,-0.00376129150390625,0.01319122314453125,-0.057586669921875,0.0261383056640625,0.01068878173828125,0.0295867919921875,-0.00908660888671875,-0.0201873779296875,-0.01419830322265625,0.0159149169921875,0.06597900390625,-0.0137176513671875,0.014495849609375,0.01540374755859375,0.01279449462890625,-0.0262298583984375,-0.0182952880859375,0.04644775390625,-0.00547027587890625,-0.0021648406982421875,-0.03570556640625,-0.04400634765625,0.00457000732421875,0.01441192626953125,0.0352783203125,0.043182373046875,-0.0010404586791992188,-0.032196044921875,-0.0079345703125,-0.043792724609375,-0.016204833984375,0.05029296875,0.004558563232421875,-0.055328369140625,-0.0265960693359375,0.028350830078125,0.03192138671875,0.029693603515625,0.040435791015625,0.004878997802734375,-0.034515380859375,0.025146484375,-0.01326751708984375,-0.0140838623046875,-0.0008788108825683594,-0.0230865478515625,0.0088653564453125,0.00522613525390625,-0.0179290771484375,0.0303802490234375,0.032928466796875,-0.0184326171875,0.033935546875,-0.002887725830078125,-0.0090484619140625,0.0176849365234375]'
            AS distance
    FROM
        segment
    INNER JOIN
        asset ON segment.asset_id = asset.id
    WHERE
        segment.category = 'visual'
        AND asset.collection_id = 'ed0bbe8d-ce06-4b49-9e01-0d763439fa8a'
    ORDER BY
        distance
)
SELECT distances.id, distances.distance, distances.category
FROM distances
-- WHERE distances.distance < 0.99
LIMIT 100
;
mertalev commented 2 weeks ago

@rsomani95 Sorry, the earlier one was from memory. The corrected query is this:

SELECT
    segment.*
FROM
    segment
INNER JOIN
    asset ON segment.asset_id = asset.id
WHERE
    embedding <=> :embedding < 0.99
    AND segment.category = 'visual'
    AND asset.collection_id = :collection_id
ORDER BY
    embedding <=> :embedding
LIMIT 100;
rsomani95 commented 1 week ago

@mertalev thank you.

This query seems to have similar results to my last comment, where it takes ~16.2s, and removing the embedding <=> :embedding < 0.99 takes ~375ms

mertalev commented 1 week ago

I think it's because the limit is 100, but there are only 96 results that meet the threshold. It will end up checking every embedding in the table to find another 4 matches. Setting the limit to 96 should be much faster.

I wonder if a way to stop early would help performance here. The ordering should be informative of when there's no longer a point to keep searching.