tensorchord / pgvecto.rs

Scalable, Low-latency and Hybrid-enabled Vector Search in Postgres. Revolutionize Vector Search, not Database.
https://docs.pgvecto.rs/getting-started/overview.html
Apache License 2.0
1.69k stars 66 forks source link

Can I index array of vectors? #470

Closed javiabellan closed 5 months ago

javiabellan commented 5 months ago

I would like like to do something similar to this, where a table field contains an array of N embeddings

Fashion search example:

the table "image" with the "clothes" column, where each row contains N "clothes" embeddings (dress, shoes, ...). Therefore I can look for the images that contains a given query t-shirt embedding.

I know that postgre supports matrixes (array of arrays) but i don know if we can index an array of vectors in pgvecto.rs

Thanks.

gaocegege commented 5 months ago

Hi, Will your query also look like this?

CREATE TABLE t
(
    `id` Int64,
    `emb` Array(Float32), # Eg. Image embedding
    `several_embs` Array(Array(Float32)) # Eg. Face embeddings
)
ENGINE = MergeTree
ORDER BY id

INSERT INTO t (id, emb, several_embs) VALUES
    (101, [1,2,3], [[1,2,3],[3,2,1]] ),
    (102, [3,2,1], [[3,1,2],[2,1,3],[1,1,3]] )

# RANGE SEARCH (WHERE queries)
SELECT id, score FROM t WHERE cosineDistance(emb,  [2,2,2]) AS score < MaxDistance
SELECT id, score FROM t WHERE cosineDistance(several_embs,  [2,2,2]) AS score < MaxDistance

# TOP-K SEARCH (ORDER BY queries)
SELECT id, score FROM t ORDER BY cosineDistance(emb, [2,2,2]) AS score # This is currently possible
SELECT id, score FROM t ORDER BY cosineDistance(several_embs, [2,2,2]) AS score # This is the requested feature
VoVAllen commented 5 months ago

Hi, you can actually use group by to fulfill your requests.

Here's an example using image and cloth embedding as example, each image can have multiple cloth vectors:

DROP TABLE IF EXISTS image_clothing_embeddings;

CREATE TABLE image_clothing_embeddings (
    image_id INT,                      -- Unique ID for each image
    clothing_item_id INT,              -- Unique ID for each clothing item in an image
    clothing_embedding VECTOR(3) NOT NULL  -- Embedding vector for the clothing item
);

-- Optional: create indices for faster retrieval by image or clothing item
-- CREATE INDEX ON image_clothing_embeddings USING btree (image_id);
-- CREATE INDEX ON image_clothing_embeddings USING btree (clothing_item_id);

-- Create an index on the embedding vectors using a suitable vector operations index type
CREATE INDEX ON image_clothing_embeddings USING vectors (clothing_embedding l2_ops);

-- Inserting random embedding data simulating clothing items in images
INSERT INTO image_clothing_embeddings (image_id, clothing_item_id, clothing_embedding)
    SELECT 
        image, clothing, ARRAY[random(), random(), random()]::real[] 
    FROM
        generate_series(1, 500) AS image,  -- For 500 images
        generate_series(1, 5) AS clothing; -- Assuming each image has 5 pieces of clothing

-- Query to find images with clothing items that closely match the embedding of a query t-shirt
WITH nearest_clothing_items AS (
    SELECT 
        image_id,
        clothing_embedding <-> ARRAY[0.5, 0.5, 0.5]::real[] AS distance  -- Example embedding for the query t-shirt
    FROM 
        image_clothing_embeddings
    ORDER BY
        distance
)
SELECT 
    image_id, 
    MIN(distance) AS closest_match_distance
FROM 
    nearest_clothing_items
GROUP BY 
    image_id
ORDER BY 
    closest_match_distance
LIMIT 10;

-- The execution plan will include sorting and limiting the output to the top 10 images with the closest matching clothing items.
javiabellan commented 5 months ago

Thanks a lot. Separate the data in 2 tables (one for images and other for clothes); and a query-time use a CTE to perform a group by. Magnificent! Thanks again for your awesome library :)

VoVAllen commented 5 months ago

Glad to see it helps!