Closed javiabellan closed 5 months ago
Hi, Will your query also look like this?
CREATE TABLE t
(
`id` Int64,
`emb` Array(Float32), # Eg. Image embedding
`several_embs` Array(Array(Float32)) # Eg. Face embeddings
)
ENGINE = MergeTree
ORDER BY id
INSERT INTO t (id, emb, several_embs) VALUES
(101, [1,2,3], [[1,2,3],[3,2,1]] ),
(102, [3,2,1], [[3,1,2],[2,1,3],[1,1,3]] )
# RANGE SEARCH (WHERE queries)
SELECT id, score FROM t WHERE cosineDistance(emb, [2,2,2]) AS score < MaxDistance
SELECT id, score FROM t WHERE cosineDistance(several_embs, [2,2,2]) AS score < MaxDistance
# TOP-K SEARCH (ORDER BY queries)
SELECT id, score FROM t ORDER BY cosineDistance(emb, [2,2,2]) AS score # This is currently possible
SELECT id, score FROM t ORDER BY cosineDistance(several_embs, [2,2,2]) AS score # This is the requested feature
Hi, you can actually use group by to fulfill your requests.
Here's an example using image and cloth embedding as example, each image can have multiple cloth vectors:
DROP TABLE IF EXISTS image_clothing_embeddings;
CREATE TABLE image_clothing_embeddings (
image_id INT, -- Unique ID for each image
clothing_item_id INT, -- Unique ID for each clothing item in an image
clothing_embedding VECTOR(3) NOT NULL -- Embedding vector for the clothing item
);
-- Optional: create indices for faster retrieval by image or clothing item
-- CREATE INDEX ON image_clothing_embeddings USING btree (image_id);
-- CREATE INDEX ON image_clothing_embeddings USING btree (clothing_item_id);
-- Create an index on the embedding vectors using a suitable vector operations index type
CREATE INDEX ON image_clothing_embeddings USING vectors (clothing_embedding l2_ops);
-- Inserting random embedding data simulating clothing items in images
INSERT INTO image_clothing_embeddings (image_id, clothing_item_id, clothing_embedding)
SELECT
image, clothing, ARRAY[random(), random(), random()]::real[]
FROM
generate_series(1, 500) AS image, -- For 500 images
generate_series(1, 5) AS clothing; -- Assuming each image has 5 pieces of clothing
-- Query to find images with clothing items that closely match the embedding of a query t-shirt
WITH nearest_clothing_items AS (
SELECT
image_id,
clothing_embedding <-> ARRAY[0.5, 0.5, 0.5]::real[] AS distance -- Example embedding for the query t-shirt
FROM
image_clothing_embeddings
ORDER BY
distance
)
SELECT
image_id,
MIN(distance) AS closest_match_distance
FROM
nearest_clothing_items
GROUP BY
image_id
ORDER BY
closest_match_distance
LIMIT 10;
-- The execution plan will include sorting and limiting the output to the top 10 images with the closest matching clothing items.
Thanks a lot. Separate the data in 2 tables (one for images and other for clothes); and a query-time use a CTE to perform a group by. Magnificent! Thanks again for your awesome library :)
Glad to see it helps!
I would like like to do something similar to this, where a table field contains an array of N embeddings
Fashion search example:
the table "image" with the "clothes" column, where each row contains N "clothes" embeddings (dress, shoes, ...). Therefore I can look for the images that contains a given query t-shirt embedding.
I know that postgre supports matrixes (array of arrays) but i don know if we can index an array of vectors in pgvecto.rs
Thanks.