tensorchord / pg_bestmatch.rs

Generate BM25 sparse vector inside PostgreSQL
Apache License 2.0

Unable to create index on svector without dimension #21

Open tucnak opened 1 week ago

tucnak commented 1 week ago

I'm trying to follow the tutorial in the README with the most recent pgvecto.rs; however, I ran into this issue:

CREATE INDEX ON documents USING vectors (embedding svector_dot_ops);

-- ERROR: pgvecto.rs: Dimensions type modifier of a vector column is needed for building the index.

Are we even supposed to pick a dimension for svector, and if so, how?

jwnz commented 5 days ago

The documentation definitely needs to be updated, but in the meantime you can try something like this:

-- create a bm25 matrix
SELECT bm25_create('documents', 'passage', 'documents_passage_bm25', 'hf', 'google-bert/bert-base-uncased', 0.75, 1.2);

-- convert a string to a sparse vector to get the dimension (the number after the '}/')
SELECT bm25_document_to_svector('documents_passage_bm25', 'Some test string');
 -- {24058:0.7689637, 24688:0.7689637, 25455:0.7689637}/28111

-- add embedding column
ALTER TABLE documents ADD COLUMN embedding svector(28111);

-- create index
CREATE INDEX ON documents USING vectors (embedding svector_dot_ops);

-- embed column using specified bm25 matrix
UPDATE documents SET embedding = bm25_document_to_svector('documents_passage_bm25', documents.passage)::svector;

-- Query
-- get the query's vector
SELECT bm25_query_to_svector('documents_passage_bm25', 'Where did Brooklyn Sudano''s mother die?');
-- {2927:0.76834136, 5132:0.76834136, 6102:0.76834136, 8652:0.76834136, 11558:0.76834136, 11560:0.76834136, 18712:0.76834136, 22788:0.76834136, 24841:0.76834136, 27195:0.76834136}/28111

-- find 10 most relevant documents
SELECT d.passage, 1 - (d.embedding <=> '{2927:0.056427535, 5132:0.021093048, 6102:0.045897257, 8652:0.24935675, 11558:0.037319094, 11560:0.15588555, 18712:0.12755758, 22788:0.013146327, 24841:0.26317492, 27195:0.030141948}/28111') as score
FROM documents d
ORDER BY score DESC
LIMIT 10;
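
If you prefer not to copy the vector literal around, you can also compute the query vector inline. This is just a sketch along the same lines, assuming bm25_query_to_svector can be cast to svector the same way bm25_document_to_svector is in the UPDATE above:

-- find the 10 most relevant documents, computing the query vector inline
SELECT d.passage,
       1 - (d.embedding <=> bm25_query_to_svector('documents_passage_bm25', 'Where did Brooklyn Sudano''s mother die?')::svector) AS score
FROM documents d
ORDER BY score DESC
LIMIT 10;
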
tucnak commented 5 days ago

Forgive me, I don't exactly follow. From https://blog.pgvecto.rs/pgbestmatchrs-elevate-your-postgresql-text-queries-with-bm25 I was led to believe that pg_bestmatch.rs is a complementary extension, in the sense that it introduces BM25 full-text search capability as a means to hybrid search. I personally found the idea appealing: using sparse vectors alongside the dense vector type I already use via pgvecto.rs for embeddings.

However, you then mention google-bert/bert-base-uncased. Isn't BERT a completely different method altogether? The document vectors pg_bestmatch.rs generated for me with the README code all end in /489. Is this a constant of some kind, or how else would I derive it? I couldn't find it by grepping the code.

Perhaps this library is not the solution I thought it would be for implementing hybrid search on top of pgvecto.rs?

VoVAllen commented 5 days ago

@tucnak Hi, can you reproduce the example provided by @jwnz? This extension makes BM25 search partially possible inside Postgres, but it is not an end-to-end solution yet. We're writing a brand-new one that tries to solve this in an end-to-end manner. Hopefully we can have it ready by mid-November.

tucnak commented 5 days ago

Falls apart! I'm working with a Ukrainian dataset, and I get {...}/1641 for English documents but {}/1641, i.e. empty, for Ukrainian documents. I still don't understand where the dimensions are coming from, or why any of this is necessary for BM25, which is a pretty simple statistical ranking function, is it not?

Perhaps we should just wait for this end-to-end solution you're talking about, or try pg_search in the meantime.

jwnz commented 5 days ago

@tucnak Would you be able to share your SQL? The dimension comes from the number of unique tokens present in the column used to build the BM25 matrix.
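
For example, you can sanity-check a single document against the matrix before embedding the whole table. The matrix name and the Ukrainian test string below are just placeholders from the earlier example:

-- an empty result ({}/N) means none of this document's tokens received a
-- weight from the BM25 matrix, which matches the {}/1641 you're seeing
SELECT bm25_document_to_svector('documents_passage_bm25', 'перевірка українською мовою');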

VoVAllen commented 5 days ago

@tucnak Can you try one of the newly added tokenizers, like tiktoken o200k? It should be a multilingual one.

VoVAllen commented 5 days ago

SELECT bm25_create('documents', 'passage', 'documents_passage_bm25', 'tiktoken', 'o200k_base', 0.75, 1.2);
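
As a follow-up sketch, reusing the names from jwnz's example (N below is a placeholder for whatever vocabulary size the new matrix reports): a different tokenizer produces a different vocabulary, so the svector dimension changes and the embedding column has to be re-derived.

-- re-derive the dimension: the number after '}/' is the new vocabulary size N
SELECT bm25_document_to_svector('documents_passage_bm25', 'Some test string');

-- then recreate the embedding column with that dimension and re-embed
-- (replace N with the value reported above)
ALTER TABLE documents DROP COLUMN IF EXISTS embedding;
ALTER TABLE documents ADD COLUMN embedding svector(N);
UPDATE documents SET embedding = bm25_document_to_svector('documents_passage_bm25', documents.passage)::svector;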