This PostgreSQL extension provides functionalities for BM25 text queries, generateing BM25 statistic sparse vectors for text. BM25 outperforms dense vector-based retrieval methods in many RAG benchmark tasks.
User can use vector search extensions such as pgvecto.rs
or pgvector
for efficient searches in postgres.
[!IMPORTANT]
Based on our initial tests, HNSW indexing does not support the sparse vectors generated by BM25 very well. The high sparsity prevents effective navigation within the graph.
CREATE EXTENSION pg_bestmatch;
SET search_path TO public, bm_catalog;
bm25_create(table_name, column_name, statistic_name);
. It will create a materilized view to record the stats. bm25_document_to_svector(statistic_name, passage)
bm25_query_to_svector(statistic_name, query)
bert-base-uncased
vocabulary set to tokenize words. Might support more configuration on tokenizer in the future.Here is an example workflow demonstrating the usage of this extension with the example of Stanford LoCo benchmark.
pg_bestmatch
with the dataset.wget https://huggingface.co/api/datasets/hazyresearch/LoCoV1-Documents/parquet/default/test/0.parquet -O documents.parquet
wget https://huggingface.co/api/datasets/hazyresearch/LoCoV1-Queries/parquet/default/test/0.parquet -O queries.parquet
import pandas as pd
from sqlalchemy import create_engine
import numpy as np
from psycopg2.extensions import register_adapter, AsIs
def adapter_numpy_float64(numpy_float64):
return AsIs(numpy_float64)
def adapter_numpy_int64(numpy_int64):
return AsIs(numpy_int64)
def adapter_numpy_float32(numpy_float32):
return AsIs(numpy_float32)
def adapter_numpy_int32(numpy_int32):
return AsIs(numpy_int32)
def adapter_numpy_array(numpy_array):
return AsIs(tuple(numpy_array))
register_adapter(np.float64, adapter_numpy_float64)
register_adapter(np.int64, adapter_numpy_int64)
register_adapter(np.float32, adapter_numpy_float32)
register_adapter(np.int32, adapter_numpy_int32)
register_adapter(np.ndarray, adapter_numpy_array)
db_url = "postgresql://localhost:5432/pg_bestmatch_test"
engine = create_engine(db_url)
def load_documents():
df = pd.read_parquet("documents.parquet")
df.to_sql("documents", engine, if_exists='replace', index=False)
def load_queries():
df = pd.read_parquet("queries.parquet")
df['answer_pids'] = df['answer_pids'].apply(lambda x: str(x[0]))
df.to_sql("queries", engine, if_exists='replace', index=False)
load_documents()
load_queries()
documents
table.SELECT bm25_create('documents', 'passage', 'documents_passage_bm25', 0.75, 1.2);
documents
and queries
tables and update the embeddings for documents and queries.ALTER TABLE documents ADD COLUMN embedding svector; -- for pgvecto.rs users
ALTER TABLE documents ADD COLUMN embedding sparsevec; -- for pgvector users
UPDATE documents SET embedding = bm25_document_to_svector('documents_passage_bm25', passage)::svector; -- for pgvecto.rs users
UPDATE documents SET embedding = bm25_document_to_svector('documents_passage_bm25', passage, 'pgvector')::sparsevec; -- for pgvector users
CREATE INDEX ON documents USING vectors (embedding svector_dot_ops); -- for pgvecto.rs users
CREATE INDEX ON documents USING ivfflat (embedding sparsevec_ip_ops); -- for pgvector users
ALTER TABLE queries ADD COLUMN embedding svector; -- for pgvecto.rs users
ALTER TABLE queries ADD COLUMN embedding sparsevec; -- for pgvector users
UPDATE queries SET embedding = bm25_query_to_svector('documents_passage_bm25', query)::svector; -- for pgvecto.rs users
UPDATE queries SET embedding = bm25_query_to_svector('documents_passage_bm25', query, 'pgvector')::sparsevec; -- for pgvector users
SELECT sum((array[answer_pids] = array(SELECT pid FROM documents WHERE queries.dataset = documents.dataset ORDER BY queries.embedding <#> documents.embedding LIMIT 1))::int) FROM queries;
This workflow showcases how to leverage BM25 text queries and vector search in PostgreSQL using this extension. The Top 1 recall of BM25 on this dataset is 0.77
. If you reproduce the result, your operations are correct.
Before building, you should have PostgreSQL
, Rust
and Cargo
installed on your system.
cargo-pgrx
.cargo install cargo-pgrx --version v0.12.0-alpha.1
cargo-pgrx
.cargo pgrx init --pg16=$(which pg_config) # assuming that you have PostgreSQL 16 installed
cargo pgrx install --release # if you want to install it on your machine
cargo pgrx package # if you want to package `pg_bestmatch`
pg_bestmatch.rs
only provides methods for generating sparse vectors and does not support index-based search (which can be achieved by pgvecto.rs or pgvector). pg_search
performs BM25 retrieval via the external tantivy
engine, which may have limitations when combined with transactions, filters, or JOIN operations. Since pg_bestmatch.rs
is entirely native to Postgres, it offers full compatibility with these operations inside postgres.tokenize
SELECT tokenize('i have an apple'); -- result: {i,have,an,apple}
bm25_create
SELECT bm25_create('documents', 'passage', 'documents_passage_bm25');
table_name
: Name of the table.column_name
: Name of the column.stat_name
: Name of the BM25 statistics.b
: BM25 parameter (default 0.75).k
: BM25 parameter (default 1.2).bm25_refresh
SELECT bm25_refresh('documents_passage_bm25');
stat_name
: Name of the BM25 statistics to update.bm25_drop
SELECT bm25_drop('documents_passage_bm25');
stat_name
: Name of the BM25 statistics to delete.bm25_document_to_svector
SELECT bm25_document_to_svector('documents_passage_bm25', 'document_text');
stat_name
: Name of the BM25 statistics.document_text
: The text of the document.style
: Emits pgvecto.rs
-style sparse vector or pgvector
-style sparse vector.bm25_query_to_svector
SELECT bm25_query_to_svector('documents_passage_bm25', 'We begin, as always, with the text.');
stat_name
: Name of the BM25 statistics.query_text
: The text of the query.style
: Emits pgvecto.rs
-style sparse vector or pgvector
-style sparse vector.