galtay-tempus opened 7 months ago
Does "collection_name" work? https://github.com/langchain-ai/langchain-postgres/blob/main/langchain_postgres/vectorstores.py#L262
I think the issue is that the current setup creates two tables, `langchain_pg_collection` and `langchain_pg_embedding`. Specifying a `collection_name` that does not exist will add a row to `langchain_pg_collection`, and the vectors will go into `langchain_pg_embedding` with a column (`collection_id`) that specifies the collection. The embedding table has a vector column. If the type is specified without size info (e.g. `vector`) then you can add vectors of different sizes in different collections. If it is specified with size info (e.g. `vector(384)`) then you can create an index, but you can't add vectors of different sizes.
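For reference, the layout I'm describing looks roughly like this (a sketch of the two tables, not the exact DDL that SQLAlchemy generates, so column details may differ slightly):

```sql
-- Rough sketch of the current two-table layout.
CREATE TABLE langchain_pg_collection (
    uuid UUID PRIMARY KEY,
    name VARCHAR,          -- the collection_name passed to the vectorstore
    cmetadata JSONB
);

CREATE TABLE langchain_pg_embedding (
    id VARCHAR PRIMARY KEY,
    collection_id UUID REFERENCES langchain_pg_collection (uuid),
    embedding VECTOR,      -- no size specified => mixed sizes allowed, but no HNSW index
    document VARCHAR,
    cmetadata JSONB
);
```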
I might be misunderstanding something, but I think that if the collections were held in different tables, each with their own embedding column, then you could have different sized indexed vectors in different collections.
and by index I mean (for example) an HNSW index https://github.com/pgvector/pgvector?tab=readme-ov-file#indexing
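To make the idea concrete, here's a sketch of what a per-collection table could look like (the table name and the 384 dimension are made up for a hypothetical collection):

```sql
-- Hypothetical per-collection table: the dimension is fixed for this collection only,
-- so an HNSW index can be created without constraining other collections.
CREATE TABLE langchain_pg_embedding_my_docs (
    id VARCHAR PRIMARY KEY,
    embedding VECTOR(384),
    document VARCHAR,
    cmetadata JSONB
);

CREATE INDEX ON langchain_pg_embedding_my_docs
    USING hnsw (embedding vector_l2_ops);
```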
I'm running into this issue as well.
As I see it we could:
(1) create new embedding tables per `collection_name`, as galtay-tempus suggested, or
(2) create different embedding stores based on the `embedding_length` specified. For example, if no `embedding_length` is given, then `langchain_pg_embedding` is created and referenced for this collection. But if `embedding_length` is given, the value is appended to the embedding store table name, e.g. `langchain_pg_embedding_1536`, `langchain_pg_embedding_384`, etc.
I'm honestly not sure which one is better.
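For what it's worth, a rough sketch of what (2) could look like at the table level (just following the naming convention above, not an actual implementation):

```sql
-- Hypothetical per-dimension table shared by every collection with 384-dim vectors.
CREATE TABLE langchain_pg_embedding_384 (
    id VARCHAR PRIMARY KEY,
    collection_id UUID REFERENCES langchain_pg_collection (uuid),
    embedding VECTOR(384),
    document VARCHAR,
    cmetadata JSONB
);

-- Since the size is fixed for the whole table, a single HNSW index covers it.
CREATE INDEX ON langchain_pg_embedding_384
    USING hnsw (embedding vector_l2_ops);
```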
According to the pgvector FAQ, you can index columns with multiple vector lengths; however, each index must consist of rows with the same vector length.
However, you can only create indexes on rows with the same number of dimensions (using expression and partial indexing):
CREATE INDEX ON embeddings USING hnsw ((embedding::vector(3)) vector_l2_ops) WHERE (model_id = 123);
and query with: SELECT * FROM embeddings WHERE model_id = 123 ORDER BY embedding::vector(3) <-> '[3,1,2]' LIMIT 5;
So, alternatively this constraint could be eliminated if, when creating indexes, we did ... WHERE collection_id = <collection uuid>
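Something like this, I think (the UUID and the 384 dimension are placeholders for a real collection):

```sql
-- Partial HNSW index covering only one collection, cast to that collection's known size.
CREATE INDEX ON langchain_pg_embedding
    USING hnsw ((embedding::vector(384)) vector_l2_ops)
    WHERE collection_id = '11111111-1111-1111-1111-111111111111';

-- Queries need the same cast and the same filter for the planner to use the index.
SELECT * FROM langchain_pg_embedding
WHERE collection_id = '11111111-1111-1111-1111-111111111111'
ORDER BY embedding::vector(384) <-> '[0.1, 0.2, ...]'  -- query vector abbreviated
LIMIT 5;
```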
oh, nice find @patdmob, I'll give that a try!
that does work, good call.
it turns out I'm still getting high latency, but now I think it's in the way the metadata is indexed (collection_ids and/or metadata in the jsonb column) ... I have queries that typically have a metadata filter targeting a small fraction of all embeddings in the table, but it doesn't appear that jsonb_path_ops is helping (which wasn't there before IIRC). Previously I manually altered the metadata column to be jsonb and indexed a particular value in the metadata, and that was really fast ... going to investigate some more ... a few things have changed between the older "non jsonb" implementations and this current one.
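For context, the two indexing approaches I'm comparing look roughly like this (assuming the jsonb column is cmetadata and using a hypothetical metadata key called source):

```sql
-- Generic GIN index with jsonb_path_ops: supports @> containment filters,
-- not arbitrary key lookups.
CREATE INDEX ix_cmetadata_gin
    ON langchain_pg_embedding
    USING gin (cmetadata jsonb_path_ops);

-- Targeted expression index on a single metadata key (what I had set up manually before).
CREATE INDEX ix_cmetadata_source
    ON langchain_pg_embedding ((cmetadata ->> 'source'));

-- The expression index is only used by queries filtering on that exact expression, e.g.
--   WHERE cmetadata ->> 'source' = 'some-value'
```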
created a new issue for the metadata filtering issue https://github.com/langchain-ai/langchain-postgres/issues/34
I hit the same problem, and my current solution is to specify the vector size (along with language for full-text search) and distance for the HNSW index at the schema level.
So, if I need a different vector size, language, or distance, I create a new schema with the relevant langchain tables and indices (langchain_pg_collection, langchain_pg_embedding, etc.).
It works well, so far.
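Concretely it's just standard Postgres schemas, along these lines (the schema name is an example):

```sql
-- One schema per (vector size / language / distance) combination.
CREATE SCHEMA embeddings_384;

-- Point the session (or the connection's search_path) at that schema before creating
-- the langchain tables, so langchain_pg_collection / langchain_pg_embedding and their
-- indexes land inside it.
SET search_path TO embeddings_384;
```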
Curious if anyone was thinking about re-writing the table schemas such that collections map to different tables (and would therefore support indexing different embedding vector sizes)?
https://github.com/langchain-ai/langchain-postgres/blob/main/langchain_postgres/vectorstores.py#L258C1-L261C49