
Potential solutions for being able to index different vector sizes? #26

Open galtay-tempus opened 7 months ago

galtay-tempus commented 7 months ago

Curious if anyone was thinking about rewriting the table schemas such that collections map to different tables (and would therefore support indexing different embedding vector sizes)?

  embedding_length: The length of the embedding vector. (default: None)
      NOTE: This is not mandatory. Defining it will prevent vectors of
      any other size from being added to the embeddings table but, without it,
      the embeddings can't be indexed.

https://github.com/langchain-ai/langchain-postgres/blob/main/langchain_postgres/vectorstores.py#L258C1-L261C49

eyurtsev commented 7 months ago

Does "collection_name" work? https://github.com/langchain-ai/langchain-postgres/blob/main/langchain_postgres/vectorstores.py#L262

galtay-tempus commented 7 months ago

I think the issue is that the current setup creates two tables.

Specifying a collection_name that does not exist will add a row to langchain_pg_collection, and vectors will go into langchain_pg_embedding with a column (collection_id) that specifies the collection. The embedding table has a vector column. If the type is specified without size info (e.g. vector), then you can add vectors of different sizes in different collections. If it is specified with size info (e.g. vector(384)), then you can create an index but you can't add vectors of different sizes.
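Roughly, the current layout looks like this (a simplified sketch based on vectorstores.py; the exact column types are an approximation):

    CREATE EXTENSION IF NOT EXISTS vector;

    CREATE TABLE langchain_pg_collection (
        uuid UUID PRIMARY KEY,
        name VARCHAR,
        cmetadata JSONB
    );

    CREATE TABLE langchain_pg_embedding (
        id VARCHAR PRIMARY KEY,
        collection_id UUID REFERENCES langchain_pg_collection (uuid),
        embedding VECTOR,   -- no declared dimension: mixed sizes OK, but not indexable
        document VARCHAR,
        cmetadata JSONB
    );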

I might be misunderstanding something, but I think that if the collections were held in different tables, each with its own embedding column, then you could have different-sized indexed vectors in different collections.

And by index I mean, for example, an HNSW index: https://github.com/pgvector/pgvector?tab=readme-ov-file#indexing
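i.e. something like this (a sketch; pgvector only allows it once the column has a declared dimension such as vector(384)):

    -- fails on a dimensionless vector column; works on a typed one
    CREATE INDEX ON langchain_pg_embedding
        USING hnsw (embedding vector_cosine_ops);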

patdmob commented 7 months ago

I'm running into this issue as well.

As I see it we could:

1. Create new embedding tables per collection_name, as galtay-tempus suggested.
2. Create different embedding stores based on the embedding_length specified. For example, if no embedding_length is given, then langchain_pg_embedding is created and referenced for the collection; but if embedding_length is given, the value is appended to the embedding-store table name, e.g. langchain_pg_embedding_1536, langchain_pg_embedding_384, etc. (sketched below).
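For concreteness, option (2) might look something like this (hypothetical table name; columns mirror the sketch above):

    -- hypothetical: one typed embedding table per declared length
    CREATE TABLE langchain_pg_embedding_384 (
        id VARCHAR PRIMARY KEY,
        collection_id UUID REFERENCES langchain_pg_collection (uuid),
        embedding VECTOR(384),   -- typed, so an HNSW index is allowed
        document VARCHAR,
        cmetadata JSONB
    );

    CREATE INDEX ON langchain_pg_embedding_384
        USING hnsw (embedding vector_cosine_ops);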

I'm honestly not sure which one is better.

patdmob commented 7 months ago

According to the pgvector FAQ, you can index columns containing multiple vector lengths; however, each index must consist of rows with the same vector length.

However, you can only create indexes on rows with the same number of dimensions (using expression and partial indexing):

    CREATE INDEX ON embeddings USING hnsw ((embedding::vector(3)) vector_l2_ops) WHERE (model_id = 123);

and query with:

    SELECT * FROM embeddings WHERE model_id = 123 ORDER BY embedding::vector(3) <-> '[3,1,2]' LIMIT 5;

So, alternatively, this constraint could be eliminated if, when creating indexes, we did ... WHERE collection_id = <collection uuid>
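Applied to this repo's table, that would be something like this (a sketch; the dimension and the uuid are placeholders for a real collection's values):

    -- one partial HNSW index per collection, cast to that collection's dimension
    CREATE INDEX ON langchain_pg_embedding
        USING hnsw ((embedding::vector(384)) vector_l2_ops)
        WHERE collection_id = '00000000-0000-0000-0000-000000000000';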

galtay-tempus commented 7 months ago

Oh, nice find @patdmob, I'll give that a try!

galtay-tempus commented 7 months ago

That does work, good call.

It turns out I'm still getting high latency, but now I think it's in the way the metadata is indexed (collection_ids and/or the metadata in the JSONB column). My queries typically have a metadata filter that targets a small fraction of all embeddings in the table, but it doesn't appear that the jsonb_path_ops index is helping (which wasn't there before, IIRC). Previously I manually altered the metadata column to be jsonb and indexed a particular value in the metadata, and that was really fast. Going to investigate some more; a few things have changed between the older "non-jsonb" implementations and this current one.
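For reference, the two approaches being compared (a sketch; cmetadata is the metadata column here, and 'source' stands in for whatever key the filter targets):

    -- GIN index with jsonb_path_ops, which accelerates containment queries (@>)
    CREATE INDEX ON langchain_pg_embedding
        USING gin (cmetadata jsonb_path_ops);

    -- expression index on one specific metadata key (the older manual approach)
    CREATE INDEX ON langchain_pg_embedding ((cmetadata ->> 'source'));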

galtay-tempus commented 7 months ago

Created a new issue for the metadata filtering problem: https://github.com/langchain-ai/langchain-postgres/issues/34

magaton commented 5 months ago

I hit the same problem, and my current solution is to fix the vector size (along with the language for full-text search, and the distance function for the HNSW index) at the schema level.

So, if I need a different vector size, language, or distance function, I create a new schema with the relevant LangChain tables and indexes (langchain_pg_collection, langchain_pg_embedding, etc.), as sketched below.
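A sketch of that setup (the schema name is hypothetical; pointing the application at it via the connection's search_path is one way to route writes there):

    -- hypothetical: one schema per (vector size, language, distance) combination
    CREATE SCHEMA IF NOT EXISTS vectors_384_cosine;
    SET search_path TO vectors_384_cosine, public;
    -- create the langchain tables in this schema, then index with the
    -- distance function chosen for this configuration:
    CREATE INDEX ON vectors_384_cosine.langchain_pg_embedding
        USING hnsw (embedding vector_cosine_ops);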

It works well so far.