langchain-ai / langchain-postgres

LangChain abstractions backed by Postgres Backend
MIT License
133 stars 48 forks source link

Rows returned from similarity_search with filter less than expected #93

Open CrazyShipOne opened 4 months ago

CrazyShipOne commented 4 months ago

Hi, thanks in advance for this amazing project!

The problem I'm facing is rows returned from function similarity_search always less than expected. For example, I have 100 rows match filter's condition in table _langchain_pgembedding, by executing

vector_store = PGVector(
            connection=db_str,
            embeddings = embedding_model,
            distance_strategy=DistanceStrategy.EUCLIDEAN
        )
params = {'query':'query_context', 'k':100, 'filter':{'user_name':'user_1'}}
docs = self._vector_store.similarity_search(**params)

I expected length of docs should be 100. However, per different query_context, length of docs are various but less than 100 all the time, like 43 or 67.

So I enabled log for sqlchemy.engine, and got SQL it generated here:

2024-07-25 20:21:47,269 - sqlalchemy.engine.Engine - INFO - SELECT langchain_pg_embedding.id AS langchain_pg_embedding_id, langchain_pg_embedding.collection_id AS langchain_pg_embedding_collection_id, langchain_pg_embedding.embedding AS langchain_pg_embedding_embedding, langchain_pg_embedding.document AS langchain_pg_embedding_document, langchain_pg_embedding.cmetadata AS langchain_pg_embedding_cmetadata, langchain_pg_embedding.embedding <-> %(embedding_1)s AS distance
FROM langchain_pg_embedding JOIN langchain_pg_collection ON langchain_pg_embedding.collection_id = langchain_pg_collection.uuid
WHERE langchain_pg_embedding.collection_id = %(collection_id_1)s::UUID AND jsonb_path_match(langchain_pg_embedding.cmetadata, CAST(%(param_1)s AS JSONPATH), CAST(%(param_2)s::JSONB AS JSONB)) ORDER BY distance ASC
 LIMIT %(param_3)s::INTEGER
2024-07-25 20:21:47,269 - sqlalchemy.engine.Engine - INFO - [generated in 0.00126s] {'embedding_1': '[0.002203210722655058,-0.0034537874162197113,-0.04540061205625534,0.016271447762846947,0.03849130868911743,0.035241544246673584,0.013832494616508484, ... (21402 characters truncated) ... .03421637415885925,-0.03275654464960098,-0.0023437547497451305,0.0021014492958784103,-0.01060000341385603,0.008099209517240524,0.0024598073214292526]', 'collection_id_1': UUID('24b25f49-4e90-4073-9b4b-f5fd91b83c61'), 'param_1': '$.user_name == $value', 'param_2': Jsonb({'value': 'user_1'}), 'param_3': 100}

After replace parameters into sql and execute it in pgAdmin, 100 rows are returned.

Then I dig into code, tried to comment filter in sqlalchemy query vectorstores.py

results: List[Any] = (
                session.query(
                    self.EmbeddingStore,
                    self.distance_strategy(embedding).label("distance"),
                )
                #This line is commented .filter(*filter_by) 
                .order_by(sqlalchemy.asc("distance"))
                .join(
                    self.CollectionStore,
                    self.EmbeddingStore.collection_id == self.CollectionStore.uuid,
                )
                .limit(k)
                .all()
            )

All 100 rows are returned.

I'm not familiar with sqlalchemy so no clues at all. Could you please take a look on what's happening here? Am I using the search function in wrong way?

Thanks!

eyurtsev commented 4 months ago

Could you share which query you executed in SQL?