langchain-ai / langchain-postgres

LangChain abstractions backed by Postgres Backend
MIT License
66 stars 22 forks

metadata equality filter loss of performance #34

Open galtay-tempus opened 2 months ago

galtay-tempus commented 2 months ago

in previous versions of the langchain postgres implementation i was able to get sub-second latency on queries that filtered by a string id in the embedding metadata ... something like,

filter = {"some_id": "some_value"}

to do this i was converting the old json cmetadata column into jsonb and adding an index on that particular metadata item.

create index on langchain_pg_embedding((cmetadata->>'some_id'));

in the latest version (with jsonb=True) the latency has gone up by about a factor of 10. my initial assumption is that the (new) jsonb_path_ops index is somehow not being used, but I still need to investigate more.
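One way to check the index assumption above is to look at the query plan. The following is a minimal sketch, not part of langchain-postgres: `explain_statement` and `plan_uses_index` are hypothetical helper names, and the plan text is assumed to come from running the generated statement yourself (for example in psql). Note that EXPLAIN ANALYZE actually executes the query.

```python
from typing import Iterable

# Plan node names that Postgres prints when an index is involved.
INDEX_NODES = ("Index Scan", "Index Only Scan", "Bitmap Index Scan")


def explain_statement(query: str) -> str:
    """Wrap a SQL query so Postgres reports the plan it actually executed."""
    # Drop trailing whitespace and a trailing semicolon before wrapping.
    return f"EXPLAIN (ANALYZE, BUFFERS) {query.rstrip().rstrip(';')}"


def plan_uses_index(plan_lines: Iterable[str]) -> bool:
    """Return True if any line of EXPLAIN text output names an index scan node."""
    return any(node in line for line in plan_lines for node in INDEX_NODES)
```

If `plan_uses_index` returns False for the filtered query and the plan shows a `Seq Scan` on langchain_pg_embedding, that would be consistent with the index being skipped.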

galtay-tempus commented 2 months ago

to confirm the speed change i hacked in this command,

filter_clauses = self.EmbeddingStore.cmetadata["some_id"].astext == filter["some_id"]

here (https://github.com/langchain-ai/langchain-postgres/blob/main/langchain_postgres/vectorstores.py#L911)

which is significantly faster w/o an index than the jsonb_path_ops path appears to be. after adding an index like

create index on langchain_pg_embedding((cmetadata->>'some_id'));

query times are back well below 1s (most of the time around 0.1s) ... with the current implementation in the repo, query time with filters grows rapidly as the number of rows increases.

a quick fix might be a different code path if all the filter operators are simple equality?
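A rough sketch of how such a dispatch could look, assuming filters are plain dicts and that operators are spelled with a leading `$` (for example `$eq`, `$and`). `simple_equality_pairs` is a hypothetical helper, not an existing function in the repo, and it only accepts string values, since that is the case described in this issue.

```python
from typing import Any, Dict, List, Optional, Tuple


def simple_equality_pairs(
    filter: Optional[Dict[str, Any]],
) -> Optional[List[Tuple[str, str]]]:
    """Return (field, value) pairs if every condition is string equality, else None."""
    if not filter:
        return None
    pairs: List[Tuple[str, str]] = []
    for field, condition in filter.items():
        if field.startswith("$"):
            return None  # top-level logical operator (assumed spelling, e.g. $and)
        if isinstance(condition, dict):
            if set(condition) != {"$eq"}:
                return None  # any operator other than a lone $eq
            condition = condition["$eq"]
        if not isinstance(condition, str):
            return None  # non-string values stay on the general code path
        pairs.append((field, condition))
    return pairs
```

When this returns a list, each pair could be turned into a clause like the `cmetadata[field].astext == value` expression above, so that an expression index on `cmetadata->>'some_id'` can apply. When it returns None, the existing general filter path would be used unchanged.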