neuml / txtai

💡 All-in-one open-source embeddings database for semantic search, LLM orchestration and language model workflows
https://neuml.github.io/txtai
Apache License 2.0

Hybrid search slows down upsert operation #601

Open akset2X opened 9 months ago

akset2X commented 9 months ago

I found that deletes are not fully applied on a hybrid index. Deleting embeddings by "id" only removes them from the embeddings list; the "scoring" and "scoring.terms" files created alongside it remain untouched. I suspect this is what makes /add and /upsert so slow. This is my configuration:

path: ./index
writable: True

embeddings:
  path: sentence-transformers/all-MiniLM-L6-v2
  content: True
  hybrid: True

The /count API reports more than 30k embeddings. Even though search is faster with the hybrid index, /upsert becomes very slow as the data grows. I thought deleting some data would help speed up upsert, but deleting embeddings doesn't reduce the size of the "scoring" and "scoring.terms" files in the index directory. Is there any way to access files such as documents, scoring.terms and scoring and safely delete entries from them? How can I speed up upsert as the number of documents increases?
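For anyone who wants to reproduce the file size observation, here is a minimal sketch using only the Python standard library. It assumes the index has been written to disk at ./index as in the configuration above. The helper names `index_file_sizes` and `size_changes` are hypothetical and are not part of txtai.

```python
from pathlib import Path


def index_file_sizes(index_dir):
    """Return {relative file name: size in bytes} for every file under index_dir."""
    root = Path(index_dir)
    return {
        p.relative_to(root).as_posix(): p.stat().st_size
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def size_changes(before, after):
    """Return {file name: byte difference} for files whose size differs between two snapshots."""
    changes = {}
    for name in sorted(set(before) | set(after)):
        # A file missing from one snapshot counts as size 0 in that snapshot.
        delta = after.get(name, 0) - before.get(name, 0)
        if delta != 0:
            changes[name] = delta
    return changes
```

Take one snapshot with `index_file_sizes("./index")` before the delete and another after the index has been written back to disk, then pass both to `size_changes`. If "scoring" and "scoring.terms" are missing from the result, their sizes did not change.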

davidmezzetti commented 9 months ago

Thank you for the write-up. I'll take a look and report back.