Discussion about the redis vector DB index algorithm, changed from HNSW to FLAT

opea-project / GenAIComps

GenAI components at micro-service level; GenAI service composer to create mega-service

Apache License 2.0

67 stars 133 forks source link

Discussion about the redis vector DB index algorithm, changed from HNSW to FLAT #840

Open gavinlichn opened 5 days ago

gavinlichn commented 5 days ago

Aware that dataprep/redis/langchain vector DB index algorithm is FLAT. But remembered we use HNSW before.

Investigating the code, it caused by the removing of index_schema, changed with PR #347 If index_schema removed, redis fall back to default index algorith(FLAT)

Considering that may impact the performance.

Can you help to clarify the background of this change please? @Spycsh

Spycsh commented 5 days ago

Hi @gavinlichn , The PR is to remove the hard length limitation and make the vecdb initialization more simple. I do not think explicitly initializing redis with the schema 768 or 1024 is a concise way. With that PR, if users use BGE base, the redis can automatically accept embedding with length 768, otherwise if users use BGE large, the redis can automatically accept embedding with length 1024. Users do not need to know/change the schema length if they use another embedding model.

However as you said, the schema also contains a non-default index algorithm. Could you please give us some data or reason about how is HNSW faster/better than the default ones? Or can we pass a parameter to Redis to change the default indexing algorithm, which I think is more simple?

gavinlichn commented 5 days ago

As the intent of PR is to remove the length limitation. prefer to keep the change clean, and not touch more logic. How about remove the dimension only, and keep other parameters in the previous schema?

For the algorithm comparison, we can refer to Redis' document:

Choose the HNSW index type when you have larger datasets (> 1M documents) or when search performance and scalability are more important than perfect search accuracy
Choose the FLAT index when you have small datasets (< 1M vectors) or when perfect search accuracy is more important than search latency.

https://redis.io/docs/latest/develop/interact/search-and-query/advanced-concepts/vectors/#hnsw-index

Spycsh commented 5 days ago

I agree that let users pass a customized schema is reasonable. I think keep the default behaviors (FLAT, accuracy first) now and allow advanced users to set a schema (maybe set the None as default i.e. REDIS_SCHEMA = os.getenv("REDIS_SCHEMA", None) and users can set REDIS_SCHEMA to the path to their own redis yml). What do you think? Do you want to make a PR for this?