kuzudb / kuzu

Embeddable property graph database management system built for query speed and scalability. Implements Cypher.
https://kuzudb.com/
MIT License
1.4k stars 99 forks

Feature: Inbuilt vector index #3778

Open prrao87 opened 4 months ago

prrao87 commented 4 months ago

API

Other

Description

A common use case for graph databases these days (especially in GraphRAG) is to combine vector search with graph traversal to provide more relevant results for a given query. This issue tracks the development of a native vector index in Kùzu, so that users can build vector indices directly in the database and scale up vector search on larger graphs.
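To make the "vector search plus graph traversal" pattern concrete, here is a minimal pure-Python sketch. It does not use any Kùzu API: the `embeddings` and `edges` dictionaries, the node names, and the helper functions are all hypothetical stand-ins, and the brute-force scan is exactly the step a real vector index would replace.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, embeddings, k):
    """Brute-force nearest neighbours: score every node, keep the best k."""
    ranked = sorted(
        embeddings,
        key=lambda node: cosine_similarity(query, embeddings[node]),
        reverse=True,
    )
    return ranked[:k]

def expand(seeds, edges, hops=1):
    """Collect the seeds plus every node reachable within `hops` edges."""
    seen = set(seeds)
    frontier = set(seeds)
    for _ in range(hops):
        frontier = {n for s in frontier for n in edges.get(s, ())} - seen
        seen |= frontier
    return seen

# Hypothetical toy data: node -> embedding, node -> outgoing neighbours.
embeddings = {
    "doc_a": [1.0, 0.0],
    "doc_b": [0.9, 0.1],
    "doc_c": [0.0, 1.0],
    "doc_d": [-1.0, 0.0],
}
edges = {"doc_a": ["doc_c"], "doc_b": ["doc_a"], "doc_c": [], "doc_d": []}

seeds = top_k([1.0, 0.05], embeddings, k=2)   # step 1: vector search
context = expand(seeds, edges, hops=1)        # step 2: graph traversal
```

The scan in `top_k` is linear in the number of nodes, which is why it stops scaling on larger graphs and why an index is being requested.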

mrdrprofuroboros commented 2 months ago

Hi! Is a vector index somewhere on the roadmap? I'm looking for an embeddable alternative to falkor/neo4j for GraphRAG, and kuzu seems like a perfect candidate. I can use it with something like sqlite-vec / lancedb / faiss / annoy, but having a vector index directly in kuzu would be an order of magnitude more convenient.

prrao87 commented 2 months ago

@mrdrprofuroboros this is something that's in the research phase, and we intend to look at it more in due course once other core database features, which are a higher priority, get in. Just out of curiosity, do you intend to use only dense vector search along with graph search, or is hybrid search (incorporating full text search + dense vector search) also relevant? A dedicated vector database is likely far more performant and feature-rich, with a lot more usability features than we can provide as a graph database. What are your thoughts on this?

semihsalihoglu-uw commented 2 months ago

Hi @mrdrprofuroboros, just to elaborate a bit more: no one is working on this yet, but we have discussed supporting an HNSW-like index in the medium term. We are slowly understanding the space a bit better, but my current feeling is that we will put this into our roadmap as part of a push we will likely make to enhance Kuzu with indices. We don't yet have any indices other than a default hash index on the nodes. I think we will start discussing indices more seriously in Q4 of this year, and we can consider a vector index as one of the first indices to implement, depending on the demand we see.

mrdrprofuroboros commented 2 months ago

I see! We’re still experimenting with hybrid search (dense vector + full text), but it feels like we’d be fine with just a decent vector search. I’m designing a system for an embedded device + remote server, where the source of truth is on the device, but the server has a synced copy as a cache and can do more complicated retrieval. The device still has to be operational without the server cache. So far we’ve chosen to use neo4j on the server (FalkorDB turned out to have only Euclidean similarity for their vector search, which is a very weird decision), but the landscape for embedded vector DBs is quite lame: lancedb is in rust (too complicated for something like zephyros), annoy is on disk but can’t update the index on the fly, faiss is all in memory and requires saving and loading the full index separately, sqlite-vec is pure c and on disk but doesn’t have an index lol. There are others of course, but none seem to be on disk + c/c++ + vector index + support index updates. And overall, having 2 dbms on one resource-constrained device is kinda wasteful.
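A side note on Euclidean versus cosine similarity: for vectors scaled to unit length, squared Euclidean distance equals 2 - 2·cos, so both metrics rank neighbours identically. The sketch below checks this on made-up data (the vectors and names are hypothetical). It says nothing about how any particular database implements its index, and pre-normalizing embeddings is an assumption about what the application can afford to do.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical query and stored vectors with very different magnitudes.
query = [1.0, 2.0, 0.5]
vectors = {
    "a": [10.0, 19.0, 6.0],
    "b": [0.1, 0.1, 2.0],
    "c": [1.0, -1.0, 0.0],
    "d": [2.0, 1.0, 0.5],
}

# Ranking by cosine similarity (highest first).
by_cosine = sorted(vectors, key=lambda n: cosine(query, vectors[n]), reverse=True)

# Ranking by Euclidean distance on the raw vectors (lowest first).
by_euclid_raw = sorted(vectors, key=lambda n: euclidean(query, vectors[n]))

# Ranking by Euclidean distance after normalizing everything to unit length.
unit_query = normalize(query)
by_euclid_unit = sorted(
    vectors, key=lambda n: euclidean(unit_query, normalize(vectors[n]))
)

print(by_cosine, by_euclid_raw, by_euclid_unit)
```

On this data the raw Euclidean ranking differs from the cosine ranking because vector `a` is much longer than the query, while the normalized Euclidean ranking matches the cosine ranking.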

prrao87 commented 2 days ago

Just an update on this for those who are following this issue: @ray6080 and @semihsalihoglu-uw have been discussing this, and it's a high-priority item for our next phase of development. The Kùzu team has been looking to prioritize Graph RAG applications, and having a vector index as well as a full text search index is a key part of this.

The full text search index is much further along and will come out sooner, hopefully Jan 2025.