kuzudb / kuzu

Embeddable property graph database management system built for query speed and scalability. Implements Cypher.
https://kuzudb.com/
MIT License
1.38k stars, 97 forks

Feature Request: Vector Operations #3068

Closed: Btibert3 closed this issue 7 months ago

Btibert3 commented 7 months ago

I just came across this project, and wow, I am impressed. Currently, Neo4j supports vector operations, specifically similarity calculations. It would be great if we could extend the concept of fixed-length lists to support similarity operations. Maybe that support already exists and I am overlooking how to achieve this with your stack, but this would be a great feature to help support retrieval-augmented generation (RAG) workloads.

semihsalihoglu-uw commented 7 months ago

Thanks for raising this. I agree that we should support some of the common functions. We can follow DuckDB's array functions: https://duckdb.org/docs/sql/data_types/array.html#functions. Arrays in DuckDB are equivalent to our fixed-length list type, so I don't think there is a hurdle in supporting these.

I'll put this into our pipeline.

prrao87 commented 7 months ago

Hi @Btibert3, that makes a lot of sense. Could you elaborate a bit on the intended use case in the context of RAG? From your perspective, what would the ideal implementation look like in a graph query?

Btibert3 commented 7 months ago

@semihsalihoglu-uw DuckDB is exactly what I had in my head.

@prrao87 Naive RAG lets you find entries (i.e. nodes) based on the similarity of the vector to the input query. We can go beyond this by further restricting the results by leveraging graph patterns. One example might be to show a list of products based on the user's input query (vector similarity) but further restrict/re-rank the results based on products the user hasn't purchased and behavior of other "similar" users, where similarity in this context is leveraging graph relationships. In this example, the results come from vector-based similarity and graph relationships.
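A minimal sketch of the product example, in plain Python rather than Cypher, to make the two signals concrete. Everything here is hypothetical: the `recommend` function, the toy data, the rule that "similar" users are those who share at least one purchase, and the choice to rank by peer purchases first and cosine similarity second. It is not how Kùzu or Neo4j would execute such a query.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def recommend(query_vec, product_vecs, purchases, user, k=2):
    """Rank unpurchased products by peer purchases, then by vector similarity."""
    owned = purchases.get(user, set())
    # Assumption: a "similar" user is anyone who shares at least one purchase.
    peers = [u for u, items in purchases.items() if u != user and items & owned]
    scored = []
    for product, vec in product_vecs.items():
        if product in owned:
            continue  # graph-based restriction: skip what the user already bought
        peer_count = sum(product in purchases[u] for u in peers)
        scored.append((peer_count, cosine_similarity(query_vec, vec), product))
    # Re-rank: graph signal first, vector similarity as the tie-breaker.
    scored.sort(key=lambda t: (t[0], t[1]), reverse=True)
    return [product for _, _, product in scored[:k]]


# Toy data: product embeddings and a user -> purchased-products mapping.
product_vecs = {
    "tent": [1.0, 0.0],
    "stove": [0.9, 0.1],
    "lamp": [0.8, 0.2],
    "novel": [0.0, 1.0],
}
purchases = {
    "ann": {"tent"},
    "bob": {"tent", "lamp"},
    "cat": {"novel", "stove"},
}

top = recommend([1.0, 0.0], product_vecs, purchases, user="ann")
```

In a graph database, the filtering and peer lookup would be expressed as pattern matches, with the similarity function applied to a vector property on the product nodes.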

Another example would be to retrieve the most similar document chunk via vector search, and then improve the context window by pulling in linked nodes through variable pattern matching. Again, this uses vector similarity together with the structure of the relationships in the graph.
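The document-chunk example could be sketched roughly as follows, again in plain Python. The `retrieve_with_context` function, the `links` adjacency mapping, and the `hops` parameter are illustrative assumptions standing in for a variable-length pattern match in a graph query.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_with_context(query_vec, chunk_vecs, links, hops=1):
    """Find the best-matching chunk, then add chunks reachable within `hops` links."""
    # Step 1: vector search picks the single most similar chunk.
    best = max(chunk_vecs, key=lambda c: cosine_similarity(query_vec, chunk_vecs[c]))
    # Step 2: follow outgoing links up to `hops` steps to widen the context.
    context = {best}
    frontier = {best}
    for _ in range(hops):
        frontier = {n for c in frontier for n in links.get(c, ())} - context
        context |= frontier
    return best, context


# Toy data: chunk embeddings and directed links between chunks.
chunk_vecs = {"c1": [1.0, 0.0], "c2": [0.0, 1.0], "c3": [0.5, 0.5]}
links = {"c1": ["c2"], "c2": ["c3"]}

best, context = retrieve_with_context([1.0, 0.0], chunk_vecs, links, hops=1)
```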

acquamarin commented 7 months ago

@Btibert3 May I know which vector operations you are most interested in, so that we can prioritize implementing those?

Btibert3 commented 7 months ago

Sure thing.

In short, I believe that you can go pretty far with those three.

prrao87 commented 7 months ago

@acquamarin I'd start with cosine and then extend to Euclidean (L2) and then finally dot product, in that order. Cosine seems to be the most common metric used for similarity search in general.
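For reference, the three metrics named here can be written out in a few lines of plain Python. This is only a sketch of the standard definitions; the function names are illustrative and are not the names of any Kùzu or DuckDB built-in.

```python
import math


def dot_product(a, b):
    """Sum of element-wise products of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product divided by the product of the vector norms."""
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot_product(a, b) / (norm_a * norm_b)


def euclidean_distance(a, b):
    """L2 distance: square root of the summed squared differences."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

Cosine similarity ignores vector magnitude, while Euclidean distance and dot product do not. For vectors normalized to unit length, all three produce the same nearest-neighbor ranking.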

Btibert3 commented 7 months ago

If those are the two being considered out of the gate, I completely agree with cosine.

Btibert3 commented 7 months ago

Wow! Very impressed.

hpvd commented 7 months ago

One week from request to implementation? Just unbelievable :-D Many thanks for your work!