kuzudb / kuzu

Embeddable property graph database management system built for query speed and scalability. Implements Cypher.
https://kuzudb.com/
MIT License

Binder exception: Cannot retrieve child type of type ANY. LIST or ARRAY is expected. #3507

Closed andriichumak closed 1 month ago

andriichumak commented 3 months ago

Hi team. I'm trying to use Kuzu for semantic search using array_cosine_similarity and can't get it working. This looks like a bug to me.

Here is the minimal repro:

import kuzu
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/all-MiniLM-L12-v2')
embedded = model.encode(["test"])[0].tolist()  # list of 384 float numbers

db = kuzu.Database("./demo_db")
conn = kuzu.Connection(db)

conn.execute("CREATE NODE TABLE MyNode(id STRING, embedding DOUBLE[384], PRIMARY KEY(id))")
conn.execute("CREATE (d:MyNode {id: 'test', embedding: $emb})", {"emb": embedded})
response = conn.execute(
    """
    MATCH (d:MyNode)
    RETURN d.id, array_cosine_similarity(d.embedding, $emb)
    """, {"emb": embedded}
)

# Here I get exception
# RuntimeError: Binder exception: Cannot retrieve child type of type ANY. LIST or ARRAY is expected.

Kuzu 0.4.2, Python 3.11.9

prrao87 commented 3 months ago

Hi @andriichumak, yup, it seems like we have to figure out the right way to promote types in this very common scenario for embeddings. Here's the sequence of steps leading to this issue:

Note that in Kùzu, ARRAY is a special case of LIST - the only difference between a LIST and an ARRAY in Kùzu from a user perspective is that the ARRAY has a fixed length that's known beforehand. When the Python list of floats is passed to Kùzu, it's cast to a LIST (which is the correct behaviour for reasons of generality) because Python lists are dynamic in nature and their lengths cannot be assumed to be always fixed.

The workaround here is to perform explicit casting of the embedded variable to the type DOUBLE[384], which transforms the LIST to an ARRAY, and then it works:

# Slightly rephrase the MATCH query
response = conn.execute(
    """
    MATCH (d:MyNode)
    WITH d, CAST($emb, "DOUBLE[384]") AS emb
    RETURN d.id, array_cosine_similarity(d.embedding, emb)
    """, {"emb": embedded}
)

Result:

┌──────┬─────────────────────────────────┐
│ d.id ┆ ARRAY_COSINE_SIMILARITY(d.embe… │
│ ---  ┆ ---                             │
│ str  ┆ f64                             │
╞══════╪═════════════════════════════════╡
│ test ┆ 1.0                             │
└──────┴─────────────────────────────────┘
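
If the embedding dimension differs between models, the cast target in the workaround above could be derived from the vector's length instead of hard-coding 384. A minimal sketch, assuming the table and property names from the repro; `similarity_query` is a hypothetical helper (not part of the Kùzu API) and it only builds the query string:

```python
def similarity_query(embedding, table="MyNode", prop="embedding"):
    """Build the workaround query with a cast sized to the embedding.

    Hypothetical helper: it only formats a string and never talks to Kùzu.
    """
    dim = len(embedding)
    if dim == 0:
        raise ValueError("embedding must not be empty")
    return (
        f"MATCH (d:{table})\n"
        f'WITH d, CAST($emb, "DOUBLE[{dim}]") AS emb\n'
        f"RETURN d.id, array_cosine_similarity(d.{prop}, emb)"
    )


# Example: a 384-dimensional placeholder vector, as in the repro.
query = similarity_query([0.0] * 384)
print(query)
```

The resulting string would then be passed to `conn.execute(query, {"emb": embedded})` in the same way as the query above. The dimension still has to match the column's declared size (384 in this thread).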

Potential Improvements

I think this usage pattern for embeddings is incredibly common, though, and it's not ideal that the user has to cast explicitly like this. The explicit casting syntax is also a little hard to remember for users new to Kùzu. Maybe we could make better assumptions, given that users will bring in embeddings from Python libraries like sentence-transformers, which are guaranteed to return a fixed-length list for an embedding. Could it be considered "safe" for us to promote the LIST type to ARRAY inside the array_cosine_similarity function?

@andyfengHKU, I think we need to put a bit more thought into this, as I faced a similar issue (as no doubt others will) in #3481.

andriichumak commented 3 months ago

Hey @prrao87. Thanks a lot for the quick feedback. The suggested solution works.

One thing I noticed is that if I define the column type as LIST instead of a fixed-length array (i.e. DOUBLE[] instead of DOUBLE[384]), it still fails with the same error. It looks like the issue is not that the final execution argument is a list, but rather that it's considered to have the type ANY for some reason.

UPD: OK, I missed the part that the array similarity function does not work on lists. Still, the error message is suspicious. I'd assume both ARRAY and LIST should be fine, and the issue is that the typing is lost somewhere along the way.

prrao87 commented 3 months ago

Yup, fully agree. There's something about the behaviour that we need to change internally to make this easier, because most users bring embeddings into Kùzu from numpy/python. @andyfengHKU will have some ideas on this. Thanks for reporting!

Update:

> I'd assume both ARRAY and LIST should be fine, and the issue is that the typing is lost somewhere along the way.

Well, similarity calculation only works on same-size lists, so at least one of the two lists being compared must be an array. However, the fact that it assumes the internals of the LIST are of type ANY is way too broad (to capture all possibilities), so we need to think about how to cast types internally for this use case without breaking other things.
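
The same-size requirement is easy to see in a plain-Python version of the computation. This is only an illustrative sketch of cosine similarity, not Kùzu's implementation, and `cosine_similarity` is a hypothetical name:

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length numeric sequences."""
    if len(a) != len(b):
        # The dot product pairs elements by position, so lengths must match.
        raise ValueError(f"length mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

With a fixed-length ARRAY the length check can be settled once from the declared type, whereas a variable-length LIST only reveals its length per value.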