neo4j / neo4j-graphrag-python

Neo4j GraphRAG for Python
https://neo4j.com/docs/neo4j-graphrag-python/current/

On creation vector store and retriever filters #186

Closed letoribo closed 1 month ago

letoribo commented 1 month ago

When I build the store like this:

Neo4jVector.from_existing_graph(
    embedding = embeddings,
    index_name=VECTOR_INDEX_NAME,
    node_label=VECTOR_NODE_LABEL,
    text_node_properties=["subject", "snippet"],
    embedding_node_property=VECTOR_EMBEDDING_PROPERTY
)

It takes 5-6 seconds on 50 nodes; producing the same result with upsert_vector takes about 55-60 seconds.

This is what I did with upsert_vector:

query = """
MATCH (n:MailItem)
RETURN n.snippet AS snippet, n.subject AS subject, elementId(n) AS elementId
"""
# execute_query returns an EagerResult; the rows are in its .records attribute
records = driver.execute_query(query).records

def result_iterator(records):
    for node in records:
        snippet, subject, elementId = node.values()
        node_properties = f"{subject}\n{snippet}"
        query_vector = embedder.embed_query(node_properties)
        upsert_vector(
            driver,
            node_id=elementId,
            embedding_property="embedding",
            vector=query_vector,
        )

if len(records):
    result_iterator(records)
    return {"info": "Created Store"}
else:
    return {"info": "There are no emails in DB"}
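
The loop above makes one embedding call and one database write per node. As a minimal sketch of one way to reduce the number of database round trips, the writes can be grouped with UNWIND. The names `chunked`, `upsert_vectors_batched`, `BATCH_UPSERT_QUERY`, and `batch_size` are hypothetical and not part of neo4j-graphrag; `embed` stands for any callable such as `embedder.embed_query`. Whether this explains the timing difference reported here is not established in this thread.

```python
from typing import Any, Callable, Dict, Iterator, List, Sequence

# Hypothetical batched write: one query per batch instead of one per node.
# db.create.setNodeVectorProperty requires a Neo4j version that provides it (5.13+).
BATCH_UPSERT_QUERY = """
UNWIND $rows AS row
MATCH (n) WHERE elementId(n) = row.id
WITH n, row
CALL db.create.setNodeVectorProperty(n, $embedding_property, row.vector)
RETURN count(n) AS updated
"""


def chunked(items: Sequence[Any], size: int) -> Iterator[Sequence[Any]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def upsert_vectors_batched(
    driver: Any,
    records: Sequence[Any],
    embed: Callable[[str], List[float]],
    embedding_property: str = "embedding",
    batch_size: int = 100,
) -> int:
    """Embed each record's text and write the vectors batch by batch.

    `records` must support key access for "subject", "snippet" and
    "elementId", as the rows returned by the MATCH query above do.
    Returns the number of rows sent to the database.
    """
    written = 0
    for batch in chunked(list(records), batch_size):
        rows: List[Dict[str, Any]] = [
            {
                "id": r["elementId"],
                # Embedding is still one call per node in this sketch
                "vector": embed(f"{r['subject']}\n{r['snippet']}"),
            }
            for r in batch
        ]
        # Keyword arguments are passed to the query as parameters
        driver.execute_query(
            BATCH_UPSERT_QUERY,
            rows=rows,
            embedding_property=embedding_property,
        )
        written += len(rows)
    return written
```

Usage would look like `upsert_vectors_batched(driver, records, embedder.embed_query)`. If the embedder also offers a batch method, calling it once per batch would reduce the embedding calls as well; that is an assumption about the embedder, not something shown in the thread.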

Also, regarding filters: I need to return nodes where:

retriever_config={
  "filters": {
    "subject": {"$ilike": query_text},
    "snippet": {"$ilike": query_text},
  }
},

the query text is in either the snippet or the subject.

In https://github.com/neo4j/neo4j-graphrag-python/blob/main/src/neo4j_graphrag/filters.py#L322, if I replace:

{OPERATOR_AND: [{k: v} for k, v in filter.items()]}, param_store, node_alias

with {OPERATOR_OR: [{k: v} for k, v in filter.items()]}, param_store, node_alias, it works as expected.

It might make sense to add a parameter that determines how the filter conditions are combined.

stellasia commented 1 month ago

Hi @letoribo ,

Regarding filters, "AND" is the default behavior. You can explicitly use an "OR" condition with:

"filters": {
    "$or": [
        {"subject": {"$ilike": query_text}},
        {"snippet": {"$ilike": query_text}},
    ]
  }

See some examples in the documentation.
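
To illustrate the difference between the implicit "AND" and an explicit "$or", here is a toy translator from a filter dict to a WHERE fragment. It is a simplified sketch for illustration only, not the code in neo4j_graphrag/filters.py: the names `build_where` and `ilike_clause` are hypothetical, only `$ilike`, `$or`, and `$and` are handled, and lowercasing the parameter value is an assumption of this sketch.

```python
from typing import Any, Dict


def ilike_clause(field: str, value: str, params: Dict[str, Any], alias: str = "node") -> str:
    """Render one case-insensitive CONTAINS condition and register its parameter."""
    name = f"param_{len(params)}"
    params[name] = value.lower()
    return f"(toLower({alias}.{field}) CONTAINS ${name})"


def build_where(filters: Dict[str, Any], params: Dict[str, Any], alias: str = "node") -> str:
    """Translate a filter dict into a WHERE fragment (toy version)."""
    clauses = []
    for key, value in filters.items():
        if key in ("$or", "$and"):
            # Explicit logical operator: join the sub-filters with OR / AND
            joiner = " OR " if key == "$or" else " AND "
            parts = [build_where(sub, params, alias) for sub in value]
            clauses.append("(" + joiner.join(parts) + ")")
        else:
            clauses.append(ilike_clause(key, value["$ilike"], params, alias))
    # Several keys in the same dict are combined with AND by default
    return " AND ".join(clauses)


implicit_and = {
    "subject": {"$ilike": "invoice"},
    "snippet": {"$ilike": "invoice"},
}
explicit_or = {
    "$or": [
        {"subject": {"$ilike": "invoice"}},
        {"snippet": {"$ilike": "invoice"}},
    ]
}

print(build_where(implicit_and, {}))
print(build_where(explicit_or, {}))
```

The first call joins the two conditions with AND, so a node must match on both properties; the second wraps them in a single OR group, so a match on either property is enough.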

For your other question, the two approaches you compare do not do the same things:

Hope that helps!

letoribo commented 1 month ago

using

"filters": {
    "$or": [
        {"subject": {"$ilike": query_text}},
        {"snippet": {"$ilike": query_text}},
    ]
  }

the search_query produced by get_search_results is:

MATCH (node:MailItem) WHERE node.embedding IS NOT NULL AND size(node.embedding) = toInteger($embedding_dimension) AND ((toLower(node.subject) CONTAINS $param_0) AND (toLower(node.snippet) CONTAINS $param_1)) WITH node, vector.similarity.cosine(node.embedding, $query_vector) AS score ORDER BY score DESC LIMIT $top_k RETURN node {.snippet, .subject} as node, score

In my example it is:

MATCH (node:MailItem) WHERE node.embedding IS NOT NULL AND size(node.embedding) = toInteger($embedding_dimension) AND ((toLower(node.subject) CONTAINS $param_0) OR (toLower(node.snippet) CONTAINS $param_1)) WITH node, vector.similarity.cosine(node.embedding, $query_vector) AS score ORDER BY score DESC LIMIT $top_k RETURN node {.snippet, .subject} as node, score

letoribo commented 1 month ago

Neo4jVector.from_existing_graph creates both the index and the vectors for the properties of interest, and it takes little time.

stellasia commented 1 month ago

Can you share the package version and the call to get_search_query please?

Because on my dev branch, if I call it like this:

get_search_query(SearchType.VECTOR, node_label="Label", embedding_node_property="embedding", embedding_dimension=10, filters={
    "$or": [
        {"subject": {"$ilike": "<query_text>"}},
        {"snippet": {"$ilike": "<query_text>"}},
    ]
  })

I get:

'MATCH (node:`Label`) WHERE node.`embedding` IS NOT NULL AND size(node.`embedding`) = toInteger($embedding_dimension) AND ((toLower(node.subject) CONTAINS $param_0) OR (toLower(node.snippet) CONTAINS $param_1)) WITH node, vector.similarity.cosine(node.`embedding`, $query_vector) AS score ORDER BY score DESC LIMIT $top_k RETURN node { .*, `embedding`: null } AS node, labels(node) AS nodeLabels, elementId(node) AS id, score',

which seems correct to me, we have the "OR" operator between the two "CONTAINS", or am I missing something?

letoribo commented 1 month ago

image

letoribo commented 1 month ago

https://github.com/neo4j/neo4j-graphrag-python/blob/main/src/neo4j_graphrag/retrievers/vector.py#L196

stellasia commented 1 month ago

I mean: what does your call look like where you define the filters?

stellasia commented 1 month ago

Try this:

retriever_result = vector_retriever.search(
    query_text=query_text, 
    #query_vector=query_vector[0], 
    top_k=20, 
    filters={"$or": [{"subject": {"$ilike": query_text}}, {"snippet": {"$ilike": query_text}}]} 
)

letoribo commented 1 month ago

Now it works. Perhaps the restart was unsuccessful. Thank you, @stellasia.

stellasia commented 1 month ago

Closing this issue then as it seems to be solved. Feel free to reopen if this is not the case.