letoribo commented 1 month ago

When I build the store like this:

Neo4jVector.from_existing_graph(
    embedding = embeddings,
    index_name=VECTOR_INDEX_NAME,
    node_label=VECTOR_NODE_LABEL,
    text_node_properties=["subject", "snippet"],
    embedding_node_property=VECTOR_EMBEDDING_PROPERTY
)

it takes 5-6 seconds on 50 nodes, while getting the same result with upsert_vector takes about 55-60 seconds.

This is what I did with upsert_vector:

query = """
MATCH (n:MailItem)
RETURN n.snippet AS snippet, n.subject AS subject, elementId(n) AS elementId
"""
# execute_query returns (records, summary, keys); keep only the records
records, _, _ = driver.execute_query(query)

def result_iterator(records):
    for node in records:
        snippet, subject, elementId = node.values()
        node_properties = f"{subject}\n{snippet}"
        query_vector = embedder.embed_query(node_properties)
        upsert_vector(
            driver,
            node_id=elementId,
            embedding_property="embedding",
            vector=query_vector,
        )

if len(records):
    result_iterator(records)
    return {"info": "Created Store"}
else:
    return {"info": "There are no emails in DB"}

Also, regarding filters, I need to return nodes where:

retriever_config={
  "filters": {
    "subject": {"$ilike": query_text},
    "snippet": {"$ilike": query_text},
  }
},

the query text is in either the snippet or the subject.

In https://github.com/neo4j/neo4j-graphrag-python/blob/main/src/neo4j_graphrag/filters.py#L322, if I replace:

{OPERATOR_AND: [{k: v} for k, v in filter.items()]}, param_store, node_alias

with {OPERATOR_OR: [{k: v} for k, v in filter.items()]}, param_store, node_alias, it works as expected.

It probably makes sense to add a parameter that determines how the filters are combined.

stellasia commented 1 month ago

Hi @letoribo ,

Regarding filters, "AND" is the default behavior. You can explicitly use an "OR" condition with:

"filters": {
    "$or": [
        {"subject": {"$ilike": query_text}},
        {"snippet": {"$ilike": query_text}},
    ]
  }

See some examples in the documentation.
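To make the difference concrete, here is a toy translation of the two filter shapes into a WHERE fragment. It is a simplified illustration, not the library's code: `to_where` is a hypothetical helper that handles only `$ilike`, `$and`, and `$or`, and lowercasing the parameter value is an assumption.

```python
def to_where(filters: dict, params: dict, alias: str = "node") -> str:
    """Toy translation of a filter dict into a Cypher WHERE fragment."""
    if len(filters) > 1:
        # Several keys at the same level are combined with AND by default.
        return to_where({"$and": [{k: v} for k, v in filters.items()]}, params, alias)
    (key, value), = filters.items()
    if key in ("$and", "$or"):
        joiner = " AND " if key == "$and" else " OR "
        parts = [f"({to_where(f, params, alias)})" for f in value]
        return "(" + joiner.join(parts) + ")"
    # Leaf of the form {"field": {"$ilike": text}}
    (op, text), = value.items()
    if op != "$ilike":
        raise ValueError(f"unsupported operator in this sketch: {op}")
    name = f"param_{len(params)}"
    params[name] = text.lower()  # assumption: case-insensitive match via lowercasing
    return f"toLower({alias}.{key}) CONTAINS ${name}"
```

With two top-level keys the clauses are joined by AND; wrapping them in "$or" joins them by OR, which is the behavior wanted here.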

For your other question, the two approaches you compare do not do the same thing:

Neo4jVector.from_existing_graph assumes vectors are already in the graph and only checks for the existence of the indexes (and creates them if they do not exist).
upsert_vector pushes new vectors to the DB and adds them to an existing index, performing one Cypher query for each (potentially large) vector, which explains why it requires more time.
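
If the per-vector round trips are the bottleneck, one option is to send several vectors per query. This is a minimal sketch, assuming Neo4j 5.13+ (for the db.create.setNodeVectorProperty procedure) and embeddings that are already computed. `upsert_vectors_batched` and `chunked` are hypothetical helpers, not part of neo4j_graphrag.

```python
from itertools import islice
from typing import Any, Iterable, Iterator

# Hypothetical batch query: one round trip writes many vectors.
# Assumes Neo4j 5.13+, where db.create.setNodeVectorProperty is available.
BATCH_QUERY = """
UNWIND $rows AS row
MATCH (n) WHERE elementId(n) = row.id
CALL db.create.setNodeVectorProperty(n, $prop, row.vector)
RETURN count(n) AS updated
"""


def chunked(items: Iterable[Any], size: int) -> Iterator[list]:
    """Yield consecutive lists of at most `size` items."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch


def upsert_vectors_batched(driver, rows, embedding_property="embedding", batch_size=100):
    """Write rows shaped like {"id": elementId, "vector": [...]} in batches.

    Returns the number of queries sent.
    """
    sent = 0
    for batch in chunked(rows, batch_size):
        # Keyword arguments are passed to the driver as query parameters.
        driver.execute_query(BATCH_QUERY, rows=batch, prop=embedding_property)
        sent += 1
    return sent
```

With 50 nodes and the default batch size this sends a single write query instead of 50; the time spent computing the embeddings themselves is unchanged.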

Hope that helps!

letoribo commented 1 month ago

Using

"filters": {
    "$or": [
        {"subject": {"$ilike": query_text}},
        {"snippet": {"$ilike": query_text}},
    ]
  }

the search_query produced by get_search_results is: MATCH (node:MailItem) WHERE node.embedding IS NOT NULL AND size(node.embedding) = toInteger($embedding_dimension) AND ((toLower(node.subject) CONTAINS $param_0) AND (toLower(node.snippet) CONTAINS $param_1)) WITH node, vector.similarity.cosine(node.embedding, $query_vector) AS score ORDER BY score DESC LIMIT $top_k RETURN node {.snippet, .subject} as node, score

In my example it is: MATCH (node:MailItem) WHERE node.embedding IS NOT NULL AND size(node.embedding) = toInteger($embedding_dimension) AND ((toLower(node.subject) CONTAINS $param_0) OR (toLower(node.snippet) CONTAINS $param_1)) WITH node, vector.similarity.cosine(node.embedding, $query_vector) AS score ORDER BY score DESC LIMIT $top_k RETURN node {.snippet, .subject} as node, score

letoribo commented 1 month ago

Neo4jVector.from_existing_graph creates both the index and the vectors for the properties of interest, and it takes a short time.

stellasia commented 1 month ago

Can you share the package version and the call to get_search_query please?

Because on my dev branch, if I call it like this:

get_search_query(SearchType.VECTOR, node_label="Label", embedding_node_property="embedding", embedding_dimension=10, filters={
    "$or": [
        {"subject": {"$ilike": "<query_text>"}},
        {"snippet": {"$ilike": "<query_text>"}},
    ]
  })

I get:

'MATCH (node:`Label`) WHERE node.`embedding` IS NOT NULL AND size(node.`embedding`) = toInteger($embedding_dimension) AND ((toLower(node.subject) CONTAINS $param_0) OR (toLower(node.snippet) CONTAINS $param_1)) WITH node, vector.similarity.cosine(node.`embedding`, $query_vector) AS score ORDER BY score DESC LIMIT $top_k RETURN node { .*, `embedding`: null } AS node, labels(node) AS nodeLabels, elementId(node) AS id, score',

which seems correct to me: we have the "OR" operator between the two "CONTAINS". Or am I missing something?

letoribo commented 1 month ago

https://github.com/neo4j/neo4j-graphrag-python/blob/main/src/neo4j_graphrag/retrievers/vector.py#L196

stellasia commented 1 month ago

I mean, what does your call look like where you define the filters?

stellasia commented 1 month ago

Try this:

retriever_result = vector_retriever.search(
    query_text=query_text, 
    #query_vector=query_vector[0], 
    top_k=20, 
    filters={"$or": [{"subject": {"$ilike": query_text}}, {"snippet": {"$ilike": query_text}}]} 
)

letoribo commented 1 month ago

Now it works. Perhaps the restart was unsuccessful. @stellasia Thank you!

stellasia commented 1 month ago

Closing this issue then as it seems to be solved. Feel free to reopen if this is not the case.

neo4j / neo4j-graphrag-python

On creation vector store and retriever filters #186
