run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License
35.84k stars 5.08k forks

[Bug]: Neo4j Vector query returns Empty response #12375

Open JPonsa opened 6 months ago

JPonsa commented 6 months ago

Bug Description

llama-index (0.10.25), llama-index-vector-stores-neo4jvector (0.1.3), llama-index-embeddings-huggingface (0.1.4)

I have a KG with embeddings stored as node properties, but I am not able to obtain a response when querying against the KG. It must be either a bug or a user error in how I configured llama-index. I have a working example in langchain.

I would like to use llamaindex over langchain because I am getting better results in txt2SQL with llamaindex than with langchain, and I would like to use only one of the two for the entire project to reduce the number of dependencies.

@tomasonjo I feel this could land on your plate. Also, I noticed that langchain has more functionality than llama-index regarding neo4j vector search. E.g. I could not see how to do a similarity_search_with_score in llama-index. Maybe it is me; I find it harder to follow the documentation in llama-index. It would be great if we could have feature parity. Sorry, I know it must be hard to keep track of langchain, llama-index and dspy.

BTW, I would love a KG-RAG example that is not based on documents but queries a KG without relying on Cypher. E.g. finding relevant nodes based on a vector index, expanding the search to linked nodes (n relations), and passing the KG and the node attributes to the LLM.
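To make that concrete, here is a store-agnostic sketch of the flow I mean. Everything below (the toy graph, node names, the expand helper) is hypothetical illustration only, not real LlamaIndex or Neo4j API:

```python
from collections import deque

# Toy KG: an adjacency list standing in for Neo4j relationships (hypothetical data).
GRAPH = {
    "Anaemia": ["Blood disorders", "Fatigue"],
    "Blood disorders": ["Anaemia", "Leukopenia"],
    "Fatigue": [],
    "Leukopenia": ["Blood disorders"],
}

def expand(seed: str, hops: int) -> set[str]:
    """BFS out to `hops` relations from a vector-search hit."""
    seen, frontier = {seed}, deque([(seed, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == hops:
            continue
        for nbr in GRAPH.get(node, []):
            if nbr not in seen:
                seen.add(nbr)
                frontier.append((nbr, depth + 1))
    return seen

# Step 1: the vector index returns "Anaemia" as the top hit.
# Step 2: expand 1 relation out; the resulting node set (plus its
# attributes) is what would go into the LLM prompt.
print(sorted(expand("Anaemia", 1)))  # ['Anaemia', 'Blood disorders', 'Fatigue']
```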

Version

0.10.25

Steps to Reproduce


username = "tester"
password  = "password"
url = "bolt://localhost:7687"
database="ctgov"
node_label = "AdverseEvent"
embedding_node_property="biobert_emb"
index_name = "adverse_event"
text_node_properties = ["term","organ_system"]
user_query = "Anaemia"

### Using Llama-index ###
from llama_index.llms.ollama import Ollama
from llama_index.core import Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding as LI_HF

from llama_index.vector_stores.neo4jvector import Neo4jVectorStore
from llama_index.core import VectorStoreIndex

llm = Ollama(model="mistral", request_timeout=30.0)
li_biobert = LI_HF(model_name="dmis-lab/biobert-base-cased-v1.1")

Settings.llm = llm
Settings.embed_model = li_biobert

neo4j_vector = Neo4jVectorStore(
    username,
    password,
    url,
    768,
    node_label=node_label,
    embedding_node_property=embedding_node_property,
    text_node_property=text_node_properties[0],
    index_name=index_name,
    hybrid_search=True
)

loaded_index = VectorStoreIndex.from_vector_store(neo4j_vector).as_query_engine()
test = loaded_index.query(user_query)
print(test)

# output
 Empty Response

### Using Langchain ###
from langchain_community.vectorstores import Neo4jVector
from langchain_community.embeddings import HuggingFaceEmbeddings as LC_HF

lc_biobert = LC_HF(model_name="dmis-lab/biobert-base-cased-v1.1")

adverse_event = Neo4jVector.from_existing_graph(
    username=username, 
    password=password, 
    database=database,
    url=url,
    node_label= node_label, 
    embedding_node_property= embedding_node_property, 
    text_node_properties= text_node_properties,
    index_name=index_name,
    search_type="hybrid",
    keyword_index_name= "adverse_event_kw",
    embedding= lc_biobert,
    )

test = adverse_event.similarity_search_with_score(user_query, k=3)
print(test)

# output
[(Document(page_content='\nterm: Anaemia\norgan_system: Blood and lymphatic system disorders', metadata={'stats': ...

Relevant Logs/Tracebacks

No response

dosubot[bot] commented 6 months ago

Hey there @JPonsa! 🌟 Oh, it's great to see you back here! It really feels like we're becoming quite the troubleshooting duo, doesn't it? 😄 I'm diving into your issue right now, so hang tight - I'll have a full answer ready for you shortly!

tomasonjo commented 6 months ago

Hard to say what's wrong. You probably want to pass keyword_index_name in llamaindex as well.

As for returning the score with results, that's more of a question for the LlamaIndex folks. @logan-markewich

dosubot[bot] commented 6 months ago

It seems there are several potential reasons for receiving an empty response when querying your Knowledge Graph using LlamaIndex. Here's a breakdown of possible causes and their solutions:

  1. Keywords Extraction: Ensure the query string is clear and contains relevant keywords for the Knowledge Graph. If the extracted keywords are not relevant, the search might not return any results.

  2. Graph Store Query Depth: Adjusting the graph_store_query_depth parameter in the KGTableRetriever class might help. If the depth is not sufficient to reach relevant nodes or too deep, causing the search to be too broad or too narrow, this could be the issue.

  3. Retriever Mode Configuration: The retriever_mode parameter should align with the available data and desired retrieval method. If it's set to EMBEDDING but the embeddings are not properly set up, the embedding-based retrieval will fail. Ensure that the embeddings are correctly configured.

  4. Embeddings Construction: Ensure that the Knowledge Graph index is constructed with embeddings if you intend to use the EMBEDDING or HYBRID modes. If the Knowledge Graph index was not constructed with embeddings, and the retriever mode is not set to KEYWORD, the system might not return relevant results.

Regarding performing a similarity_search_with_score similar to Langchain in LlamaIndex, ensure that the similarity_top_k parameter is set to the desired number of top embeddings to use. This involves generating embeddings for the query and comparing them with the embeddings of nodes in the Knowledge Graph to find the top-k similar items.
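As a rough illustration (toy vectors and hypothetical names, not the actual LlamaIndex implementation), a similarity_search_with_score conceptually reduces to scoring every node embedding against the query embedding and keeping the top-k:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def similarity_search_with_score(query_emb, node_embs, k=3):
    """Score every node embedding against the query and keep the top-k."""
    scored = [(name, cosine(query_emb, emb)) for name, emb in node_embs.items()]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:k]

# Toy 3-dimensional embeddings standing in for 768-dim BioBERT vectors.
nodes = {
    "Anaemia": [1.0, 0.1, 0.0],
    "Headache": [0.0, 1.0, 0.2],
    "Fatigue": [0.9, 0.2, 0.1],
}
print(similarity_search_with_score([1.0, 0.0, 0.0], nodes, k=2))
```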

For more detailed information on these methods and parameters, you can refer to the LlamaIndex core code.

If there are further questions or need for clarification, feel free to reach out.


#### About Dosu

This response is meant to be useful and save you time. It is **not meant to be a precise solution**, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant. If you want to continue the conversation, start your reply with @dosu-bot.
JPonsa commented 6 months ago

@tomasonjo

Adding the keyword_index_name causes the error below. It complains that the index already exists. (1) That is kind of the point; I would like to reuse existing indexes. (2) The error still shows even if I give it a new name, which I don't understand; it must be complaining about another index. The error does not show up if I don't specify keyword_index_name or if I set it to what I assume is the default, keyword_index_name="keyword".


ClientError                               Traceback (most recent call last)
Cell In[5], line 14
     11 Settings.llm = llm
     12 Settings.embed_model = li_biobert
---> 14 neo4j_vector = Neo4jVectorStore(
     15     username,
     16     password,
     17     url,
     18     768,
     19     node_label=node_label,
     20     embedding_node_property=embedding_node_property,
     21     text_node_property=text_node_properties[0],
     22     index_name=index_name,
     23     keyword_index_name="adverse_event_kw",
     24     hybrid_search=True
     25 )
     27 loaded_index = VectorStoreIndex.from_vector_store(neo4j_vector).as_query_engine()
     28 test = loaded_index.query(user_query)

File ....venv\Lib\site-packages\llama_index\vector_stores\neo4jvector\base.py:175, in Neo4jVectorStore.__init__(self, username, password, url, embedding_dimension, database, index_name, keyword_index_name, node_label, embedding_node_property, text_node_property, distance_strategy, hybrid_search, retrieval_query, **kwargs)
    173 # If the FTS index doesn't exist yet
    174 if not fts_node_label:
--> 175     self.create_new_keyword_index()
    176 else:  # Validate that FTS and Vector index use the same information
    177     if not fts_node_label == self.node_label:

File ....venv\Lib\site-packages\llama_index\vector_stores\neo4jvector\base.py:307, in Neo4jVectorStore.create_new_keyword_index(self, text_node_properties)
    301 node_props = text_node_properties or [self.text_node_property]
    302 fts_index_query = (
    303     f"CREATE FULLTEXT INDEX {self.keyword_index_name} "
    304     f"FOR (n:{self.node_label}) ON EACH "
    305     f"[{', '.join(['n.`' + el + '`' for el in node_props])}]"
    306 )
--> 307 self.database_query(fts_index_query)

File ....venv\Lib\site-packages\llama_index\vector_stores\neo4jvector\base.py:326, in Neo4jVectorStore.database_query(self, query, params)
    324 with self._driver.session(database=self._database) as session:
    325     try:
--> 326         data = session.run(query, params)
    327         return [r.data() for r in data]
    328     except CypherSyntaxError as e:

File ....venv\Lib\site-packages\neo4j\_sync\work\session.py:313, in Session.run(self, query, parameters, **kwargs)
    311 bookmarks = self._get_bookmarks()
    312 parameters = dict(parameters or {}, **kwargs)
--> 313 self._auto_result._run(
    314     query, parameters, self._config.database,
    315     self._config.impersonated_user, self._config.default_access_mode,
    316     bookmarks, self._config.notifications_min_severity,
    317     self._config.notifications_disabled_categories,
    318 )
    320 return self._auto_result

File ....venv\Lib\site-packages\neo4j\_sync\work\result.py:181, in Result._run(self, query, parameters, db, imp_user, access_mode, bookmarks, notifications_min_severity, notifications_disabled_categories)
    179 self._pull()
    180 self._connection.send_all()
--> 181 self._attach()

File ....venv\Lib\site-packages\neo4j\_sync\work\result.py:301, in Result._attach(self)
    299 if self._exhausted is False:
    300     while self._attached is False:
--> 301         self._connection.fetch_message()

File ....venv\Lib\site-packages\neo4j\_sync\io\_common.py:178, in ConnectionErrorHandler.__getattr__.<locals>.outer.<locals>.inner(*args, **kwargs)
    176 def inner(*args, **kwargs):
    177     try:
--> 178         func(*args, **kwargs)
    179     except (Neo4jError, ServiceUnavailable, SessionExpired) as exc:
    180         assert not asyncio.iscoroutinefunction(self.__on_error)

File ....venv\Lib\site-packages\neo4j\_sync\io\_bolt.py:849, in Bolt.fetch_message(self)
    845 # Receive exactly one message
    846 tag, fields = self.inbox.pop(
    847     hydration_hooks=self.responses[0].hydration_hooks
    848 )
--> 849 res = self._process_message(tag, fields)
    850 self.idle_since = monotonic()
    851 return res

File ....venv\Lib\site-packages\neo4j\_sync\io\_bolt5.py:369, in Bolt5x0._process_message(self, tag, fields)
    367 self._server_state_manager.state = self.bolt_states.FAILED
    368 try:
--> 369     response.on_failure(summary_metadata or {})
    370 except (ServiceUnavailable, DatabaseUnavailable):
    371     if self.pool:

File ....venv\Lib\site-packages\neo4j\_sync\io\_common.py:245, in Response.on_failure(self, metadata)
    243 handler = self.handlers.get("on_summary")
    244 Util.callback(handler)
--> 245 raise Neo4jError.hydrate(**metadata)

ClientError: {code: Neo.ClientError.Schema.IndexAlreadyExists} {message: There already exists an index (:Chunk {term}).}

tomasonjo commented 6 months ago

What do you get when you run this query? The code is identical in LlamaIndex and Langchain, btw...

SHOW INDEXES YIELD name, type, labelsOrTypes, properties, options
WHERE type IN ['FULLTEXT', 'VECTOR']
RETURN *
JPonsa commented 6 months ago

Please see the output below. I deleted the "keyword" index but the issue persists. I am sure the code must be almost identical, but could there be subtle differences? For example, langchain accepts text_node_properties as a list, while llama_index seems to accept only a str. Wondering if that is the cause. Note that in my example I had to use slightly different parameters for the text_node_property(ies).

[
  {
    "name": "Condition",
    "type": "VECTOR",
    "labelsOrTypes": [
      "NamedThing"
    ],
    "properties": [
      "biobert_emb"
    ],
    "options": {
      "indexProvider": "vector-1.0",
      "indexConfig": {
        "vector.dimensions": 768,
        "vector.similarity_function": "cosine"
      }
    }
  },
  {
    "name": "adverse_event",
    "type": "VECTOR",
    "labelsOrTypes": [
      "AdverseEvent"
    ],
    "properties": [
      "biobert_emb"
    ],
    "options": {
      "indexProvider": "vector-1.0",
      "indexConfig": {
        "vector.dimensions": 768,
        "vector.similarity_function": "cosine"
      }
    }
  },
  {
    "name": "adverse_event_kw",
    "type": "FULLTEXT",
    "labelsOrTypes": [
      "AdverseEvent"
    ],
    "properties": [
      "term",
      "organ_system"
    ],
    "options": {
      "indexProvider": "fulltext-1.0",
      "indexConfig": {
        "fulltext.analyzer": "standard-no-stop-words",
        "fulltext.eventually_consistent": false
      }
    }
  },
  {
    "name": "condition",
    "type": "VECTOR",
    "labelsOrTypes": [
      "Condition"
    ],
    "properties": [
      "biobert_emb"
    ],
    "options": {
      "indexProvider": "vector-1.0",
      "indexConfig": {
        "vector.dimensions": 768,
        "vector.similarity_function": "cosine"
      }
    }
  },
  {
    "name": "condition_kw",
    "type": "FULLTEXT",
    "labelsOrTypes": [
      "Condition"
    ],
    "properties": [
      "id"
    ],
    "options": {
      "indexProvider": "fulltext-1.0",
      "indexConfig": {
        "fulltext.analyzer": "standard-no-stop-words",
        "fulltext.eventually_consistent": false
      }
    }
  },
  {
    "name": "intervention",
    "type": "VECTOR",
    "labelsOrTypes": [
      "Intervention"
    ],
    "properties": [
      "biobert_emb"
    ],
    "options": {
      "indexProvider": "vector-1.0",
      "indexConfig": {
        "vector.dimensions": 768,
        "vector.similarity_function": "cosine"
      }
    }
  },
  {
    "name": "intervention_kw",
    "type": "FULLTEXT",
    "labelsOrTypes": [
      "Intervention"
    ],
    "properties": [
      "id",
      "type"
    ],
    "options": {
      "indexProvider": "fulltext-1.0",
      "indexConfig": {
        "fulltext.analyzer": "standard-no-stop-words",
        "fulltext.eventually_consistent": false
      }
    }
  },
  {
    "name": "keyword",
    "type": "FULLTEXT",
    "labelsOrTypes": [
      "AdverseEvent"
    ],
    "properties": [
      "name",
      "type",
      "mapped_names"
    ],
    "options": {
      "indexProvider": "fulltext-1.0",
      "indexConfig": {
        "fulltext.analyzer": "standard-no-stop-words",
        "fulltext.eventually_consistent": false
      }
    }
  }
]
tomasonjo commented 6 months ago

We check the existence of the FTS index by its name, so it should definitely find it, even though the index has multiple properties, which LlamaIndex can't create but can use.

"SHOW INDEXES YIELD name, type, labelsOrTypes, properties, options "
"WHERE type = 'FULLTEXT' AND (name = $keyword_index_name "
"OR (labelsOrTypes = [$node_label] AND "
"properties = $text_node_property)) "
"RETURN name, labelsOrTypes, properties, options ",
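Restated in plain Python against the FULLTEXT entries from your SHOW INDEXES output (a sketch of the same predicate, not the actual library code):

```python
def fts_index_matches(index, keyword_index_name, node_label, text_node_property):
    """Mirror of the WHERE clause: match by name, or by label + exact property list."""
    return index["type"] == "FULLTEXT" and (
        index["name"] == keyword_index_name
        or (index["labelsOrTypes"] == [node_label]
            and index["properties"] == [text_node_property])
    )

# Two of the FULLTEXT entries from the SHOW INDEXES output in this thread.
indexes = [
    {"name": "adverse_event_kw", "type": "FULLTEXT",
     "labelsOrTypes": ["AdverseEvent"], "properties": ["term", "organ_system"]},
    {"name": "keyword", "type": "FULLTEXT",
     "labelsOrTypes": ["AdverseEvent"], "properties": ["name", "type", "mapped_names"]},
]

# keyword_index_name="adverse_event_kw" matches the first entry by name alone,
# even though its property list ["term", "organ_system"] != ["term"].
hits = [ix["name"] for ix in indexes
        if fts_index_matches(ix, "adverse_event_kw", "AdverseEvent", "term")]
print(hits)  # ['adverse_event_kw']
```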

Note that from_existing_graph creates a custom retrieval_query that fetches all text properties. Unfortunately, from_existing_graph can't be reproduced in LlamaIndex easily as they have a different logic for instantiating vector stores and populating them.

If you can give me a reproducible example I can investigate more.

JPonsa commented 6 months ago

I sent you a dump of the db via email. I hope that is ok.

tomasonjo commented 6 months ago

I get this error because the nodes weren't created using LlamaIndex:

ValueError: Node content not found in metadata dict.

JPonsa commented 6 months ago

My starting point isn't document chunks. The nodes were created directly in Neo4j by uploading a set of CSV files using bin/neo4j-admin import.

Regarding the ValueError, I am not getting any error other than "Empty Response" on my end; a version thing?

The indexes were likely created when I used langchain's Neo4jVector.from_existing_graph. I suppose I could create them directly in Neo4j if that is not possible in LlamaIndex.

tomasonjo commented 6 months ago

Probably a version thing; I use the latest versions, as I just installed them. Yeah, LlamaIndex doesn't support multiple instantiation methods... If you want to make this work, I would...

I can't really help you beyond that as the libs are just too different.
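For what it's worth, one possible direction (a hedged sketch only, not verified; the Cypher below is an assumption tailored to the AdverseEvent schema in this thread) is to pass a custom retrieval_query — a parameter Neo4jVectorStore does accept, per the traceback above — so the store returns the node's own text properties instead of looking for LlamaIndex's _node_content metadata on nodes that LlamaIndex never created:

```python
# Hypothetical retrieval_query for nodes created outside LlamaIndex: build the
# returned text from the node's own properties, mirroring what langchain's
# from_existing_graph does. The `\n` is a Cypher string escape, hence `\\n` here.
RETRIEVAL_QUERY = (
    "RETURN 'term: ' + coalesce(node.term, '') + "
    "'\\norgan_system: ' + coalesce(node.organ_system, '') AS text, "
    "score, "
    "{source: elementId(node)} AS metadata"
)

# It would then be passed at construction time, e.g. (not executed here):
# neo4j_vector = Neo4jVectorStore(
#     username, password, url, 768,
#     node_label="AdverseEvent",
#     embedding_node_property="biobert_emb",
#     index_name="adverse_event",
#     retrieval_query=RETRIEVAL_QUERY,
# )
print("organ_system" in RETRIEVAL_QUERY)  # True
```

Whether the exact column names expected by the store match this sketch would need checking against the neo4jvector source.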

kuguadawang12138 commented 6 months ago

@JPonsa Hello. I'm having the same problem as you, but I'm using KG-RAG. I didn't get anything but Empty Response. I think the underlying problem should be the same: it may be a version issue that caused the input to not reach the parameters correctly.

kuguadawang12138 commented 6 months ago

If you have a way to connect Neo4j with LlamaIndex, could you share it?

CodeAlpha7 commented 2 months ago

Facing the same issue here but with Langchain instead. I have an existing knowledge graph on AuraDB with each node having an embedding property. As a starting point, I simply want to instantiate an index and run a similarity search as shown below:

Code:

from langchain_community.embeddings.fastembed import FastEmbedEmbeddings
from dotenv import load_dotenv
from langchain_community.graphs import Neo4jGraph
from langchain_community.vectorstores import Neo4jVector
from pprint import pprint
import os

embeddings = FastEmbedEmbeddings()
neo4j_vector = Neo4jVector.from_existing_graph(
    embeddings,
    url=NEO4J_URI,
    username=NEO4J_USERNAME,
    password=NEO4J_PASSWORD,
    index_name="neo4j_vector",
    node_label="unique_target_node",
    embedding_node_property="embeddings",
    text_node_properties=["name"]
)

query = "tesla Model X"
try:
    # Perform similarity search and print response
    response = neo4j_vector.similarity_search(query)
    print("Response:", response)
except Exception as e:
    print(f"Error: {e}")

Output:

"Response: [ ]"

Any suggestions on this?

kuguadawang12138 commented 2 months ago

This is an automated holiday reply from QQ Mail. Hello, I have received your email. Thank you.