run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Question]: how can i solve the problem that property_graph_store.json can not be saved as utf8? #15897

Open 912100012 opened 2 months ago

912100012 commented 2 months ago

Question Validation

Question

When I use PropertyGraphIndex to persist to disk, the other files such as "default__vector_store.json" are saved as UTF-8 because json.dump(ensure_ascii=True) is used when vector_store.persist() is called. But the property graph's persist() uses model_dump_json(), and model_dump_json() has no ensure_ascii=True parameter to convert the Chinese characters into escape sequences like "\u6235". So how can I solve this problem? Only the property_graph_store.json file cannot be saved as UTF-8.
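For context, the difference between the two serialization paths can be reproduced with the standard library alone (a minimal sketch; the sample dict is hypothetical, not LlamaIndex's actual payload):

```python
import json

data = {"text": "中心思想"}

# ensure_ascii=True (the json module default) escapes non-ASCII characters
escaped = json.dumps(data, ensure_ascii=True)

# ensure_ascii=False keeps the characters as literal UTF-8 text
readable = json.dumps(data, ensure_ascii=False)

print(escaped)   # {"text": "\u4e2d\u5fc3\u601d\u60f3"}
print(readable)  # {"text": "中心思想"}
```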

dosubot[bot] commented 2 months ago

To control how non-ASCII characters are written when persisting the PropertyGraphIndex, you can override the persist() method and handle the JSON serialization yourself with the ensure_ascii=True parameter. Since the model_dump_json() method does not expose this parameter, manual serialization is required.

Here's an example of how you can override the persist() method to achieve this:

import json
from typing import Optional

import fsspec
from llama_index.core import PropertyGraphIndex

class CustomPropertyGraphIndex(PropertyGraphIndex):
    def persist(self, persist_path: str, fs: Optional[fsspec.AbstractFileSystem] = None) -> None:
        """Persist the graph store to a file with ensure_ascii=True."""
        if fs is None:
            fs = fsspec.filesystem("file")
        # Assumes a SimplePropertyGraphStore, whose in-memory graph is a
        # pydantic model exposing model_dump().
        with fs.open(persist_path, "w") as f:
            json.dump(self.property_graph_store.graph.model_dump(), f, ensure_ascii=True)

# Usage
custom_index = CustomPropertyGraphIndex(
    nodes=[],  # Your list of nodes
    llm=llm,  # Your language model
    kg_extractors=[custom_transform],  # Your custom transform
    property_graph_store=SimplePropertyGraphStore(),
    use_async=True,
    embed_kg_nodes=True,
    show_progress=True
)

custom_index.persist("path/to/property_graph_store.json")

In this example, the persist() method is overridden to use json.dump() with ensure_ascii=True, ensuring that non-ASCII characters are properly escaped.

This approach should solve the UTF-8 encoding issue for the property_graph_store.json file [1][2].
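Conversely, if the goal is a file containing literal, readable Chinese characters, ensure_ascii=False combined with an explicit UTF-8 encoding achieves that (a minimal local-file sketch; graph_data stands in for the store's model_dump() output):

```python
import json

# `graph_data` stands in for self.property_graph_store.graph.model_dump()
graph_data = {"nodes": {"n1": {"label": "text_chunk", "text": "中心思想"}}}

# ensure_ascii=False writes the characters literally; encoding="utf-8"
# fixes the file's byte encoding regardless of the platform default.
with open("property_graph_store.json", "w", encoding="utf-8") as f:
    json.dump(graph_data, f, ensure_ascii=False)

with open("property_graph_store.json", encoding="utf-8") as f:
    content = f.read()

print(content)  # {"nodes": {"n1": {"label": "text_chunk", "text": "中心思想"}}}
```

Either variant round-trips through json.load identically; the ensure_ascii flag only changes how the file looks in a text editor.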

To continue talking to Dosu, mention @dosu.

912100012 commented 2 months ago

Why does this error happen? Code and error follow. @dosu

code:

storage_context = StorageContext.from_defaults(persist_dir="../storage/yanbao-SchemaEXT")
index = load_index_from_storage(storage_context)
query_engine = index.as_query_engine()

streaming_response = query_engine.query("请你说一下,文字的中心思想?")
streaming_response.print_response_stream()

error:

Traceback (most recent call last):
  File "F:\Project\pythonWorkspace\text2Sql-Demo\test\testCn.py", line 12, in <module>
    streaming_response = query_engine.query("请你说一下,文字的中心思想?")
  File "F:\Project\pythonWorkspace\text2Sql-Demo\venv\lib\site-packages\llama_index\core\instrumentation\dispatcher.py", line 265, in wrapper
    result = func(*args, **kwargs)
  File "F:\Project\pythonWorkspace\text2Sql-Demo\venv\lib\site-packages\llama_index\core\base\base_query_engine.py", line 52, in query
    query_result = self._query(str_or_query_bundle)
  File "F:\Project\pythonWorkspace\text2Sql-Demo\venv\lib\site-packages\llama_index\core\instrumentation\dispatcher.py", line 265, in wrapper
    result = func(*args, **kwargs)
  File "F:\Project\pythonWorkspace\text2Sql-Demo\venv\lib\site-packages\llama_index\core\query_engine\retriever_query_engine.py", line 175, in _query
    nodes = self.retrieve(query_bundle)
  File "F:\Project\pythonWorkspace\text2Sql-Demo\venv\lib\site-packages\llama_index\core\query_engine\retriever_query_engine.py", line 130, in retrieve
    nodes = self._retriever.retrieve(query_bundle)
  File "F:\Project\pythonWorkspace\text2Sql-Demo\venv\lib\site-packages\llama_index\core\instrumentation\dispatcher.py", line 265, in wrapper
    result = func(*args, **kwargs)
  File "F:\Project\pythonWorkspace\text2Sql-Demo\venv\lib\site-packages\llama_index\core\base\base_retriever.py", line 245, in retrieve
    nodes = self._retrieve(query_bundle)
  File "F:\Project\pythonWorkspace\text2Sql-Demo\venv\lib\site-packages\llama_index\core\instrumentation\dispatcher.py", line 265, in wrapper
    result = func(*args, **kwargs)
  File "F:\Project\pythonWorkspace\text2Sql-Demo\venv\lib\site-packages\llama_index\core\indices\property_graph\retriever.py", line 52, in _retrieve
    return asyncio_run(self._aretrieve(query_bundle))
  File "F:\Project\pythonWorkspace\text2Sql-Demo\venv\lib\site-packages\llama_index\core\async_utils.py", line 33, in asyncio_run
    return loop.run_until_complete(coro)
  File "E:\Python\Python39\lib\asyncio\base_events.py", line 642, in run_until_complete
    return future.result()
  File "F:\Project\pythonWorkspace\text2Sql-Demo\venv\lib\site-packages\llama_index\core\indices\property_graph\retriever.py", line 64, in _aretrieve
    async_results = await run_jobs(
  File "F:\Project\pythonWorkspace\text2Sql-Demo\venv\lib\site-packages\llama_index\core\instrumentation\dispatcher.py", line 297, in async_wrapper
    result = await func(*args, **kwargs)
  File "F:\Project\pythonWorkspace\text2Sql-Demo\venv\lib\site-packages\llama_index\core\async_utils.py", line 148, in run_jobs
    results = await asyncio.gather(*pool_jobs)
  File "F:\Project\pythonWorkspace\text2Sql-Demo\venv\lib\site-packages\llama_index\core\instrumentation\dispatcher.py", line 297, in async_wrapper
    result = await func(*args, **kwargs)
  File "F:\Project\pythonWorkspace\text2Sql-Demo\venv\lib\site-packages\llama_index\core\async_utils.py", line 139, in worker
    return await job
  File "F:\Project\pythonWorkspace\text2Sql-Demo\venv\lib\site-packages\llama_index\core\instrumentation\dispatcher.py", line 297, in async_wrapper
    result = await func(*args, **kwargs)
  File "F:\Project\pythonWorkspace\text2Sql-Demo\venv\lib\site-packages\llama_index\core\base\base_retriever.py", line 276, in aretrieve
    nodes = await self._aretrieve(query_bundle=query_bundle)
  File "F:\Project\pythonWorkspace\text2Sql-Demo\venv\lib\site-packages\llama_index\core\indices\property_graph\sub_retrievers\base.py", line 145, in _aretrieve
    nodes = await self.async_add_source_text(nodes)
  File "F:\Project\pythonWorkspace\text2Sql-Demo\venv\lib\site-packages\llama_index\core\indices\property_graph\sub_retrievers\base.py", line 134, in async_add_source_text
    return self._add_source_text(nodes, og_node_map)
  File "F:\Project\pythonWorkspace\text2Sql-Demo\venv\lib\site-packages\llama_index\core\indices\property_graph\sub_retrievers\base.py", line 89, in _add_source_text
    assert isinstance(node, BaseNode) or node is None
AssertionError

Process finished with exit code 1

dosubot[bot] commented 2 months ago

The error you're encountering is due to an assertion failure in the _add_source_text method, which checks if each node is an instance of BaseNode or None. This assertion is failing, indicating that one or more elements in the nodes list do not meet this condition.

Here are a few steps to help you debug and resolve this issue:

  1. Check the Type of Nodes: Ensure that all elements in the nodes list passed to the async_add_source_text method are either instances of BaseNode or None.

  2. Debugging the Nodes: Add some debugging statements to log the type of each node before the assertion to identify which node is causing the issue.

Here's an example of how you can modify the _add_source_text method to include debugging statements:

def _add_source_text(
    self, retrieved_nodes: List[NodeWithScore], og_node_map: Dict[str, BaseNode]
) -> List[NodeWithScore]:
    for node in retrieved_nodes:
        if not isinstance(node, BaseNode) and node is not None:
            print(f"Invalid node type: {type(node)}")
        assert isinstance(node, BaseNode) or node is None
    # Rest of your method implementation
  3. Verify the Source of Nodes: Ensure that the nodes being retrieved and passed to the async_add_source_text method are correctly instantiated and not being altered in a way that changes their type.

  4. Check the Retrieval Process: Verify the retrieval process in the _aretrieve method to ensure that it correctly retrieves nodes of the expected type.

By following these steps, you should be able to identify and resolve the issue causing the assertion error. If you find that a specific node type is causing the problem, you can trace back to where it is being created or modified and ensure it adheres to the expected type.

For more context, the nodes variable in the _add_source_text method is expected to be a list of NodeWithScore objects [1].
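The type contract behind the assertion can be mirrored with plain stand-in classes (hypothetical BaseNode/TextNode stand-ins below, not the real llama_index.core.schema classes):

```python
class BaseNode:  # stand-in for llama_index.core.schema.BaseNode
    pass

class TextNode(BaseNode):
    def __init__(self, text: str):
        self.text = text

def check_nodes(nodes):
    """Mirror the assertion in _add_source_text:
    every element must be a BaseNode instance or None."""
    return [type(n).__name__ for n in nodes if not isinstance(n, BaseNode) and n is not None]

print(check_nodes([TextNode("ok"), None]))        # [] -> assertion would pass
print(check_nodes([TextNode("ok"), {"raw": 1}]))  # ['dict'] -> assertion would fail
```

Any name returned by check_nodes points at a value that would trip the AssertionError, so the fix is to trace where that value enters the node list.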

To continue talking to Dosu, mention @dosu.

912100012 commented 2 months ago

Why does my code error happen?

code:

storage_context = StorageContext.from_defaults(persist_dir="../storage/testnew")

error:

raise ValueError(f"Could not infer node type for data: {node_dict!s}")
ValueError: Could not infer node type for data {'label': 'text_chunk', 'embedding': [0.049393251550212604, ...

@dosu

dosubot[bot] commented 2 months ago

The ValueError with the message "Could not infer node type for data" occurs because the StorageContext is unable to determine the type of node being added to the vector_store or docstore. This can happen if the node type is not properly defined or if the data being added does not match the expected structure.

To resolve this issue, ensure that the node type is correctly defined and that the data structure matches the expected format. Here is an example of how to properly create and add a TextNode to the vector_store and docstore:

from llama_index.core import StorageContext
from llama_index.core.schema import TextNode
from llama_index.core.data_structs import IndexDict

# Create a TextNode with the required attributes
node = TextNode(text="test", embedding=[0.0, 0.0, 0.0])

# Initialize StorageContext with the specified directory
storage_context = StorageContext.from_defaults(persist_dir="../storage/testnew")

# Add the node to the vector_store and docstore
storage_context.vector_store.add([node])
storage_context.docstore.add_documents([node])

# Create and add an IndexDict to the index_store
index_struct = IndexDict()
storage_context.index_store.add_index_struct(index_struct)

Ensure that the TextNode class and the methods add, add_documents, and add_index_struct are correctly implemented and handle the node type properly. If the node type cannot be inferred, it may be due to a mismatch or missing implementation in these methods.

Additionally, if you encounter an error indicating that storage files are missing, you may need to create and store a sample VectorStore before initializing the StorageContext. This can be done by ensuring that the necessary storage files are present in the specified directory before initializing the StorageContext.

For example:

from llama_index import (
    StorageContext,
    ServiceContext,
    VectorStoreIndex,
    BeautifulSoupWebReader,
    LangchainEmbedding,
    LLMPredictor,
    load_index_from_storage,
)
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.llms import OpenAI  # LangChain LLM wrapper, as expected by LLMPredictor

# Initialize the necessary components
embed_model = LangchainEmbedding(HuggingFaceEmbeddings())
llm_predictor = LLMPredictor(llm=OpenAI(temperature=0, model_name="text-davinci-002"))
storage_context = StorageContext.from_defaults(persist_dir='../storage/testnew')
service_context = ServiceContext.from_defaults(embed_model=embed_model, llm_predictor=llm_predictor)

# Load documents (urls is your own list of pages to ingest)
urls = ["https://example.com/page"]
documents = BeautifulSoupWebReader().load_data(urls)

# Create and persist the index if it does not exist
try:
    index = load_index_from_storage(storage_context, 'HackerNews')
except Exception as e:
    index = VectorStoreIndex.from_documents(documents, service_context=service_context, storage_context=storage_context)
    index.storage_context.persist()

This ensures that the necessary storage files are created and available for the StorageContext to function correctly [1][2].

To continue talking to Dosu, mention @dosu.

912100012 commented 2 months ago

Request body: {'model': 'qwen-turbo', 'parameters': {'max_tokens': 1000, 'temperature': 0.1, 'seed': 1234, 'result_format': 'message'}, 'input': {'messages': [{'role': 'user', 'content': "Given some initial query, generate synonyms or related keywords up to 10 in total, considering possible cases of capitalization, pluralization, common expressions, etc.\nProvide all synonyms/keywords separated by '^' symbols: 'keyword1^keyword2^...'\nNote, result should be in one-line, separated by '^' symbols.----\nQUERY: 老陈和女主的一些事,请你进行总结\n----\nKEYWORDS: "}]}}
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): dashscope.aliyuncs.com:443
DEBUG:urllib3.connectionpool:https://dashscope.aliyuncs.com:443 "POST /api/v1/services/aigc/text-generation/generation HTTP/11" 200 None
DEBUG:dashscope:Response: {'output': {'choices': [{'finish_reason': 'stop', 'message': {'role': 'assistant', 'content': '老陈^女主^一些事情^总结^关系^情节^发展^互动^故事^事件'}}]}, 'usage': {'total_tokens': 117, 'output_tokens': 21, 'input_tokens': 96}, 'request_id': 'ef9dd55c-ffa6-9548-bbdb-ae05265b9542'}

This is what I see in the console. What is this step doing? Why is my question first expanded into synonyms by the LLM? What purpose does this serve in the later steps? @dosu