Open 912100012 opened 2 months ago
To address the issue of ensuring UTF-8 encoding when persisting the `PropertyGraphIndex`, you can modify the `persist()` method to handle JSON serialization with the `ensure_ascii=True` parameter. Since the `model_dump_json()` method does not support this parameter directly, you can handle the JSON serialization manually. Here's an example of how you can override the `persist()` method to achieve this:
```python
import json
from typing import Optional

import fsspec
from llama_index.core import PropertyGraphIndex


class CustomPropertyGraphIndex(PropertyGraphIndex):
    def persist(self, persist_path: str, fs: Optional[fsspec.AbstractFileSystem] = None) -> None:
        """Persist the graph store to a file with ensure_ascii=True."""
        if fs is None:
            fs = fsspec.filesystem("file")
        with fs.open(persist_path, "w") as f:
            json.dump(self.graph.model_dump(), f, ensure_ascii=True)
```
```python
# Usage
custom_index = CustomPropertyGraphIndex(
    nodes=[],  # Your list of nodes
    llm=llm,  # Your language model
    kg_extractors=[custom_transform],  # Your custom transform
    property_graph_store=SimplePropertyGraphStore(),
    use_async=True,
    embed_kg_nodes=True,
    show_progress=True,
)

custom_index.persist("path/to/property_graph_store.json")
```
In this example, the `persist()` method is overridden to use `json.dump()` with `ensure_ascii=True`, ensuring that non-ASCII characters are properly escaped. This approach should solve the UTF-8 encoding issue for the `property_graph_store.json` file [1][2].
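For reference, `ensure_ascii` only changes how non-ASCII characters are written, not the data itself. A quick standalone check, independent of LlamaIndex:

```python
import json

data = {"text": "老陈"}

# ensure_ascii=True escapes non-ASCII characters as \uXXXX sequences
escaped = json.dumps(data, ensure_ascii=True)
print(escaped)  # {"text": "\u8001\u9648"}

# ensure_ascii=False writes the characters as-is (the file must then be opened as UTF-8)
raw = json.dumps(data, ensure_ascii=False)
print(raw)  # {"text": "老陈"}

# Both forms round-trip to the same object
assert json.loads(escaped) == json.loads(raw) == data
```

So either setting is lossless; the choice only affects whether the file on disk contains raw UTF-8 bytes or ASCII-only escape sequences.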
To continue talking to Dosu, mention @dosu.
Why does this error happen? Code and error follow: @dosu

code:

```python
storage_context = StorageContext.from_defaults(persist_dir="../storage/yanbao-SchemaEXT")
index = load_index_from_storage(storage_context)
query_engine = index.as_query_engine()
```
error:

```
Traceback (most recent call last):
  File "F:\Project\pythonWorkspace\text2Sql-Demo\test\testCn.py", line 12, in
Process finished with exit code 1
```
The error you're encountering is due to an assertion failure in the `_add_source_text` method, which checks that each `node` is an instance of `BaseNode` or `None`. This assertion is failing, indicating that one or more elements in the `nodes` list do not meet this condition.
Here are a few steps to help you debug and resolve this issue:
Check the Type of Nodes: Ensure that all elements in the `nodes` list passed to the `async_add_source_text` method are either instances of `BaseNode` or `None`.

Debug the Nodes: Add debugging statements to log the type of each node before the assertion, to identify which node is causing the issue.
Here's an example of how you can modify the `_add_source_text` method to include debugging statements:
```python
def _add_source_text(
    self, retrieved_nodes: List[NodeWithScore], og_node_map: Dict[str, BaseNode]
) -> List[NodeWithScore]:
    for node in retrieved_nodes:
        if not isinstance(node, BaseNode) and node is not None:
            print(f"Invalid node type: {type(node)}")
        assert isinstance(node, BaseNode) or node is None
    # Rest of the method implementation
```
Verify the Source of Nodes: Ensure that the nodes being retrieved and passed to the `async_add_source_text` method are correctly instantiated and not altered in a way that changes their type.

Check the Retrieval Process: Verify the `_aretrieve` method to ensure that it correctly retrieves nodes of the expected type.
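The type-checking idea in the steps above can be sketched without LlamaIndex at all; the `BaseNode` class here is a stand-in for the real llama_index class, and `find_invalid_nodes` is a hypothetical helper:

```python
from typing import List, Optional


class BaseNode:  # stand-in for llama_index's BaseNode
    def __init__(self, text: str):
        self.text = text


def find_invalid_nodes(nodes: List[Optional[object]]) -> List[int]:
    """Return indices of elements that are neither BaseNode nor None."""
    return [
        i for i, node in enumerate(nodes)
        if not isinstance(node, BaseNode) and node is not None
    ]


# A mixed list like this is what triggers the assertion failure
nodes = [BaseNode("ok"), None, "a plain string", 42]
bad = find_invalid_nodes(nodes)
print(bad)  # [2, 3] — the offending elements
```

Running a pass like this before the assertion tells you exactly which positions hold unexpected types, so you can trace back to where those values were created.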
By following these steps, you should be able to identify and resolve the issue causing the assertion error. If you find that a specific node type is causing the problem, you can trace back to where it is being created or modified and ensure it adheres to the expected type.
For more context, the `nodes` variable in the `_add_source_text` method is expected to be a list of `NodeWithScore` objects [1].
To continue talking to Dosu, mention @dosu.
Why does my code error happen?

--------------code--------------------

```python
storage_context = StorageContext.from_defaults(persist_dir="../storage/testnew")
```
--------------error------------

```
raise ValueError(f"Could not infer node type for data: {node_dict!s}")
ValueError: Could not infer node type for data {'label': 'text_chunk', 'embedding': [0.049393251550212604, ...........................
```
@dosu
The `ValueError` with the message "Could not infer node type for data" occurs because the `StorageContext` is unable to determine the type of node being added to the `vector_store` or `docstore`. This can happen if the node type is not properly defined or if the data being added does not match the expected structure.

To resolve this issue, ensure that the node type is correctly defined and that the data structure matches the expected format. Here is an example of how to properly create and add a `TextNode` to the `vector_store` and `docstore`:
```python
# Import paths for llama_index 0.10+; older versions exported these from llama_index directly
from llama_index.core import StorageContext
from llama_index.core.data_structs import IndexDict
from llama_index.core.schema import TextNode

# Create a TextNode with the required attributes
node = TextNode(text="test", embedding=[0.0, 0.0, 0.0])

# Initialize StorageContext with the specified directory
storage_context = StorageContext.from_defaults(persist_dir="../storage/testnew")

# Add the node to the vector_store and docstore
storage_context.vector_store.add([node])
storage_context.docstore.add_documents([node])

# Create and add an IndexDict to the index_store
index_struct = IndexDict()
storage_context.index_store.add_index_struct(index_struct)
```
Ensure that the `TextNode` class and the methods `add`, `add_documents`, and `add_index_struct` are correctly implemented and handle the node type properly. If the node type cannot be inferred, it may be due to a mismatch or a missing implementation in these methods.
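One way to narrow down "Could not infer node type" is to inspect the persisted JSON and check which type markers each record carries. Deserializers typically dispatch on a type-marker key, so a record without one (like the `{'label': 'text_chunk', ...}` dict in the error) cannot be mapped back to a node class. The marker names below (`class_name`, `__type__`) are assumptions about the serialized format and may differ between llama_index versions:

```python
# Hypothetical excerpt of a persisted store; real files are much larger
records = [
    {"class_name": "TextNode", "text": "ok"},       # has a type marker
    {"label": "text_chunk", "embedding": [0.04]},   # no type marker -> cannot infer node type
]

MARKER_KEYS = ("class_name", "__type__")  # assumed marker key names

for i, rec in enumerate(records):
    markers = [k for k in MARKER_KEYS if k in rec]
    status = "ok" if markers else "NO TYPE MARKER"
    print(f"record {i}: {status} {markers}")
```

A record flagged "NO TYPE MARKER" is a strong hint that the file was written by a different store (for example, a property graph store) than the one trying to read it.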
Additionally, if you encounter an error indicating that storage files are missing, you may need to create and persist a sample `VectorStore` before initializing the `StorageContext`; that is, make sure the necessary storage files are present in the specified directory first.
For example:
```python
# Legacy (pre-0.10) llama_index imports; newer versions moved these under llama_index.core
from llama_index import (
    BeautifulSoupWebReader,
    LangchainEmbedding,
    LLMPredictor,
    ServiceContext,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.llms import OpenAI  # LLMPredictor expects a LangChain LLM here

# Initialize the necessary components
embed_model = LangchainEmbedding(HuggingFaceEmbeddings())
llm_predictor = LLMPredictor(llm=OpenAI(temperature=0, model_name="text-davinci-002"))
storage_context = StorageContext.from_defaults(persist_dir="../storage/testnew")
service_context = ServiceContext.from_defaults(embed_model=embed_model, llm_predictor=llm_predictor)

# Load documents (`urls` must be defined elsewhere)
documents = BeautifulSoupWebReader().load_data(urls)

# Create and persist the index if it does not exist yet
try:
    index = load_index_from_storage(storage_context, "HackerNews")
except Exception:
    index = VectorStoreIndex.from_documents(
        documents, service_context=service_context, storage_context=storage_context
    )
    index.storage_context.persist()
```
This ensures that the necessary storage files are created and available for the `StorageContext` to function correctly [1][2].
To continue talking to Dosu, mention @dosu.
```
Request body: {'model': 'qwen-turbo', 'parameters': {'max_tokens': 1000, 'temperature': 0.1, 'seed': 1234, 'result_format': 'message'}, 'input': {'messages': [{'role': 'user', 'content': "Given some initial query, generate synonyms or related keywords up to 10 in total, considering possible cases of capitalization, pluralization, common expressions, etc.\nProvide all synonyms/keywords separated by '^' symbols: 'keyword1^keyword2^...'\nNote, result should be in one-line, separated by '^' symbols.----\nQUERY: 老陈和女主的一些事,请你进行总结\n----\nKEYWORDS: "}]}}
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): dashscope.aliyuncs.com:443
DEBUG:urllib3.connectionpool:https://dashscope.aliyuncs.com:443 "POST /api/v1/services/aigc/text-generation/generation HTTP/11" 200 None
DEBUG:dashscope:Response: {'output': {'choices': [{'finish_reason': 'stop', 'message': {'role': 'assistant', 'content': '老陈^女主^一些事情^总结^关系^情节^发展^互动^故事^事件'}}]}, 'usage': {'total_tokens': 117, 'output_tokens': 21, 'input_tokens': 96}, 'request_id': 'ef9dd55c-ffa6-9548-bbdb-ae05265b9542'}
```
This is what I see in the console. What is this step doing? Why is my question first summarized into synonyms by the large model, and what is the purpose of this for the later steps? @dosu
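Judging from the prompt in the log above, this is a query-expansion step: before retrieval, the LLM is asked for up to 10 synonyms/related keywords so that retrieval can match text that uses different wording than the original question. The `'^'`-separated answer is then split into a keyword list, roughly like this (a sketch of the parsing only, not the library's actual code):

```python
# The LLM's one-line answer from the log above
response = "老陈^女主^一些事情^总结^关系^情节^发展^互动^故事^事件"

# Split on '^' to recover the individual search keywords
keywords = [kw.strip() for kw in response.split("^") if kw.strip()]
print(keywords)
print(len(keywords))  # 10, matching the "up to 10 in total" instruction
```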
Question Validation
Question
When I use the `PropertyGraphIndex` to persist to disk, the other files such as `default__vector_store.json` are saved as UTF-8 because `json.dump(..., ensure_ascii=True)` is used when `vector_store.persist()` is called. But the property graph's `persist()` uses `model_dump_json()`, and `model_dump_json()` does not have an `ensure_ascii=True` parameter to convert the Chinese characters to the `\u6235`-style format. How can I solve this problem? Only the `property_graph_store.json` file cannot be saved as UTF-8.