run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Bug]: Failed to load persisted PropertyGraph data with ValueError #15798

Open frontier-repository opened 2 weeks ago

frontier-repository commented 2 weeks ago

Bug Description

Description:

I encountered an issue while trying the PropertyGraph feature in LlamaIndex with Azure OpenAI GPT-4o mini. Loading persisted data with the load_index_from_storage function fails, even though the data was persisted successfully with the persist function of StorageContext.

Expected Behavior:

The persisted PropertyGraph data should be loaded without any errors.

Environment:

Example Code:

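import os

from llama_index.core import (
    PropertyGraphIndex,
    Settings,
    StorageContext,
    load_index_from_storage,
)
from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding
from llama_index.llms.azure_openai import AzureOpenAI

# `documents` and `output_folder` are defined elsewhere and are not shown in this report.
os.environ["OPENAI_API_KEY"] = "****"
os.environ["AZURE_OPENAI_ENDPOINT"] = "https://****.openai.azure.com"
os.environ["OPENAI_API_VERSION"] = "2024-06-01"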

Settings.llm = AzureOpenAI(
    deployment_name="****",
    temperature=0.2,
)

Settings.embed_model = AzureOpenAIEmbedding(
    model="text-embedding-3-small",
    deployment_name="****",
)

index = PropertyGraphIndex.from_documents(
    documents,
    show_progress=True,
)

index.storage_context.persist(persist_dir=output_folder)

storage_context = StorageContext.from_defaults(persist_dir=output_folder)
index = load_index_from_storage(storage_context=storage_context)

Additional Context:

The issue appears to be related to the structure of the node data in the persisted file. Any insights or potential fixes would be greatly appreciated.
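One way to check the node structure is to inspect the persisted file directly. The record in the traceback below carries only `label`, `embedding`, and `properties` at its top level. The following is a minimal sketch, not the library's loader: it assumes the graph store is persisted as JSON with a top-level `nodes` mapping, that the file is named `property_graph_store.json`, and that a loader would tell node types apart by a top-level `name` or `text` key. All three are assumptions, and the helper names are hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_node_keys(graph: dict) -> Counter:
    """Count how many nodes share each set of top-level keys."""
    shapes = Counter()
    for node in graph.get("nodes", {}).values():
        shapes[tuple(sorted(node))] += 1
    return shapes


def find_untyped_nodes(graph: dict, markers=("name", "text")) -> list:
    """Return IDs of nodes that carry none of the assumed discriminating keys."""
    return [
        node_id
        for node_id, node in graph.get("nodes", {}).items()
        if not any(key in node for key in markers)
    ]


def load_graph(persist_dir: str, filename: str = "property_graph_store.json") -> dict:
    """Read the persisted graph store; the default file name is an assumption."""
    return json.loads(Path(persist_dir, filename).read_text(encoding="utf-8"))
```

Calling `find_untyped_nodes(load_graph(output_folder))` would list the nodes that match neither assumed shape, which narrows down whether the problem lies in how the data was written or in how it is read back.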

Version

0.11.4

Steps to Reproduce

  1. Persist data using the persist function in StorageContext.

  2. Attempt to load the persisted data using the load_index_from_storage function with the following code:

    storage_context = StorageContext.from_defaults(persist_dir=output_folder)
    index = load_index_from_storage(storage_context=storage_context)

  3. Observe the error.

Relevant Logs/Tracebacks

Loading fails with a `ValueError` stating that the node type could not be inferred for the given data. The full traceback follows:

ValueError: Could not infer node type for data: {'label': 'text_chunk', 'embedding': [0.015958545729517937, ****], 'properties': {'file_name': 'input.docx', 'file_path': 'data\\input.docx', 'file_type': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'file_size': 269519, 'creation_date': '2024-08-30', 'last_modified_date': '2024-08-30', '_node_content': '{"id_": "9f4c67b8-b055-4caa-9e52-5d8ffa6d3c11", "embedding": null, "metadata": {"file_name": "input.docx", "file_path": "data\\\\input.docx", "file_type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document", "file_size": 269519, "creation_date": "2024-08-30", "last_modified_date": "2024-08-30"}, "excluded_embed_metadata_keys": ["file_name", "file_type", "file_size", "creation_date", "last_modified_date", "last_accessed_date"], "excluded_llm_metadata_keys": ["file_name", "file_type", "file_size", "creation_date", "last_modified_date", "last_accessed_date"], "relationships": {"1": {"node_id": "a655e0cd-ba06-47a8-9aa1-5341cc0e7446", "node_type": "4", "metadata": {"file_name": "input.docx", "file_path": "data\\\\input.docx", "file_type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document", "file_size": 269519, "creation_date": "2024-08-30", "last_modified_date": "2024-08-30"}, "hash": "b8565428e99f410e8069f23c8b8d5661bb90216ddc5b5b4b3bc594d7e155f404", "class_name": "RelatedNodeInfo"}, "3": {"node_id": "1cf26f1a-2592-4d47-9ecf-93b320d47567", "node_type": "1", "metadata": {}, "hash": "e86e358e00b2bb463ccd26e8565c0afb89eac72f3e75e877772c30cb048c94a4", "class_name": "RelatedNodeInfo"}}, "text": "", "mimetype": "text/plain", "start_char_idx": 0, "end_char_idx": 1345, "text_template": "{metadata_str}\\n\\n{content}", "metadata_template": "{key}: {value}", "metadata_seperator": "\\n", "class_name": "TextNode"}', '_node_type': 'TextNode', 'document_id': 'a655e0cd-ba06-47a8-9aa1-5341cc0e7446', 'doc_id': 'a655e0cd-ba06-47a8-9aa1-5341cc0e7446', 'ref_doc_id': 'a655e0cd-ba06-47a8-9aa1-5341cc0e7446'}}

dosubot[bot] commented 2 weeks ago

Hello @frontier-repository! I'm here to assist you with any bugs, questions, or contributions you may have.

I found a similar closed issue that might be relevant to your problem: [Question]: Can not load my index created locally. The suggested solution in that thread was to ensure that when using a vector database, all data is persisted in the vector store alone, and there is no need to call persist or load separately. To connect to an existing index created in a vector store like Chroma, you can use the following code:

vector_store = ...  # placeholder: an existing ChromaVectorStore instance
index = VectorStoreIndex.from_vector_store(vector_store, service_context=service_context)

For your specific issue, the ValueError you're encountering indicates that the node type could not be inferred from the persisted data. Here are the specific requirements for the load_index_from_storage function to successfully infer the node type from the persisted data:

  1. Storage Context: Ensure the StorageContext is correctly initialized from the persisted directory.

    storage_context = StorageContext.from_defaults(persist_dir=output_folder)

  2. Object Node Mapping: If an object_node_mapping is not provided, the function attempts to load it from the persisted directory, assuming a SimpleObjectNodeMapping.

    object_node_mapping = SimpleObjectNodeMapping.from_persist_dir(persist_dir=output_folder)

  3. Single Index Constraint: The storage context should contain only a single index. If multiple indices are present, the function will raise a ValueError.

  4. Service Context: Provide the service_context so that the loaded index can operate correctly.

    new_index = load_index_from_storage(storage_context=storage_context, service_context=mock_service_context)

  5. Persistence of Object Node Mapping: The object_node_mapping must be persisted correctly. If the persistence fails, a warning is issued, and the user must reconstruct the same object node mapping to build the ObjectIndex.
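
The single-index constraint in item 3 can be checked before loading. The sketch below is a rough illustration that relies only on the standard library; it assumes the index store is persisted as `index_store.json` with the index IDs stored under a top-level `index_store/data` key. Both the file name and the key are assumptions about the on-disk layout, and the helper names are hypothetical.

```python
import json
from pathlib import Path


def list_index_ids(index_store: dict) -> list:
    """Return the index IDs recorded in a persisted index store dictionary."""
    # "index_store/data" is the assumed top-level key; a missing key yields no IDs.
    return sorted(index_store.get("index_store/data", {}))


def check_single_index(persist_dir: str, filename: str = "index_store.json") -> str:
    """Return the only index ID in persist_dir, or raise if there is not exactly one."""
    store = json.loads(Path(persist_dir, filename).read_text(encoding="utf-8"))
    index_ids = list_index_ids(store)
    if len(index_ids) != 1:
        raise ValueError(
            f"Expected exactly one index, found {len(index_ids)}: {index_ids}"
        )
    return index_ids[0]
```

If the directory does hold several indices, load_index_from_storage accepts an index_id argument to select one of them.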

Given these requirements, please ensure that your persisted data meets these conditions. If the issue persists, you might want to check the structure of your node data in the persisted file. The expected structure of the node data for the PropertyGraph feature is defined by the props_schema parameter in the get_store function:

props_schema: str = "`file_path` STRING, `file_name` STRING, `file_type` STRING, `file_size` INT, `creation_date` STRING, `last_modified_date` STRING, `_node_content` STRING, `_node_type` STRING, `document_id` STRING, `doc_id` STRING, `ref_doc_id` STRING, `triplet_source_id` STRING"
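
To compare a persisted node against this schema, the string can be parsed into a name-to-type mapping and checked against the node's `properties` dictionary. This is a minimal sketch that assumes the schema keeps the backtick-quoted `name TYPE` format shown above; the helper names are hypothetical and are not part of LlamaIndex.

```python
import re

PROPS_SCHEMA = (
    "`file_path` STRING, `file_name` STRING, `file_type` STRING, "
    "`file_size` INT, `creation_date` STRING, `last_modified_date` STRING, "
    "`_node_content` STRING, `_node_type` STRING, `document_id` STRING, "
    "`doc_id` STRING, `ref_doc_id` STRING, `triplet_source_id` STRING"
)


def parse_props_schema(schema: str) -> dict:
    """Map each backtick-quoted property name to its declared type."""
    return dict(re.findall(r"`([^`]+)`\s+(\w+)", schema))


def missing_properties(properties: dict, schema: str = PROPS_SCHEMA) -> list:
    """List the schema properties absent from a node's properties, in schema order."""
    return [name for name in parse_props_schema(schema) if name not in properties]
```

Applied to the property keys of the node in the traceback above, this reports only `triplet_source_id` as absent. Whether that property is required for loading is not established by this thread.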

Ensure that each node in your persisted data has the following properties:

Additionally, the persisted file may include other metadata and relationships for each node, such as:

By ensuring these conditions and structures, you should be able to load the persisted PropertyGraph data without encountering the ValueError [1][2][3][4].
