run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Question]: Refresh hash after metadata update #8602

Closed: matthias closed this issue 8 months ago

matthias commented 11 months ago

Question

I would like to use refresh_ref_docs to update metadata, but the hash isn't updated after I change the metadata.

This could be a feature request or a bug report, or am I missing something?

ingest.py

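from llama_index import SimpleDirectoryReader, VectorStoreIndex

# data_dir and persist_dir are paths defined elsewhere.
# filename_as_id=True derives each document ID from its file name,
# so later runs can match documents against the stored ones.
documents = SimpleDirectoryReader(data_dir, filename_as_id=True).load_data()

# Build the index and persist it.
index = VectorStoreIndex.from_documents(documents)
index.storage_context.persist(persist_dir=persist_dir)

# Refresh with the unchanged documents and persist again.
index.refresh_ref_docs(documents)
index.storage_context.persist(persist_dir=persist_dir)

print(index.ref_doc_info)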

update_metadata.py

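from llama_index import (
    SimpleDirectoryReader,
    StorageContext,
    load_index_from_storage,
)

documents = SimpleDirectoryReader(data_dir, filename_as_id=True).load_data()

# Load the index persisted by ingest.py.
storage_context = StorageContext.from_defaults(persist_dir=persist_dir)
index = load_index_from_storage(storage_context)

# Change the metadata after the document has been constructed.
documents[0].metadata["example"] = "foo"

# Expectation: the changed document is detected and its nodes are updated.
index.refresh_ref_docs(documents)
index.storage_context.persist(persist_dir=persist_dir)

print(index.ref_doc_info)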

=> This doesn't update the nodes, because the hash isn't recalculated after I change the metadata.

Is this possible? Otherwise, a Document.update_hash() method would be helpful, or an update_hash() call inside refresh_ref_docs().
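
To make the suspected behaviour concrete, here is a minimal, self-contained sketch. It does not use llama_index: SketchDocument and its update_hash() method are hypothetical stand-ins, and it assumes the hash is computed once at construction from the text plus the metadata. Under that assumption, changing the metadata afterwards leaves the old hash in place until it is recomputed explicitly.

```python
from dataclasses import dataclass, field
from hashlib import sha256


def content_hash(text: str, metadata: dict) -> str:
    """Hash the text together with the metadata."""
    return sha256((text + str(metadata)).encode("utf-8")).hexdigest()


@dataclass
class SketchDocument:
    """Hypothetical stand-in, NOT the llama_index Document class."""

    text: str
    metadata: dict = field(default_factory=dict)
    hash: str = ""

    def __post_init__(self) -> None:
        # Assumption: the hash is computed once, when the object is created.
        self.update_hash()

    def update_hash(self) -> None:
        # Hypothetical helper: recompute the hash from the current state.
        self.hash = content_hash(self.text, self.metadata)


doc = SketchDocument(text="hello")
original_hash = doc.hash

# Mutating the metadata in place does not touch the stored hash.
doc.metadata["example"] = "foo"
stale = doc.hash == original_hash

# An explicit recompute picks up the metadata change.
doc.update_hash()
fresh = doc.hash != original_hash

print(stale, fresh)
```

If the real Document behaves like this sketch, a hash comparison made right after the metadata change would see no difference, which would match what I observe.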

dosubot[bot] commented 11 months ago

🤖

Hello @matthias,

I'm here to assist you with your question about the LlamaIndex repository.

From what I understand, you want the stored hash to change after you modify a document's metadata. The refresh_ref_docs() method in the BaseIndex class is designed to handle this: it compares the hash stored in the docstore with the hash of the incoming document. If the document is not stored yet, it is inserted. If the two hashes differ, update_ref_doc() is called to update the document and its corresponding nodes, which also replaces the stored hash. Note that the comparison reads document.hash, so a change is only detected if that value reflects the modified metadata.

Here is the relevant code:

def refresh_ref_docs(
    self, documents: Sequence[Document], **update_kwargs: Any
) -> List[bool]:
    """Refresh an index with documents that have changed.
    This allows users to save LLM and Embedding model calls, while only
    updating documents that have any changes in text or metadata. It
    will also insert any documents that previously were not stored.
    """
    with self._service_context.callback_manager.as_trace("refresh"):
        refreshed_documents = [False] * len(documents)
        for i, document in enumerate(documents):
            existing_doc_hash = self._docstore.get_document_hash(
                document.get_doc_id()
            )
            if existing_doc_hash is None:
                self.insert(document, **update_kwargs.pop("insert_kwargs", {}))
                refreshed_documents[i] = True
            elif existing_doc_hash != document.hash:
                self.update_ref_doc(
                    document, **update_kwargs.pop("update_kwargs", {})
                )
                refreshed_documents[i] = True

        return refreshed_documents

You can find this in the llama_index/indices/base.py file.
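
The comparison logic can be sketched without llama_index. In the example below, the refresh() function, the dictionary used as a docstore, and the (doc_id, text, metadata) tuples are all illustrative assumptions rather than the library's API. Unlike the method above, this sketch recomputes the hash at comparison time instead of reading a precomputed document.hash attribute.

```python
from hashlib import sha256
from typing import Dict, List, Tuple


def content_hash(text: str, metadata: dict) -> str:
    """Hash the text together with the metadata."""
    return sha256((text + str(metadata)).encode("utf-8")).hexdigest()


def refresh(
    stored_hashes: Dict[str, str], docs: List[Tuple[str, str, dict]]
) -> List[bool]:
    """Return one flag per document: True if it was inserted or updated."""
    refreshed = []
    for doc_id, text, metadata in docs:
        new_hash = content_hash(text, metadata)
        # A missing entry (None) or a different hash both count as a change.
        changed = stored_hashes.get(doc_id) != new_hash
        if changed:
            stored_hashes[doc_id] = new_hash
        refreshed.append(changed)
    return refreshed


store: Dict[str, str] = {}
first = refresh(store, [("a.txt", "hello", {})])                   # new document
second = refresh(store, [("a.txt", "hello", {})])                  # unchanged
third = refresh(store, [("a.txt", "hello", {"example": "foo"})])   # metadata changed
print(first, second, third)
```

Because the hash is derived from the current text and metadata on every call, the metadata change in the third call is detected. Whether the same holds in the library depends on when document.hash is computed.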

Therefore, there is no need for a Document.update_hash() method or an update_hash() inside refresh_ref_docs(). The existing methods already handle the hash update.

I hope this helps! If you have any other questions or need further clarification, feel free to ask.

Best, Dosu

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.


dosubot[bot] commented 8 months ago

Hi, @matthias,

I'm helping the LlamaIndex team manage our backlog and am marking this issue as stale. As I understand it, the issue reports that the hash is not updated when metadata is changed and refresh_ref_docs is then called. In my earlier response I explained that the refresh_ref_docs() method in the BaseIndex class is designed to handle hash updates when metadata changes, so no additional methods should be needed.

Could you please confirm if this issue is still relevant to the latest version of the LlamaIndex repository? If it is, please let the LlamaIndex team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or the issue will be automatically closed in 7 days.

Thank you for your understanding and cooperation.