run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

index.delete(doc_id) #1286

Closed daveckw closed 1 year ago

daveckw commented 1 year ago

index.delete(doc_id) only removes the entry from "doc_id_dict"; the actual document is not deleted from the docstore.

    {
      "index_struct": {
        "__type__": "simple_dict",
        "__data__": {
          "index_id": "e8e73055-3108-404b-af48-ce36988caaca",
          "summary": null,
          "nodes_dict": {
            "076604ba-4cfb-4f39-83e8-980934292b47": "076604ba-4cfb-4f39-83e8-980934292b47",
            "26b07426-fd7d-4d02-be40-5659672d790b": "26b07426-fd7d-4d02-be40-5659672d790b",
            "a329d897-5577-4603-bd63-fb949b7c8312": "a329d897-5577-4603-bd63-fb949b7c8312"
          },
          "doc_id_dict": {
            "doc_id_Hugoz Project2.txt": ["076604ba-4cfb-4f39-83e8-980934292b47"],
            "doc_id_IQI Exsim Policy2.txt": ["26b07426-fd7d-4d02-be40-5659672d790b"],
            "doc_id_Dave Chong2.txt": ["a329d897-5577-4603-bd63-fb949b7c8312"]
          },
          "embeddings_dict": {}
        }
      },
      "docstore": {
        "docs": {
          "076604ba-4cfb-4f39-83e8-980934292b47": {
            "text": "Hugoz KLCC Project Information:\n\nLaunch Date: APDL expected Q1 2023\nLand Area: 0.867 acres\nNumber of Blocks: 1 Tower\n\nTotal : 674 units\nNumber of Units\nNon HDA units : 354 units\nHDA units : 320 units\n\nNumber of Floors: 46 levels\nFreehold / Leasehold: Freehold\n\nCompletion Date: 48 months from A

For example, after I call index.delete("doc_id_Hugoz Project2.txt"), the doc 076604ba-4cfb-4f39-83e8-980934292b47 is still there in the docstore.

May I know how to delete all the related docs from the docstore? Thank you.
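To make the mismatch concrete: in the dump above, the docstore is keyed by node ID, while "doc_id_dict" maps each doc_id to the node IDs built from it. Below is a minimal sketch of a full cleanup on a plain dict with that same shape. It is not the library's API: `remove_document` is a hypothetical helper, and the IDs and text in the sample data are placeholders.

```python
from typing import Any, Dict, List


def remove_document(saved: Dict[str, Any], doc_id: str) -> List[str]:
    """Remove doc_id and every node it owns from a dict shaped like the dump.

    Hypothetical helper for illustration; it only edits plain dicts.
    Returns the node IDs that were removed.
    """
    data = saved["index_struct"]["__data__"]
    # doc_id_dict maps a document ID to the IDs of the nodes built from it.
    node_ids = data["doc_id_dict"].pop(doc_id, [])
    for node_id in node_ids:
        data["nodes_dict"].pop(node_id, None)
        # The docstore is keyed by node ID, not by doc_id.
        saved["docstore"]["docs"].pop(node_id, None)
    return node_ids


# Placeholder data with the same layout as the dump (IDs shortened).
saved = {
    "index_struct": {
        "__type__": "simple_dict",
        "__data__": {
            "nodes_dict": {"node-a": "node-a", "node-b": "node-b"},
            "doc_id_dict": {
                "doc_id_A.txt": ["node-a"],
                "doc_id_B.txt": ["node-b"],
            },
            "embeddings_dict": {},
        },
    },
    "docstore": {
        "docs": {
            "node-a": {"text": "placeholder text A"},
            "node-b": {"text": "placeholder text B"},
        }
    },
}

removed = remove_document(saved, "doc_id_A.txt")
print(removed)
print(sorted(saved["docstore"]["docs"]))
```

The point of the sketch is the lookup order: the node IDs have to be read from "doc_id_dict" before that entry is dropped, otherwise there is nothing left to tell you which docstore entries belonged to the document.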

hanchchch commented 1 year ago

Hi, I faced the same issue. I wonder if you were using GPTVectorStoreIndex, because I found that the method GPTVectorStoreIndex._delete only removes the doc_id from its index_struct and vector_store:

    def _delete(self, doc_id: str, **delete_kwargs: Any) -> None:
        """Delete a document."""
        self._index_struct.delete(doc_id)
        self._vector_store.delete(doc_id)

I think it should also delete it from the docstore:

    def _delete(self, doc_id: str, **delete_kwargs: Any) -> None:
        """Delete a document."""
        self._docstore.delete_document(doc_id)
        self._index_struct.delete(doc_id)
        self._vector_store.delete(doc_id)

Am I right? Should I make a PR about this?

logan-markewich commented 1 year ago

This should be fixed now. See this page for a detailed guide and usage:

https://gpt-index.readthedocs.io/en/latest/how_to/index_structs/document_management.html

Erik-M-Larsson commented 1 year ago

https://gpt-index.readthedocs.io/en/latest/how_to/index_structs/document_management.html

I get 404 Not found for this page.

bhanson-techempower commented 1 year ago

> https://gpt-index.readthedocs.io/en/latest/how_to/index_structs/document_management.html
>
> I get 404 Not found for this page.

The docs were recently refactored; here's the updated link:

https://gpt-index.readthedocs.io/en/latest/how_to/index/document_management.html

logan-markewich commented 1 year ago

Thanks for the updated link @bhanson-techempower

Since the documentation describes an option to delete from the docstore, I'm going to close this for now. The option is false by default because many indexes can share the same docstore. Even when it is false, delete stops the document from being used in queries, because the doc_id is removed from the index_struct.

Feel free to reopen though!
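As a rough illustration of why the default described above leaves the docstore alone, here is a toy model of two indexes sharing one docstore. It is a sketch under stated assumptions, not the LlamaIndex implementation: `ToyIndex`, its methods, and the `delete_from_docstore` parameter name are hypothetical here, so check the linked guide for the exact call.

```python
from typing import Dict, List


class ToyIndex:
    """Hypothetical stand-in for an index; not the LlamaIndex implementation."""

    def __init__(self, docstore: Dict[str, str]) -> None:
        self.docstore = docstore  # may be shared with other indexes
        self.doc_id_dict: Dict[str, List[str]] = {}  # doc_id -> node IDs

    def insert(self, doc_id: str, node_id: str, text: str) -> None:
        self.docstore[node_id] = text
        self.doc_id_dict.setdefault(doc_id, []).append(node_id)

    def delete(self, doc_id: str, delete_from_docstore: bool = False) -> None:
        # Always drop the index's own reference to the document.
        node_ids = self.doc_id_dict.pop(doc_id, [])
        # Only touch the (possibly shared) docstore when asked to.
        if delete_from_docstore:
            for node_id in node_ids:
                self.docstore.pop(node_id, None)

    def query_candidates(self) -> List[str]:
        # Only nodes still referenced by this index can be retrieved.
        return [
            self.docstore[node_id]
            for node_ids in self.doc_id_dict.values()
            for node_id in node_ids
            if node_id in self.docstore
        ]


shared: Dict[str, str] = {}
index_a = ToyIndex(shared)
index_b = ToyIndex(shared)
index_a.insert("doc1", "n1", "hello")
index_b.insert("doc1", "n1", "hello")

# Default delete: index_a stops using the doc, index_b is unaffected.
index_a.delete("doc1")
still_stored = "n1" in shared
a_results = index_a.query_candidates()
b_results = index_b.query_candidates()

# Explicit docstore delete: the text is gone for every index sharing the store.
index_b.delete("doc1", delete_from_docstore=True)
stored_after_flag = "n1" in shared
```

With the default, the text stays in the store but `index_a` can no longer retrieve it, which matches the behaviour described in this thread. Passing the flag removes the text for every index that shares the store, which is why it is opt-in.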