run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License
36.37k stars · 5.19k forks

[Bug]: RAPTOR failed due to duplicated Document ID #11594

Closed — mw19930312 closed this 1 month ago

mw19930312 commented 8 months ago

Bug Description

I'm following https://github.com/run-llama/llama_index/blob/main/llama-index-packs/llama-index-packs-raptor/examples/raptor.ipynb to run a few RAPTOR-style experiments of my own. However, the code occasionally fails with a duplicated Document ID error when creating the index. I've provided the code below, where llama_index_documents is a list of Document objects that looks like

[
Document(id_='dc2247a2-3a75-4fa1-baa0-4ce074cae205', embedding=None, metadata={'header': 'xxx Inc.'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='TABLE OF CONTENTS\nAccess Control Policy2 Asset Management Policy8 ', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n'),

 Document(id_='69ca22a5-37d7-4132-b83b-e631d0992ada', embedding=None, metadata={'header': 'Access Control Policy'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='Policy Owner: \nEffective Date: 01/28/2024', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n')
]

Then, the error occurs when creating the RAPTOR index.

Version

0.10.15

Steps to Reproduce

from llama_index.core.schema import Document

def convert_to_llama_index_document(parsed_google_docs):
    documents = []
    for doc in parsed_google_docs:
        if len(doc.metadata['header_structure']) > 0:
            header = doc.metadata['header_structure'][-1]
        else:
            header = ''
        documents.append(
            Document(
                text=doc.content,
                metadata={"header": header}
            )
        )
    return documents

llama_index_documents = convert_to_llama_index_document(parsed_google_docs)

raptor_pack_google_doc = RaptorPack(
    llama_index_documents,
    embed_model=OpenAIEmbedding(model="text-embedding-ada-002"),  # used for embedding clusters
    llm=OpenAI(model="gpt-3.5-turbo", temperature=0),  # used for generating summaries
    vector_store=vector_store,  # used for storage
    similarity_top_k=5,  # top k for each layer, or overall top-k for collapsed
    mode="tree_traversal",  # sets default mode
    transformations=[SentenceSplitter(chunk_size=400, chunk_overlap=50)],  # transformations applied for ingestion
)
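As a workaround while the underlying issue is open, one option is to deduplicate each node batch by ID before it reaches a store that rejects duplicates (such as Chroma). A minimal sketch, using a hypothetical helper that is not part of llama_index:

```python
# Hypothetical helper (not part of llama_index): keep only the first
# occurrence of each node ID in a batch, preserving insertion order.
# A store like Chroma raises DuplicateIDError if the same ID appears twice.
def dedupe_by_id(nodes):
    """nodes: iterable of (node_id, payload) pairs."""
    seen = set()
    unique = []
    for node_id, payload in nodes:
        if node_id not in seen:
            seen.add(node_id)
            unique.append((node_id, payload))
    return unique
```

The same pattern could be applied to llama_index nodes by keying on `node.node_id` before calling `insert_nodes`.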

Relevant Logs/Tracebacks

{
    "name": "DuplicateIDError",
    "message": "Expected IDs to be unique, found duplicates of: 75a3039f-51df-4c1c-9ef2-0f45a08e6cd7, 22cd71d3-134e-44dc-b529-99a62a7bec44, 8ded7cce-14e7-4984-a758-f3f81cc898dc",
    "stack": "---------------------------------------------------------------------------
DuplicateIDError                          Traceback (most recent call last)
Cell In[72], line 1
----> 1 raptor_pack_google_doc = RaptorPack(
      2     llama_index_documents,
      3     embed_model=OpenAIEmbedding(
      4         model=\"text-embedding-ada-002\"
      5     ),  # used for embedding clusters
      6     llm=OpenAI(model=\"gpt-3.5-turbo\", temperature=0),  # used for generating summaries
      7     vector_store=vector_store,  # used for storage
      8     similarity_top_k=5,  # top k for each layer, or overall top-k for collapsed
      9     mode=\"tree_traversal\",  # sets default mode
     10     transformations=[
     11         SentenceSplitter(chunk_size=400, chunk_overlap=50)
     12     ],  # transformations applied for ingestion
     13 )

File ~/Desktop/raptor/.venv/lib/python3.11/site-packages/llama_index/packs/raptor/base.py:343, in RaptorPack.__init__(self, documents, llm, embed_model, vector_store, similarity_top_k, mode, verbose, **kwargs)
    331 def __init__(
    332     self,
    333     documents: List[BaseNode],
   (...)
    340     **kwargs: Any,
    341 ) -> None:
    342     \"\"\"Init params.\"\"\"
--> 343     self.retriever = RaptorRetriever(
    344         documents,
    345         embed_model=embed_model,
    346         llm=llm,
    347         similarity_top_k=similarity_top_k,
    348         vector_store=vector_store,
    349         mode=mode,
    350         verbose=verbose,
    351         **kwargs,
    352     )

File ~/Desktop/raptor/.venv/lib/python3.11/site-packages/llama_index/packs/raptor/base.py:134, in RaptorRetriever.__init__(self, documents, tree_depth, similarity_top_k, llm, embed_model, vector_store, transformations, summary_module, existing_index, mode, **kwargs)
    131 self.similarity_top_k = similarity_top_k
    133 if len(documents) > 0:
--> 134     asyncio.run(self.insert(documents))

File ~/Desktop/raptor/.venv/lib/python3.11/site-packages/nest_asyncio.py:30, in _patch_asyncio.<locals>.run(main, debug)
     28 task = asyncio.ensure_future(main)
     29 try:
---> 30     return loop.run_until_complete(task)
     31 finally:
     32     if not task.done():

File ~/Desktop/raptor/.venv/lib/python3.11/site-packages/nest_asyncio.py:98, in _patch_loop.<locals>.run_until_complete(self, future)
     95 if not f.done():
     96     raise RuntimeError(
     97         'Event loop stopped before Future completed.')
---> 98 return f.result()

File /opt/homebrew/Cellar/python@3.11/3.11.7_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/asyncio/futures.py:203, in Future.result(self)
    201 self.__log_traceback = False
    202 if self._exception is not None:
--> 203     raise self._exception.with_traceback(self._exception_tb)
    204 return self._result

File /opt/homebrew/Cellar/python@3.11/3.11.7_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/asyncio/tasks.py:277, in Task.__step(***failed resolving arguments***)
    273 try:
    274     if exc is None:
    275         # We use the `send` method directly, because coroutines
    276         # don't have `__iter__` and `__next__` methods.
--> 277         result = coro.send(None)
    278     else:
    279         result = coro.throw(exc)

File ~/Desktop/raptor/.venv/lib/python3.11/site-packages/llama_index/packs/raptor/base.py:220, in RaptorRetriever.insert(self, documents)
    217         node.embedding = id_to_embedding[node.id_]
    218         nodes_with_embeddings.append(node)
--> 220 self.index.insert_nodes(nodes_with_embeddings)
    222 # set the current nodes to the new nodes
    223 cur_nodes = new_nodes

File ~/Desktop/raptor/.venv/lib/python3.11/site-packages/llama_index/core/indices/vector_store/base.py:320, in VectorStoreIndex.insert_nodes(self, nodes, **insert_kwargs)
    313 def insert_nodes(self, nodes: Sequence[BaseNode], **insert_kwargs: Any) -> None:
    314     \"\"\"Insert nodes.
    315 
    316     NOTE: overrides BaseIndex.insert_nodes.
    317         VectorStoreIndex only stores nodes in document store
    318         if vector store does not store text
    319     \"\"\"
--> 320     self._insert(nodes, **insert_kwargs)
    321     self._storage_context.index_store.add_index_struct(self._index_struct)

File ~/Desktop/raptor/.venv/lib/python3.11/site-packages/llama_index/core/indices/vector_store/base.py:311, in VectorStoreIndex._insert(self, nodes, **insert_kwargs)
    309 def _insert(self, nodes: Sequence[BaseNode], **insert_kwargs: Any) -> None:
    310     \"\"\"Insert a document.\"\"\"
--> 311     self._add_nodes_to_index(self._index_struct, nodes, **insert_kwargs)

File ~/Desktop/raptor/.venv/lib/python3.11/site-packages/llama_index/core/indices/vector_store/base.py:233, in VectorStoreIndex._add_nodes_to_index(self, index_struct, nodes, show_progress, **insert_kwargs)
    231 for nodes_batch in iter_batch(nodes, self._insert_batch_size):
    232     nodes_batch = self._get_node_with_embedding(nodes_batch, show_progress)
--> 233     new_ids = self._vector_store.add(nodes_batch, **insert_kwargs)
    235     if not self._vector_store.stores_text or self._store_nodes_override:
    236         # NOTE: if the vector store doesn't store text,
    237         # we need to add the nodes to the index struct and document store
    238         for node, new_id in zip(nodes_batch, new_ids):
    239             # NOTE: remove embedding from node to avoid duplication

File ~/Desktop/raptor/.venv/lib/python3.11/site-packages/llama_index/vector_stores/chroma/base.py:250, in ChromaVectorStore.add(self, nodes, **add_kwargs)
    247         ids.append(node.node_id)
    248         documents.append(node.get_content(metadata_mode=MetadataMode.NONE))
--> 250     self._collection.add(
    251         embeddings=embeddings,
    252         ids=ids,
    253         metadatas=metadatas,
    254         documents=documents,
    255     )
    256     all_ids.extend(ids)
    258 return all_ids

File ~/Desktop/raptor/.venv/lib/python3.11/site-packages/chromadb/api/models/Collection.py:146, in Collection.add(self, ids, embeddings, metadatas, documents, images, uris)
    104 def add(
    105     self,
    106     ids: OneOrMany[ID],
   (...)
    116     uris: Optional[OneOrMany[URI]] = None,
    117 ) -> None:
    118     \"\"\"Add embeddings to the data store.
    119     Args:
    120         ids: The ids of the embeddings you wish to add
   (...)
    136 
    137     \"\"\"
    139     (
    140         ids,
    141         embeddings,
    142         metadatas,
    143         documents,
    144         images,
    145         uris,
--> 146     ) = self._validate_embedding_set(
    147         ids, embeddings, metadatas, documents, images, uris
    148     )
    150     # We need to compute the embeddings if they're not provided
    151     if embeddings is None:
    152         # At this point, we know that one of documents or images are provided from the validation above

File ~/Desktop/raptor/.venv/lib/python3.11/site-packages/chromadb/api/models/Collection.py:545, in Collection._validate_embedding_set(self, ids, embeddings, metadatas, documents, images, uris, require_embeddings_or_data)
    523 def _validate_embedding_set(
    524     self,
    525     ids: OneOrMany[ID],
   (...)
    543     Optional[URIs],
    544 ]:
--> 545     valid_ids = validate_ids(maybe_cast_one_to_many_ids(ids))
    546     valid_embeddings = (
    547         validate_embeddings(
    548             self._normalize_embeddings(maybe_cast_one_to_many_embedding(embeddings))
   (...)
    551         else None
    552     )
    553     valid_metadatas = (
    554         validate_metadatas(maybe_cast_one_to_many_metadata(metadatas))
    555         if metadatas is not None
    556         else None
    557     )

File ~/Desktop/raptor/.venv/lib/python3.11/site-packages/chromadb/api/types.py:255, in validate_ids(ids)
    251         example_string = (
    252             f\"{', '.join(examples[:5])}, ..., {', '.join(examples[-5:])}\"
    253         )
    254         message = f\"Expected IDs to be unique, found {n_dups} duplicated IDs: {example_string}\"
--> 255     raise errors.DuplicateIDError(message)
    256 return ids

DuplicateIDError: Expected IDs to be unique, found duplicates of: 75a3039f-51df-4c1c-9ef2-0f45a08e6cd7, 22cd71d3-134e-44dc-b529-99a62a7bec44, 8ded7cce-14e7-4984-a758-f3f81cc898dc"
}
dosubot[bot] commented 8 months ago

Hey there, @mw19930312! 🚀 Long time no bug, huh? 😄 I remember we've danced around some code together before! Taking a dive into your issue now and will have some insights for you shortly. Hang tight!

logan-markewich commented 8 months ago

Ah, I think this might be due to how RAPTOR allows nodes to belong to more than one cluster (which would explain why it only happens sometimes)

logan-markewich commented 8 months ago

Maybe chroma isn't a good choice here heh

logan-markewich commented 8 months ago

Will be very... tedious to fix this, but will take a look

dosubot[bot] commented 8 months ago

The error message you're encountering indicates that there are duplicate IDs in the set of documents you're trying to index. In the LlamaIndex software, each document or node you're trying to index should have a unique identifier.

From the code you've provided, you're not explicitly setting the id_ parameter when creating each Document object. Note that the Document class in LlamaIndex does auto-generate a UUID when id_ is omitted, so collisions at creation time are unlikely; still, setting explicit IDs yourself makes it easy to rule out duplicates introduced by your own pipeline.

Here's how you can modify your convert_to_llama_index_document function to generate unique IDs for each document using the uuid library:

import uuid
from llama_index.core.schema import Document

def convert_to_llama_index_document(parsed_google_docs):
    documents = []
    for doc in parsed_google_docs:
        if len(doc.metadata['header_structure']) > 0:
            header = doc.metadata['header_structure'][-1]
        else:
            header = ''
        documents.append(
            Document(
                id_=str(uuid.uuid4()),  # generate a unique ID
                text=doc.content,
                metadata={
                    "header": header
                }
            )
        )
    return documents

In this modified function, uuid.uuid4() is used to generate a unique ID for each document. This should ensure that each document has a unique ID and resolve the error you're encountering.

Please try this modification and let me know if it resolves your issue or if you have any other questions.

mw19930312 commented 8 months ago

Ah, I think this might be due to how RAPTOR enables nodes to belong to more than one cluster (would explain why it only happens sometimes)

This kind of makes sense to me because the error disappears if I make my document a bit shorter. Anyway, just let me know if you have found anything. Thanks for all the help!

GabrielFo commented 6 months ago

I have the same problem and would love a solution, or a recommendation for a vector store other than Chroma.

kaigexie commented 4 months ago

Abandoning chromadb and switching to a different vector store (e.g. faiss) solved the issue for me.
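For reference, a minimal sketch of the FAISS swap, assuming the `faiss-cpu` and `llama-index-vector-stores-faiss` packages are installed (1536 is the embedding dimension of text-embedding-ada-002):

```python
import faiss
from llama_index.vector_stores.faiss import FaissVectorStore

# FAISS indexes vectors by integer position and does not enforce string-ID
# uniqueness, so Chroma's duplicate-ID validation never comes into play.
dimension = 1536  # text-embedding-ada-002 embedding size
faiss_index = faiss.IndexFlatL2(dimension)
vector_store = FaissVectorStore(faiss_index=faiss_index)

# This vector_store can then be passed to RaptorPack(..., vector_store=vector_store)
# in place of the Chroma store from the example notebook.
```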