Closed mw19930312 closed 1 month ago
Hey there, @mw19930312! 🚀 Long time no bug, huh? 😄 I remember we've danced around some code together before! Taking a dive into your issue now and will have some insights for you shortly. Hang tight!
Ah, I think this might be due to how RAPTOR enables nodes to belong to more than one cluster (would explain why it only happens sometimes)
Maybe chroma isn't a good choice here heh
Will be very... tedious to fix this, but will take a look
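If RAPTOR can indeed emit the same node under more than one cluster, one possible workaround (an untested sketch, not part of the RAPTOR pack) is to de-duplicate nodes by node_id before they reach the Chroma-backed store; dedupe_nodes below is a hypothetical helper written for illustration:

# Possible workaround (untested sketch): drop repeated node IDs so Chroma
# never receives the same ID twice in one insert.
def dedupe_nodes(nodes):
    seen = set()
    unique_nodes = []
    for node in nodes:
        if node.node_id not in seen:
            seen.add(node.node_id)
            unique_nodes.append(node)
    return unique_nodes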
The error message you're encountering indicates that there are duplicate IDs in the set of documents you're trying to index. In the LlamaIndex software, each document or node you're trying to index should have a unique identifier.
From the code you've provided, it seems like you're not explicitly setting the id_ parameter when creating a Document object. The Document class in LlamaIndex does not automatically generate unique IDs. If you don't provide an id_ when creating a Document, the default value is None, which could be causing the issue if multiple documents are created without explicit IDs.
Here's how you can modify your convert_to_llama_index_document function to generate unique IDs for each document using the uuid library:
import uuid
from llama_index.core.schema import Document

def convert_to_llama_index_document(parsed_google_docs):
    documents = []
    for doc in parsed_google_docs:
        if len(doc.metadata['header_structure']) > 0:
            header = doc.metadata['header_structure'][-1]
        else:
            header = ''
        documents.append(
            Document(
                id_=str(uuid.uuid4()),  # generate a unique ID
                text=doc.content,
                metadata={
                    "header": header
                }
            )
        )
    return documents
In this modified function, uuid.uuid4() is used to generate a unique ID for each document. This should ensure that each document has a unique ID and resolve the error you're encountering.
Please try this modification and let me know if it resolves your issue or if you have any other questions.
Ah, I think this might be due to how RAPTOR enables nodes to belong to more than one cluster (would explain why it only happens sometimes)
This kind of makes sense to me because the error disappears if I make my document a bit shorter. Anyway, just let me know if you have found anything. Thanks for all the help!
I have the same problem; would love a solution or a recommendation for a vector store other than Chroma.
Abandoning chromadb and switching to a different vector store (e.g. faiss) solved the issue for me.
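For anyone making the same switch, here is a rough sketch of wiring in a FAISS-backed store (assuming the faiss and llama-index-vector-stores-faiss packages are installed; 1536 is the embedding dimension of text-embedding-ada-002):

import faiss
from llama_index.vector_stores.faiss import FaissVectorStore

# text-embedding-ada-002 returns 1536-dimensional vectors
faiss_index = faiss.IndexFlatL2(1536)
vector_store = FaissVectorStore(faiss_index=faiss_index)

# pass this store to RaptorPack in place of the Chroma-backed one
# (see the RaptorPack call in Steps to Reproduce below)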
Bug Description
I'm following https://github.com/run-llama/llama_index/blob/main/llama-index-packs/llama-index-packs-raptor/examples/raptor.ipynb to kick off a few RAPTOR-type experiments by myself. However, the code occasionally fails when creating the index due to duplicated Document IDs. I've provided the code below, where llama_index_documents is a list of Document objects.
Then, the error occurs when creating the RAPTOR index.
Version
0.10.15
Steps to Reproduce
from llama_index.core.schema import Document
def convert_to_llama_index_document(parsed_google_docs):
    documents = []
    for doc in parsed_google_docs:
        if len(doc.metadata['header_structure']) > 0:
            header = doc.metadata['header_structure'][-1]
        else:
            header = ''
        documents.append(
            Document(
                text=doc.content,
                metadata={
                    "header": header
                }
            )
        )
    return documents
llama_index_documents = convert_to_llama_index_document(parsed_google_docs)

raptor_pack_google_doc = RaptorPack(
    llama_index_documents,
    embed_model=OpenAIEmbedding(
        model="text-embedding-ada-002"
    ),  # used for embedding clusters
    llm=OpenAI(model="gpt-3.5-turbo", temperature=0),  # used for generating summaries
    vector_store=vector_store,  # used for storage
    similarity_top_k=5,  # top k for each layer, or overall top-k for collapsed
    mode="tree_traversal",  # sets default mode
    transformations=[
        SentenceSplitter(chunk_size=400, chunk_overlap=50)
    ],  # transformations applied for ingestion
)
Relevant Logs/Tracebacks