run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Bug]: RAPTOR failed due to duplicated Document ID, chromadb.errors.DuplicateIDError: Expected IDs to be unique, found duplicates of: 17ab7f1d-*********** #14711

Open satyasaipavansify opened 2 months ago

satyasaipavansify commented 2 months ago

Bug Description

I'm following https://github.com/run-llama/llama_index/blob/main/llama-index-packs/llama-index-packs-raptor/examples/raptor.ipynb to kick off a few RAPTOR-style experiments myself. However, I'm getting an error when creating the index due to a duplicated document ID. I've provided the code below; `SimpleDirectoryReader` loads a list of 12 documents.

Error Logs:

```
Loading files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:49<00:00, 4.11s/file]
Current time: 06:45:08.876467
Generating embeddings for level 0.
Performing clustering for level 0.
Generating summaries for level 0 with 142 clusters.
Level 0 created summaries/clusters: 142
Traceback (most recent call last):
  File "c:\Users\017912\Desktop\Dell ChatBot Workflow\ingestion.py", line 43, in <module>
    RaptorPack(
  File "C:\Users\017912\AppData\Local\anaconda3\envs\dell_chatbot\Lib\site-packages\llama_index\packs\raptor\base.py", line 348, in __init__
    self.retriever = RaptorRetriever(
                     ^^^^^^^^^^^^^^^^
  File "C:\Users\017912\AppData\Local\anaconda3\envs\dell_chatbot\Lib\site-packages\llama_index\packs\raptor\base.py", line 139, in __init__
    asyncio.run(self.insert(documents))
  File "C:\Users\017912\AppData\Local\anaconda3\envs\dell_chatbot\Lib\asyncio\runners.py", line 190, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "C:\Users\017912\AppData\Local\anaconda3\envs\dell_chatbot\Lib\asyncio\runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\017912\AppData\Local\anaconda3\envs\dell_chatbot\Lib\asyncio\base_events.py", line 654, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "C:\Users\017912\AppData\Local\anaconda3\envs\dell_chatbot\Lib\site-packages\llama_index\packs\raptor\base.py", line 225,
  File "C:\Users\017912\AppData\Local\anaconda3\envs\dell_chatbot\Lib\site-packages\llama_index\core\indices\vector_store\base.py", line 330, in insert_nodes
    self._insert(nodes, **insert_kwargs)
  File "C:\Users\017912\AppData\Local\anaconda3\envs\dell_chatbot\Lib\site-packages\llama_index\core\indices\vector_store\base.py", line 312, in _insert
    self._add_nodes_to_index(self._index_struct, nodes, **insert_kwargs)
  File "C:\Users\017912\AppData\Local\anaconda3\envs\dell_chatbot\Lib\site-packages\llama_index\core\indices\vector_store\base.py", line 234, in _add_nodes_to_index
    new_ids = self._vector_store.add(nodes_batch, **insert_kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\017912\AppData\Local\anaconda3\envs\dell_chatbot\Lib\site-packages\llama_index\vector_stores\chroma\base.py", line 265, in add
    self._collection.add(
  File "C:\Users\017912\AppData\Local\anaconda3\envs\dell_chatbot\Lib\site-packages\chromadb\api\models\Collection.py", line 146, in add
    ) = self._validate_embedding_set(
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\017912\AppData\Local\anaconda3\envs\dell_chatbot\Lib\site-packages\chromadb\api\models\Collection.py", line 545, in _validate_embedding_set
    valid_ids = validate_ids(maybe_cast_one_to_many_ids(ids))
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\017912\AppData\Local\anaconda3\envs\dell_chatbot\Lib\site-packages\chromadb\api\types.py", line 255, in validate_ids
    raise errors.DuplicateIDError(message)
chromadb.errors.DuplicateIDError: Expected IDs to be unique, found duplicates of: 17ab7f1d-b145-42c1-8dc3-8e2e29541823
```

Version

pip install llama-index-packs-raptor==0.1.3

Steps to Reproduce

Code (imports added for completeness):

```python
import chromadb
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.settings import Settings
from llama_index.packs.raptor import RaptorPack
from llama_index.vector_stores.chroma import ChromaVectorStore

Settings.llm = ingest_llm
Settings.embed_model = embed_model
Settings.context_window = 8192
Settings.num_output = 1024

documents = SimpleDirectoryReader(input_dir="./docs").load_data(show_progress=True)
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("dump")
vector_store = ChromaVectorStore(chroma_collection=collection)

RaptorPack(
    documents,
    embed_model=embed_model,
    llm=ingest_llm,
    vector_store=vector_store,
    mode="tree_traversal",
    transformations=[SentenceSplitter(chunk_size=1024, chunk_overlap=128)],
    tree_depth=3,
    verbose=True,
)
```

Relevant Logs/Tracebacks

No response

logan-markewich commented 2 months ago

Chroma doesn't allow duplicate ids, which kind of needs to be handled for this process to work.

Try a different vector db

logan-markewich commented 2 months ago

There are duplicates because chunks can belong to more than one cluster
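A minimal self-contained sketch of the collision (illustrative names only, not the actual RAPTOR internals): the same chunk keeps a single node ID while being assigned to two clusters, so the flat ID list handed to the vector store contains a repeat, which is exactly what Chroma's uniqueness validation rejects.

```python
import uuid

# Stand-in for RAPTOR's per-level insert: one chunk (one node ID) belongs
# to two clusters, so it appears twice in the IDs sent to the store.
shared_chunk_id = str(uuid.uuid4())
clusters = [
    [shared_chunk_id, str(uuid.uuid4())],  # cluster 0 contains the shared chunk
    [shared_chunk_id, str(uuid.uuid4())],  # cluster 1 contains it too
]

ids_to_insert = [node_id for cluster in clusters for node_id in cluster]

# Mirror of Chroma's check: any ID occurring more than once is a duplicate.
duplicates = {i for i in ids_to_insert if ids_to_insert.count(i) > 1}
assert duplicates == {shared_chunk_id}  # the shared chunk trips the check
```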

satyasaipavansify commented 2 months ago

> There are duplicates because chunks can belong to more than one cluster

Could we utilize ChromaDB for this use case?
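If you want to stay on ChromaDB, one possible workaround (my own sketch, not something the RAPTOR pack does for you) is to clone each node with a fresh UUID before insertion, so each cluster's copy has a unique ID; the trade-off is that identical text is stored more than once and can come back duplicated at retrieval time. Here the nodes are plain dicts standing in for llama-index nodes:

```python
import uuid
from copy import deepcopy

def reassign_ids(nodes):
    """Return copies of `nodes` with fresh UUIDs, so a store that enforces
    unique IDs (like Chroma) accepts the same chunk once per cluster."""
    fresh = []
    for node in nodes:
        clone = deepcopy(node)
        clone["id_"] = str(uuid.uuid4())  # new identity for this copy
        fresh.append(clone)
    return fresh

# The same chunk assigned to two clusters (hypothetical dicts, not real nodes):
shared = {"id_": "17ab7f1d", "text": "duplicated chunk"}
nodes = [shared, {"id_": "other-id", "text": "other chunk"}, shared]

unique_nodes = reassign_ids(nodes)
ids = [n["id_"] for n in unique_nodes]
assert len(ids) == len(set(ids))  # a uniqueness check like Chroma's now passes
```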

satyasaipavansify commented 2 months ago

> Chroma doesn't allow duplicate ids, which kind of needs to be handled for this process to work.
>
> Try a different vector db

Can you suggest a better vector store for this use case, one that won't run into the duplicate ID issue?

logan-markewich commented 2 months ago

I think every other vector store I've tried handles this fine 😅 Maybe try qdrant

satyasaipavansify commented 2 months ago

> I think every other vector store I've tried handles this fine 😅 Maybe try qdrant

When querying the index (Qdrant Vector DB), I encounter the following issue.

Code:

```python
from qdrant_client import QdrantClient
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.packs.raptor import RaptorRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.response_synthesizers.type import ResponseMode

client = QdrantClient(path="./qdrant")
vector_store = QdrantVectorStore(client=client, collection_name="dump")

retriever = RaptorRetriever(
    [],
    embed_model=embed_model,
    llm=llm,
    vector_store=vector_store,
    similarity_top_k=2,
    mode="tree_traversal",
)

query_engine = RetrieverQueryEngine.from_args(
    retriever, llm=llm, response_mode=ResponseMode.REFINE
)

response = query_engine.query("Explain about catalogues")
response
```

Error Logs:

```
Cell In[28], line 23
     17 from llama_index.core.response_synthesizers.type import ResponseMode
     19 query_engine = RetrieverQueryEngine.from_args(
     20     retriever, llm=llm, response_mode=ResponseMode.REFINE
     21 )
---> 23 response = query_engine.query("Explain about catalogues")
     24 response

File c:\Users\017912\AppData\Local\anaconda3\envs\dell_chatbot\Lib\site-packages\llama_index\core\instrumentation\dispatcher.py:230, in Dispatcher.span.<locals>.wrapper(func, instance, args, kwargs)
    226 self.span_enter(
    227     id_=id_, bound_args=bound_args, instance=instance, parent_id=parent_id
    228 )
    229 try:
--> 230     result = func(*args, **kwargs)
    231 except BaseException as e:
    232     self.event(SpanDropEvent(span_id=id_, err_str=str(e)))

File c:\Users\017912\AppData\Local\anaconda3\envs\dell_chatbot\Lib\site-packages\llama_index\core\base\base_query_engine.py:52, in BaseQueryEngine.query(self, str_or_query_bundle)
     50 if isinstance(str_or_query_bundle, str):
     51     str_or_query_bundle = QueryBundle(str_or_query_bundle)
---> 52 query_result = self._query(str_or_query_bundle)
     53 dispatcher.event(
     54     QueryEndEvent(query=str_or_query_bundle, response=query_result)
...
    940     query_filter=query_filter,
    941 )
    943 return self.parse_to_query_result(response)

AttributeError: 'NoneType' object has no attribute 'search'
```

logan-markewich commented 2 months ago
```python
from qdrant_client import QdrantClient, AsyncQdrantClient
from llama_index.vector_stores.qdrant import QdrantVectorStore

client = QdrantClient(path="./qdrant")
# assign the async client to its own variable rather than overwriting `client`
aclient = AsyncQdrantClient(path="./qdrant")
vector_store = QdrantVectorStore(client=client, aclient=aclient, collection_name="dump")
```
satyasaipavansify commented 1 month ago

@logan-markewich

I have a use case where I need to feed the complete content (end to end, including images and tables) from multiple documents into an LLM/generative AI model. The goal is to generate a single cohesive piece of content in a specified schema, such as a course or similar output.

This means the LLM/Generative AI model needs to have access to all the details from each document to produce content from them.

Are there any alternatives to RAG or model fine-tuning? With RAG, we're limited by context length and it relies solely on chunks.

Can you recommend an approach for implementing this use case?