Closed: sayanmndl21 closed this issue 3 months ago
Hey @sayanmndl21! Great to see you back here asking the challenging questions. How's it going?🚀
Based on the information you've provided and the similar issues I found in the LlamaIndex repository, it seems like you're encountering a common problem when dealing with large documents: the total number of tokens in the document exceeds the model's maximum token limit.
To avoid this error, you can implement a strategy to break down the document into smaller chunks that fit within the model's token limit before summarizing each part. After summarizing the individual parts, you can then combine these summaries to form a final summary for the entire document. This approach ensures that each chunk of the document is within the model's context window limit, thus avoiding the error.
Here's a high-level approach to implement this strategy:
1. **Chunking the Document**: Before summarizing, divide the document into smaller parts, each containing fewer tokens than the model's maximum token limit.
2. **Summarizing Each Chunk**: Use the `DocumentSummaryIndex` class to summarize each chunk individually. This involves modifying or extending the `_add_nodes_to_index` method to handle chunked input and generate a summary for each chunk.
3. **Combining Summaries**: After obtaining summaries for all chunks, you can optionally combine these summaries into a comprehensive summary that represents the entire document. This step might involve another round of summarization to ensure coherence and conciseness (a sketch of this map-and-combine flow appears after the splitter example below).
This approach is not directly outlined in the provided code, but you can implement it by extending the `DocumentSummaryIndex` class or by preprocessing the documents before they are passed to the summarization methods. The key is to ensure that each piece of text fed into the summarization process does not exceed the model's token limit.
For example, you can use the `TokenTextSplitter` class from the `llama_index.text_splitter` module to split your text into smaller chunks that stay within the token limit. Here is an example of how to use `TokenTextSplitter`:
```python
from llama_index.text_splitter import TokenTextSplitter

# Your input text
text = "Your very long text..."

# Create a text splitter
text_splitter = TokenTextSplitter(chunk_size=100, chunk_overlap=0)

# Split the text into chunks
chunks = text_splitter.split_text(text)

# Now you can process each chunk separately
for chunk in chunks:
    # Your processing code here
    pass
```
Adjust the `chunk_size` and `chunk_overlap` parameters as needed for your specific use case. If your text is tokenized in a specific way, you can pass the tokenizer's encoding function to the `TokenTextSplitter` constructor.
I hope this helps! If you have any other questions or need further clarification, feel free to ask.
I have a similar problem. @sayanmndl21 have you found a solution? Thanks in advance :)
Hey @taimans-git, unfortunately no. I have since changed direction in favor of simpler methods like the Retriever Query Engine, which is performing better than expected.
I'd suggest custom retrievers (https://docs.llamaindex.ai/en/stable/examples/query_engine/CustomRetrievers/) and chromadb vector store if you are working with a large document base.
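For reference, a minimal sketch of that kind of setup (assuming the pre-0.10 `llama_index` namespace used earlier in this thread; the data path and collection name are placeholders):

```python
import chromadb
from llama_index import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.vector_stores import ChromaVectorStore

# Persistent Chroma collection on disk; path and collection name are placeholders.
chroma_client = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = chroma_client.get_or_create_collection("my_docs")

# Use the Chroma collection as the index's vector store.
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)

# Plain retriever-backed query engine; see the custom retriever docs linked
# above for combining vector and keyword retrieval.
query_engine = index.as_query_engine(similarity_top_k=5)
print(query_engine.query("Your question here"))
```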
Hi @sayanmndl21, thanks for your suggestions. I tried different things, and with other API models it seems to work. For example, for me it works with models from VertexAI or OpenAI. For MistralAI, I additionally had to set the context window to make it work, e.g. `Settings.context_window = 4096`.
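A small sketch of where that setting goes (assuming a release new enough to ship the global `Settings` object and the MistralAI integration; the model name and API key are placeholders):

```python
from llama_index.core import Settings
from llama_index.llms.mistralai import MistralAI

# Placeholder model name and API key.
Settings.llm = MistralAI(model="mistral-small", api_key="YOUR_API_KEY")
# Cap requests to the model's actual context size.
Settings.context_window = 4096
```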
Question
I am following this tutorial to create a small RAG engine with a local llama model (Mixtral 8x7B): Multi Document Agents v1. But this always leads to:

`ValueError: Requested tokens (xxx) exceed context window of yyy`

I have so far tried reducing the chunk size, increasing `max_tokens` to the limit, and adding `CompactAndRefine()` as the synthesizer, but none of them works. Is there an argument I might be missing?