Open arun13ak opened 1 month ago
To handle the issue of exceeding the model's maximum context length of 8192 tokens, you can split the documents into smaller chunks before processing them with the LlamaParseJsonNodeParser. Here's an example of how you can achieve this using the SemanticChunker class for effective chunking:
from llama_index.core import Document
from llama_index.packs.node_parser_semantic_chunking.base import SemanticChunker
from llama_index.embeddings.openai import OpenAIEmbedding

# Initialize the SemanticChunker with the desired settings
semantic_chunker = SemanticChunker(
    buffer_size=1,                         # Number of sentences grouped together when computing embeddings
    embed_model=OpenAIEmbedding(),         # Embedding model used to measure semantic similarity
    breakpoint_percentile_threshold=95.0,  # Percentile threshold at which a chunk break is inserted
)

# Build Document objects from the parsed JSON pages
documents = []
for page in json_list:
    documents.append(
        Document(
            text=page.get("text"),
            metadata=page,
        )
    )

# Split each document into semantically coherent text chunks
chunks = []
for doc in documents:
    chunks.extend(semantic_chunker.split_text(doc.text))

# `chunks` now contains the chunked text ready to be passed to the LLM
This code uses the SemanticChunker to split the documents into semantically meaningful chunks, considering sentence boundaries and semantic similarity, which can then be passed to the LLM [1].
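If you then want to send each chunk to the LLM directly, a minimal sketch could look like the following; the summarization prompt is only a placeholder and should be adapted to your actual task:

from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-3.5-turbo")

responses = []
for chunk in chunks:
    # Placeholder prompt; replace with whatever per-chunk task you need
    response = llm.complete(f"Summarize the following text:\n\n{chunk}")
    responses.append(response.text)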
You'll probably need to run a second pass of chunking to enforce a maximum chunk size:
from llama_index.core.node_parser import SentenceSplitter, LlamaParseJsonNodeParser  # import path may vary by llama-index version
from llama_index.llms.openai import OpenAI

splitter = SentenceSplitter(chunk_size=2048, chunk_overlap=128)

json_node_parser = LlamaParseJsonNodeParser(
    llm=OpenAI(model="gpt-3.5-turbo"), num_workers=16, include_metadata=True
)

# First pass: parse the documents into nodes, then re-split to cap the node size
nodes = json_node_parser(documents)
nodes = splitter(nodes)
8192 tokens is the maximum input size for OpenAI embeddings, and the JSON node parser does not enforce a maximum size on its own.
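If you want to verify that every node actually fits under that limit before embedding, a quick sanity check with tiktoken (assuming the cl100k_base encoding used by the current OpenAI embedding models) could look like this:

import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

# Flag any node whose text exceeds the embedding model's 8192-token limit
for node in nodes:
    num_tokens = len(encoding.encode(node.get_content()))
    if num_tokens > 8192:
        print(f"Node {node.node_id} has {num_tokens} tokens and needs further splitting")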
Question
@dosu I am using LlamaParseJsonNodeParser to parse documents, and my model has an 8192-token context window. How can I split the documents into chunks? This is my code that stores the text in a document list; please give a sample example of chunking these document values and passing them to the LLM:

documents = []
for _, page in enumerate(json_list):
    documents.append(
        Document(
            text=page.get("text"),
            metadata=page,
        )
    )

node_parser = LlamaParseJsonNodeParser(
    llm=OpenAI(model="gpt-3.5-turbo"), num_workers=16, include_metadata=True
)
nodes = node_parser.get_nodes_from_documents(documents)