samkeen opened 1 year ago
See the LangChain docs excerpt below.

Use token splitter

Token splitting

We can also split on token count explicitly, if we want. This can be useful because LLMs often have context windows designated in tokens. Tokens are often ~4 characters.
from langchain.text_splitter import TokenTextSplitter

# chunk_size is measured in tokens; chunk_size=1 yields one token per chunk
text_splitter = TokenTextSplitter(chunk_size=1, chunk_overlap=0)
text1 = "foo bar bazzyfoo"
text_splitter.split_text(text1)

# `pages` is assumed to be a list of Documents loaded earlier (e.g. by a PDF loader)
text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)
docs = text_splitter.split_documents(pages)
docs[0]
pages[0].metadata
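Since token-based splitting exists to match the model's own token accounting, it can help to count tokens directly. A minimal sketch using tiktoken, which TokenTextSplitter uses under the hood ("gpt2" is its default encoding, though that is an assumption worth verifying for your model):

import tiktoken

enc = tiktoken.get_encoding("gpt2")   # assumed default encoding of TokenTextSplitter
tokens = enc.encode("foo bar bazzyfoo")
len(tokens)                           # the count chunk_size is measured against
[enc.decode([t]) for t in tokens]     # per-token text, mirroring the chunk_size=1 split above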
Summary
We currently build the prompt naively, concatenating the full text of every doc returned by the vector DB's similarity search.
If this results in a prompt larger than the LLM's context window, we get an error such as this:
InvalidRequestError: This model's maximum context length is 4097 tokens, however you requested 8889 tokens (8633 in your prompt; 256 for the completion). Please reduce your prompt; or completion length.
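For reference, the naive construction described above looks roughly like this; vectordb, question, and the prompt template are illustrative, not the actual code:

docs = vectordb.similarity_search(question)            # returned docs can be any size
context = "\n\n".join(d.page_content for d in docs)    # all text, unbounded
prompt = f"Answer from the context below.\n\n{context}\n\nQuestion: {question}"
# If the prompt's token count plus the completion budget exceeds the model's
# context window (4097 tokens above), the API raises InvalidRequestError.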
Solution
Option 1
On ingest, chunk the docs to something like 1k tokens (with 100 overlap). Then we know each returned doc is 1k tokens or less. Then, when doing the similarity search,
ensure k * [chunk size] is comfortably below the LLM's context window.
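A minimal sketch of Option 1; raw_docs, vectordb, and question are illustrative, and the numbers assume the 4097-token window from the error above:

from langchain.text_splitter import TokenTextSplitter

# Ingest: chunk to ~1k tokens with 100-token overlap
splitter = TokenTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(raw_docs)
vectordb.add_documents(chunks)

# Query: pick k so k * 1000 stays well under the context window;
# with a 4097-token window and 256 completion tokens, k=3 leaves headroom
docs = vectordb.similarity_search(question, k=3)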
Option 2
Use the LLM to summarize the text of the docs returned by the similarity search so that the final prompt fits within the context window.
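A minimal sketch of Option 2 using LangChain's summarize chain; map_reduce summarizes each doc independently and then combines the summaries, so no single LLM call needs the full text at once. vectordb and question are illustrative:

from langchain.chains.summarize import load_summarize_chain
from langchain.llms import OpenAI

llm = OpenAI(temperature=0)
chain = load_summarize_chain(llm, chain_type="map_reduce")
docs = vectordb.similarity_search(question, k=4)
condensed = chain.run(docs)   # summaries sized to fit the context window
# build the final prompt from `condensed` instead of the raw doc text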