ValidationError: Input validation error: inputs tokens + max_new_tokens must be <= 2048. Given: 1244 inputs tokens and 1000 max_new_tokens
I am hosting the model with Hugging Face text-generation-inference on a RunPod GPU. The model supports an 8192-token context, but text-generation-inference is enforcing a 2048-token limit on the request: 1244 input tokens + 1000 max_new_tokens = 2244, which exceeds 2048. How can I resolve this?
I am using a LangChain RetrievalQA chain that stuffs up to 5 retrieved documents into the prompt. With MPT's 8192-token context this should fit, yet the server rejects the request. How can I override this limit? Otherwise, hosting a long-context model becomes meaningless. The chain is built like this:
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # stuff all retrieved documents into a single prompt
    retriever=vector_db.as_retriever(
        search_type='mmr',
        search_kwargs={
            'k': top_k,  # retrieve up to top_k (5) documents from the indexed DB
            'filter': {'category': category},  # filter on document metadata
        },
    ),
    chain_type_kwargs=chain_type_kwargs,
    return_source_documents=True
)
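As a sanity check, I believe the TGI server reports the limits it was launched with on its /info route, which should confirm whether the 2048 comes from the server config rather than from the model. A rough sketch (endpoint and field names assumed, not verified against 0.9.1):

import requests

# Hypothetical pod URL, same form as in the reproduction below.
inference_server_url = "https://<pod-id>-80.proxy.runpod.net"

# /info should report the limits the launcher was started with,
# e.g. max_input_length and max_total_tokens (field names assumed).
info = requests.get(f"{inference_server_url}/info").json()
print(info.get("max_input_length"), info.get("max_total_tokens"))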
Information
[X] Docker
[ ] The CLI directly
Tasks
[X] An officially supported command
[ ] My own modifications
Reproduction
!pip install langchain==0.0.230 openai chromadb==0.3.26 pydantic==1.10.8 GitPython ipython tiktoken runpod text-generation transformers python-dotenv
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())
import runpod
import os
os.environ['RUNPOD_API_KEY'] = "aaa"  # redacted; the real key is loaded from .env above
runpod.api_key = os.getenv("RUNPOD_API_KEY", "your_runpod_api_key")
gpu_count = 1
pod = runpod.create_pod(
name="MPT-30B-Instruct",
image_name="ghcr.io/huggingface/text-generation-inference:0.9.1",
gpu_type_id="NVIDIA A100 80GB PCIe",
cloud_type="SECURE",
docker_args=f"--model-id mosaicml/mpt-30b-instruct --num-shard {gpu_count} --trust-remote-code",
gpu_count=gpu_count,
volume_in_gb=225,
container_disk_in_gb=75,
ports="80/http",
volume_mount_path="/data",
)
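My assumption is that the 2048 cap comes from the text-generation-launcher defaults rather than from the MPT config, so presumably it can be raised by passing the launcher's --max-input-length and --max-total-tokens flags through docker_args. An untested sketch (whether --max-batch-prefill-tokens is also required on 0.9.1 is exactly what I would like to confirm):

# Untested sketch: raise the server-side token limits at launch time.
# The 8191/8192 values assume MPT-30B's full 8192-token context window.
docker_args = (
    f"--model-id mosaicml/mpt-30b-instruct --num-shard {gpu_count} --trust-remote-code "
    "--max-input-length 8191 --max-total-tokens 8192 --max-batch-prefill-tokens 8192"
)
# ...then pass this string as docker_args to runpod.create_pod() exactly as above.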
from langchain.llms import HuggingFaceTextGenInference
from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA
from IPython.display import display, Markdown
inference_server_url = f'https://{pod["id"]}-80.proxy.runpod.net'
llm = HuggingFaceTextGenInference(
inference_server_url=inference_server_url,
max_new_tokens=1000,
top_k=10,
top_p=0.95,
typical_p=0.95,
temperature=0.1,
repetition_penalty=1.03,
stream=True
)
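To reproduce the arithmetic behind the error client-side, I count prompt tokens with the model's tokenizer before calling the endpoint; with max_new_tokens=1000, any prompt over 1048 tokens hits the 2048 ceiling. A sketch, assuming the tokenizer loads directly from the Hub:

from transformers import AutoTokenizer

# Tokenizer for the served model (assumed to load without trust_remote_code).
tokenizer = AutoTokenizer.from_pretrained("mosaicml/mpt-30b-instruct")

def count_prompt_tokens(prompt: str) -> int:
    # number of input tokens the server will count for this prompt
    return len(tokenizer.encode(prompt))

# e.g. 1244 prompt tokens + 1000 max_new_tokens = 2244 > 2048 -> ValidationError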
final_answer_prompt_template = """
# INSTRUCTIONS
- instructions to execute the prompt
# CONTEXT
{context}
# QUERY
{question}
"""
FINAL_ANSWER_PROMPT = PromptTemplate(
template=final_answer_prompt_template, input_variables=["context", "question"]
)
def ask(query, category, top_k=5, show_sources=False):
    display(Markdown(f"### Query\n{query}"))
    chain_type_kwargs = {"prompt": FINAL_ANSWER_PROMPT}
    qa_chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",  # stuff all retrieved documents into a single prompt
        retriever=vector_db.as_retriever(  # vector_db is the pre-built Chroma index (setup omitted)
            search_type='mmr',
            search_kwargs={
                'k': top_k,  # retrieve up to top_k documents from the indexed DB
                'filter': {'category': category},  # filter on document metadata
            },
        ),
        chain_type_kwargs=chain_type_kwargs,
        return_source_documents=True,
    )
    answer = qa_chain({'query': query})
    if show_sources:
        display(Markdown("### Sources"))
        for doc in answer['source_documents']:
            display(Markdown(f"- {doc.metadata}"))
    return answer
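A hypothetical call (the query and category values are made up for illustration); with top_k=5 the stuffed context is what pushes the prompt past the limit:

# Hypothetical usage; any query whose stuffed context exceeds ~1048 tokens
# fails with the ValidationError above (1048 + 1000 max_new_tokens > 2048).
result = ask("What do the cited documents say about X?", category="reports", top_k=5)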
Expected behavior
The model outputs a natural-language response, exactly as it does when called with a short prompt.
System Info
ghcr.io/huggingface/text-generation-inference:0.9.1 on RunPod Secure Cloud (1x NVIDIA A100 80GB PCIe), langchain 0.0.230, chromadb 0.3.26, pydantic 1.10.8. Full error:
ValidationError: Input validation error: inputs tokens + max_new_tokens must be <= 2048. Given: 1244 inputs tokens and 1000 max_new_tokens