huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

Input validation error: `inputs` tokens + `max_new_tokens` must be <= 2048 on MPT-30b supporting 8k context #628

Closed kalvin1024 closed 1 year ago

kalvin1024 commented 1 year ago

System Info

ValidationError: Input validation error: inputs tokens + max_new_tokens must be <= 2048. Given: 1244 inputs tokens and 1000 max_new_tokens

I am hosting Hugging Face Text Generation Inference on a RunPod GPU. The model is intended to support an 8192-token context length, but Text Generation Inference imposes a hard limit on the input. How can I resolve this problem?

llm = HuggingFaceTextGenInference(
    inference_server_url=inference_server_url,
    max_new_tokens=1000,
    top_k=10,
    top_p=0.95,
    typical_p=0.95,
    temperature=0.1,
    repetition_penalty=1.03,
    stream=True
)

I am also using a LangChain qa_chain to load up to 5 candidate documents into my prompt. Theoretically this should work with MPT, but Text Generation Inference raises the error above. How can I override this limit? Otherwise, supporting a long-context model becomes meaningless.

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vector_db.as_retriever(
        search_type='mmr',
        search_kwargs={
            'k': top_k,  # retrieve up to 5 of the most relevant documents from the indexed DB
            'filter': {'category': category},  # filter on document metadata
        },
    ),
    chain_type_kwargs=chain_type_kwargs,
    return_source_documents=True,
)
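
The snippet above stuffs up to 5 retrieved documents into the prompt, so the failure is simple arithmetic: 1244 input tokens + 1000 max_new_tokens = 2244, which exceeds the server's default budget of 2048 total tokens. A minimal sketch (a hypothetical helper, assuming the MPT-30B tokenizer) for checking a request against that budget before sending it:

from transformers import AutoTokenizer

# Assumption: the server counts tokens with the model's own tokenizer.
tokenizer = AutoTokenizer.from_pretrained("mosaicml/mpt-30b-instruct")

def fits_in_budget(prompt: str, max_new_tokens: int, total_budget: int = 2048) -> bool:
    """Return True if prompt tokens + max_new_tokens stay within the server's total-token budget."""
    n_input = len(tokenizer(prompt)["input_ids"])
    return n_input + max_new_tokens <= total_budget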

Information

Tasks

Reproduction

!pip install langchain==0.0.230 openai chromadb==0.3.26 pydantic==1.10.8 GitPython ipython tiktoken runpod text-generation transformers python-dotenv

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())
import runpod
import os

os.environ['RUNPOD_API_KEY'] = "aaa"  # placeholder; replace with your RunPod API key
runpod.api_key = os.getenv("RUNPOD_API_KEY", "your_runpod_api_key")

gpu_count = 1

pod = runpod.create_pod(
    name="MPT-30B-Instruct",
    image_name="ghcr.io/huggingface/text-generation-inference:0.9.1",
    gpu_type_id="NVIDIA A100 80GB PCIe",
    cloud_type="SECURE",
    docker_args=f"--model-id mosaicml/mpt-30b-instruct --num-shard {gpu_count} --trust-remote-code",
    gpu_count=gpu_count,
    volume_in_gb=225,
    container_disk_in_gb=75,
    ports="80/http",
    volume_mount_path="/data",
)

from langchain.llms import HuggingFaceTextGenInference
from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA
from IPython.display import display, Markdown

inference_server_url = f'https://{pod["id"]}-80.proxy.runpod.net'
llm = HuggingFaceTextGenInference(
    inference_server_url=inference_server_url,
    max_new_tokens=1000,
    top_k=10,
    top_p=0.95,
    typical_p=0.95,
    temperature=0.1,
    repetition_penalty=1.03,
    stream=True
)

final_answer_prompt_template = """
# INSTRUCTIONS
- instructions to execute the prompt

# CONTEXT
{context}

# QUERY
{question}
"""
FINAL_ANSWER_PROMPT = PromptTemplate(
    template=final_answer_prompt_template, input_variables=["context", "question"]
)

def ask(query, category, top_k=5, show_sources=False):
    display(Markdown(f"### Query\n{query}"))

    chain_type_kwargs = {"prompt": FINAL_ANSWER_PROMPT}
    qa_chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=vector_db.as_retriever(
            search_type='mmr',
            search_kwargs={
                'k': top_k,  # retrieve up to top_k of the most relevant documents from the indexed DB
                'filter': {'category': category},  # filter on document metadata
            },
        ),
        chain_type_kwargs=chain_type_kwargs,
        return_source_documents=True,
    )

    answer = qa_chain({'query': query})
    return answer
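
A hypothetical invocation (the query string and category value are placeholders; category must match metadata stored in vector_db):

result = ask("Summarize the key points from the indexed documents.", category="general")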

Expected behavior

The model outputs a natural-language response, just as it would when called on a short prompt.

OlivierDehaene commented 1 year ago

You need to update the MAX_INPUT_LENGTH and MAX_TOTAL_TOKENS env vars to values that suit your use case.

OlivierDehaene commented 1 year ago

See https://github.com/huggingface/text-generation-inference/blob/44acf72a736346f4b8e969c9453027ca32786d72/launcher/src/main.rs#L133-L149
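
For example (a sketch adapted from the reproduction above, not from the thread; the values 7168 and 8192 are illustrative for MPT-30B's 8k context), the limits can be raised by passing the corresponding launcher flags through docker_args when creating the pod:

pod = runpod.create_pod(
    name="MPT-30B-Instruct",
    image_name="ghcr.io/huggingface/text-generation-inference:0.9.1",
    gpu_type_id="NVIDIA A100 80GB PCIe",
    cloud_type="SECURE",
    docker_args=(
        "--model-id mosaicml/mpt-30b-instruct "
        f"--num-shard {gpu_count} "
        "--trust-remote-code "
        "--max-input-length 7168 "   # per-request input limit (maps to MAX_INPUT_LENGTH)
        "--max-total-tokens 8192"    # input tokens + max_new_tokens budget (maps to MAX_TOTAL_TOKENS)
    ),
    gpu_count=gpu_count,
    volume_in_gb=225,
    container_disk_in_gb=75,
    ports="80/http",
    volume_mount_path="/data",
)

Equivalently, the same limits can be supplied as the MAX_INPUT_LENGTH and MAX_TOTAL_TOKENS environment variables on the container, as noted above.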