run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Question]: prevent max token error after retrieval #12633

Closed Benedixx closed 1 week ago

Benedixx commented 3 months ago

Question Validation

Question

Hello there, how do I prevent a max token error after retrieval? For example, in my case my original prompt is around 600 tokens. [image]

I'm using text-davinci-003, which has a maximum context length of 4096 tokens. Every time I run the script it gives me this error:

```
openai.BadRequestError: Error code: 400 - {'error': {'message': "This model's maximum context length is 4097 tokens, however you requested 4915 tokens (3415 in your prompt; 1500 for the completion). Please reduce your prompt; or completion length.", 'type': 'invalid_request_error', 'param': None, 'code': None}}
```

For the retrieval script, I followed the LlamaIndex learn documentation. Here is my script:

```python
# Imports below assume the llama-index >= 0.10 package layout.
import os

from llama_index.core import Settings, get_response_synthesizer
from llama_index.core.postprocessor import SimilarityPostprocessor
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding
from llama_index.llms.azure_openai import AzureOpenAI

Settings.embed_model = AzureOpenAIEmbedding(
    model="text-embedding-ada-002",
    deployment_name="corpu-text-embedding-ada-002",
    api_key=os.getenv("API_KEY"),
    azure_endpoint=os.getenv("API_ENDPOINT"),
    api_version="2024-03-01-preview",
)

Settings.llm = AzureOpenAI(
    model="text-davinci-003",
    deployment_name="corpu-text-davinci-003",
    api_key=os.getenv("API_KEY"),
    azure_endpoint=os.getenv("API_ENDPOINT"),
    api_version="2024-03-01-preview",
    temperature=0.4,
    max_tokens=1500,
)

Settings.context_window = 4096
Settings.num_output = 1500

# reference_index and prompt are defined elsewhere in the script.
retriever = VectorIndexRetriever(index=reference_index, similarity_top_k=10)

response_synthesizer = get_response_synthesizer()

query_engine = RetrieverQueryEngine.from_args(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
    node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.7)],
    response_mode="tree_summarize",
)

result = query_engine.query(prompt)
```

i'm new at this so i have no clue how to optimize this and if you have suggestion to optimize this please let me know.

dosubot[bot] commented 3 months ago

Hey there, @Benedixx! 👋 I'm Dosu, your friendly neighborhood bot, here to help you squash bugs, answer questions, and guide you on your journey to becoming a contributor. While you're waiting for a human maintainer to chime in, I'm here to assist! Just letting you know I'm diving into your issue and will get back to you with a solution shortly. Stay tuned!

logan-markewich commented 3 months ago

max_tokens is the maximum number of tokens to generate. With openai, I would not set this, to let it generate as much as it has room for (and avoid this error)
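
For illustration, applying this suggestion to the snippet above would look roughly like the sketch below. It reuses the parameter values from the original post, assumes the llama-index >= 0.10 import paths, and only drops max_tokens:

```python
import os

from llama_index.core import Settings
from llama_index.llms.azure_openai import AzureOpenAI

# Same LLM configuration as in the question, but without max_tokens, so the
# model can use whatever room is left in the context window for the completion.
Settings.llm = AzureOpenAI(
    model="text-davinci-003",
    deployment_name="corpu-text-davinci-003",
    api_key=os.getenv("API_KEY"),
    azure_endpoint=os.getenv("API_ENDPOINT"),
    api_version="2024-03-01-preview",
    temperature=0.4,
)
```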

dosubot[bot] commented 3 months ago

Based on the conversation and the guidance provided by the maintainer, it seems the issue you're encountering, @Benedixx, is due to exceeding the maximum token limit allowed by the model you're using (text-davinci-003). The error message indicates that the combined length of your prompt and the requested completion exceeds the model's maximum context length of 4097 tokens.

The maintainer, Logan, suggests not setting the max_tokens parameter explicitly to avoid this error. This approach allows the model to generate as much content as possible within its token limit, potentially avoiding the error you're encountering.

However, considering your script and the specific error message, it appears that the total requested tokens (4915) exceed the model's limit due to the size of your prompt (3415 tokens) plus the completion length (1500 tokens). Here are a few suggestions to optimize your script and avoid exceeding the token limit:

  1. Reduce the Prompt Size: Try to shorten your prompt to ensure the combined length of the prompt and the completion does not exceed the model's maximum token limit. Since your prompt is significantly large (3415 tokens), reducing it could help stay within the limit.

  2. Adjust the max_tokens Parameter: While Logan suggests not setting this parameter, if you continue to encounter issues, consider adjusting it to a lower value that, when combined with your prompt size, does not exceed the model's limit. This might require some trial and error to find the optimal value.

  3. Optimize the Retrieval Script: Review the retrieval script to ensure it's efficiently selecting the most relevant information to include in the prompt. This might involve refining the logic that determines which parts of the retrieved information are most critical to include in the prompt sent to the model.

  4. Use a Different Model: If reducing the prompt size and adjusting the max_tokens parameter do not resolve the issue, consider using a different model that might have a higher token limit, if available and suitable for your needs.

Remember, the key is to ensure the combined length of your prompt and the requested completion stays within the model's maximum token limit. Experimenting with these adjustments should help you find a configuration that works for your specific use case.
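
As a concrete illustration of that constraint, the sketch below checks whether a prompt leaves room for the completion before a request is sent. It assumes the tiktoken package is installed; the 4097-token window and the 1500-token completion budget are taken from the error message and the original script.

```python
import tiktoken

CONTEXT_WINDOW = 4097   # limit reported in the error message for text-davinci-003
MAX_COMPLETION = 1500   # the max_tokens value from the original script

# text-davinci-003 maps to the p50k_base encoding in tiktoken
encoding = tiktoken.encoding_for_model("text-davinci-003")

def fits_in_context(prompt: str) -> bool:
    """True if the prompt plus the completion budget stays within the window."""
    prompt_tokens = len(encoding.encode(prompt))
    return prompt_tokens + MAX_COMPLETION <= CONTEXT_WINDOW

# The failing request had 3415 prompt tokens: 3415 + 1500 = 4915 > 4097.
```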

If you have further questions or need more assistance, feel free to ask. Good luck with your project, @Benedixx!

#### About Dosu

This response is meant to be useful and save you time. It is **not meant to be a precise solution**, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant. If you want to continue the conversation, start your reply with @dosu-bot.

Benedixx commented 3 months ago

> max_tokens is the maximum number of tokens to generate. With openai, I would not set this, to let it generate as much as it has room for (and avoid this error)

Thank you for answering, @logan-markewich. I tried that recently, but the prompt then had 5000+ tokens. I eventually found a solution by lowering similarity_top_k to reduce the amount of retrieved content, though I don't know the default value of that parameter. I set it to 2 and removed max_tokens, and it no longer throws the token error.

If you know of another way to reduce token consumption, please let me know.
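
For reference, the working configuration described above would look roughly like the sketch below. It keeps the similarity cutoff and response mode from the original script, drops max_tokens from the LLM, and sets similarity_top_k=2 (which, as far as I know, also matches the retriever's library default); reference_index and prompt are the same objects as in the question.

```python
from llama_index.core.postprocessor import SimilarityPostprocessor
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import VectorIndexRetriever

# Retrieve fewer nodes so the assembled prompt stays well under the context window.
retriever = VectorIndexRetriever(index=reference_index, similarity_top_k=2)

query_engine = RetrieverQueryEngine.from_args(
    retriever=retriever,
    node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.7)],
    response_mode="tree_summarize",
)

result = query_engine.query(prompt)
```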