I switched to using gpt-3.5-turbo-16k to get around this problem, but it's a paid/closed model. Perhaps someone here can suggest an equivalent open-source/free model.
I guess the Giraffe model has a longer context and can get around it. So if I understand correctly, DSPy cannot help with this problem; the only way is to choose a model with a longer context.
Using a long-context model is the easiest thing.
But DSPy is a general framework. You can implement at least 5-6 different ways to deal with long context in your own logic. Think chunking with map/reduce-style aggregation, etc.
If you can provide more details, I can suggest 1-2 approaches.
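For illustration, a map/reduce-style pass over chunks might look roughly like the sketch below. The signatures and the chunking step here are hypothetical, just to show the shape; it assumes you have already configured an LM via dspy.settings.configure.

```python
import dspy

# Hypothetical string signatures, only to illustrate the map/reduce shape.
summarize_chunk = dspy.Predict("chunk -> summary")
combine_summaries = dspy.Predict("summaries -> answer")

def map_reduce(long_text, chunk_size=3000):
    # Map: process each chunk independently so no single call exceeds the context window.
    chunks = [long_text[i:i + chunk_size] for i in range(0, len(long_text), chunk_size)]
    summaries = [summarize_chunk(chunk=c).summary for c in chunks]
    # Reduce: combine the per-chunk outputs in one final call.
    return combine_summaries(summaries="\n".join(summaries)).answer
```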
@okhat I am trying to implement the Gorilla model, which uses APIs from HuggingFace, TensorFlow Hub, and PyTorch Hub. My goal is to generate synthetic data using a fully open-source model and avoid using GPT-4 for commercial-viability reasons.
So I want to make use of Llama 2, provide in-context self-instruct prompts, and get some output.
However, when I try to do that directly using text prompting, I exceed the 4096 tokens allowed by Llama and end up getting the following error:
Exception has occurred: APIError Invalid response object from API: '{"detail":{"object":"error","message":"This model\'s maximum context length is 4096 tokens. However, you requested 6276 tokens (2180 in the messages, 4096 in the completion). Please reduce the length of the messages or completion.","type":"invalid_request_error","param":null,"code":null}}' (HTTP response code was 400)
I am using vLLM. I believe you work with Rick Battle to some extent; I am trying to get this implemented and contribute it to Rick's team.
Any suggestions are much appreciated.
Thanks @sreenivasmrpivot. Yes we collaborate with Rick very frequently!
However, you requested 6276 tokens (2180 in the messages, 4096 in the completion)
This error suggests your input isn't actually that long. The prompt is just 2180 tokens. Do you need 4096 output tokens?
Maybe just set the output to 256 tokens? Or 512?
@okhat I have attached my actual input prompt here. Do you still think I can get around the problem by limiting the output to 256 or 512 tokens? If yes, where can I set the output length in the code?
The output from the model is expected to contain 10 "API-Inst pair" examples, which is pretty long.
If I use Llama 2 13B, which has a max of 4096 tokens, is there any way to get this expected output using the combination of DSPy and Llama 2 13B? If it is not possible, I am considering using https://huggingface.co/NousResearch/Yarn-Llama-2-13b-128k instead of Llama 2 13B.
@okhat do you have any suggestions or updates for this ^^^?
@sreenivasmrpivot you can increase max_tokens as follows:
llm = dspy.OpenAI(model='gpt-3.5-turbo-16k', max_tokens=8000)
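Since you're serving Llama 2 via vLLM, something along these lines might apply instead. Treat the exact client name, model id, and kwarg plumbing as assumptions and check the DSPy docs for your version; the key point is keeping the completion budget well under the 4096-token window.

```python
import dspy

# Sketch only: exact client class and kwargs may differ across DSPy versions.
llama = dspy.HFClientVLLM(
    model="meta-llama/Llama-2-13b-chat-hf",  # assumed model id on your vLLM server
    port=8000,                               # assumed vLLM server port
    url="http://localhost",
    max_tokens=512,                          # cap the completion, leaving room for the prompt
)
dspy.settings.configure(lm=llama)
```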
Off the top of my head, could you generate one API-Inst pair at a time and pass in the "instruction"s of the previously generated API-Inst pairs, asking the model not to generate an API-Inst pair similar to the ones already generated?
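Roughly something like this. It's a hedged sketch: the GenerateAPIInstPair signature and field names are made up for illustration, and api_doc stands in for your own API documentation string.

```python
import dspy

class GenerateAPIInstPair(dspy.Signature):
    """Generate one new API-Inst pair that is not similar to the previously generated instructions."""
    api_doc = dspy.InputField(desc="API documentation for the target call")
    previous_instructions = dspy.InputField(desc="instructions already generated, to avoid repeats")
    api_inst_pair = dspy.OutputField(desc="one new instruction/API-call pair")

generate_pair = dspy.Predict(GenerateAPIInstPair)

pairs, seen_instructions = [], []
for _ in range(10):
    # Each call only carries the short list of prior instructions, not the full prior outputs.
    result = generate_pair(
        api_doc=api_doc,  # your API documentation string
        previous_instructions="\n".join(seen_instructions) or "none yet",
    )
    pairs.append(result.api_inst_pair)
    seen_instructions.append(result.api_inst_pair)
```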
@drawal1 I like the suggestion regarding max_tokens and generating one pair at a time, though I am not sure whether the generation would avoid repetitions unless I try it.
However, since gpt-3.5-turbo-16k has a 16k context length, it might work. Would the above approach work for Llama 2, which has only a 4k context length?
4k tokens is roughly 3000 words, so Llama 2's 4k context might work. You won't know until you try.
max_tokens refers to the maximum output tokens, @sreenivasmrpivot.
Setting it to 4000 for Llama only makes sense if your input is empty, which it isn't.
Just set it to 512, or consider restructuring the output to be generated one pair at a time, as @drawal1 suggests.
Is this resolved?
I'm also having an issue with this: if I compile a Module with a teleprompter, then try to run it forward, it often creates prompts that are too long. Is there a way to avoid this?
Hey @ahoho, yes, happy to help. I may need more details, but basically:
You can reduce the parameters of the teleprompter (max_bootstrapped_demos and max_labeled_demos) for a start. They default to 4 and 16, respectively. Maybe set them to 1 and 0 to be extreme.
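For illustration, a rough sketch of what that looks like, assuming BootstrapFewShotWithRandomSearch; my_metric, MyModule, and trainset are placeholders for your own metric, program, and training set.

```python
from dspy.teleprompt import BootstrapFewShotWithRandomSearch

# Fewer demos means shorter compiled prompts; these are the extreme settings mentioned above.
teleprompter = BootstrapFewShotWithRandomSearch(
    metric=my_metric,            # your existing metric
    max_bootstrapped_demos=1,    # default is 4
    max_labeled_demos=0,         # default is 16
)
compiled_program = teleprompter.compile(MyModule(), trainset=trainset)
```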
Yes, I think this is the issue: the demonstrations end up creating a prompt that's too long. I think it's because I'm mirroring a RAG setting for classification, and the context is repeated for each of the bootstrapped demos.
@ahoho Oh wow I just saw this by accident, not sure why I missed it earlier.
Did my suggestion resolve it? Setting max_bootstrapped_demos=1 and max_labeled_demos=0, assuming you're doing BootstrapFewShotWithRandomSearch
Sorry, I also missed your response! Yes, that did resolve the problem
I have tried using Llama 2 to generate synthetic data for self-instruct. Unfortunately, my prompts are long, and the prompt/response combination from the Llama 2 13B chat model constantly exceeds the 4096-token limit.
Is there any way to get around this limitation using DSPy with the Llama 2 model? Should I be using the chat model or some other model to be able to use in-context self-instruct with DSPy and Llama 2?
Are there any examples in DSPy documentation which I can refer to?