stanfordnlp / dspy

DSPy: The framework for programming—not prompting—foundation models
https://dspy.ai
MIT License

Bypassing Context length / MaxToken length of LLMs using DSPy in context self instruct #101

Closed sreenivasmrpivot closed 9 months ago

sreenivasmrpivot commented 1 year ago

I have tried using Llama 2 to generate synthetic data for self-instruct. Unfortunately, my prompts are long, and the prompt/response combination from the Llama 2 13B chat model constantly exceeds the 4096-token limit.

Is there any way to bypass this limitation using DSPy with the Llama 2 model? Should I be using the chat model or some other model to be able to do in-context self-instruct with DSPy and Llama 2?

Are there any examples in the DSPy documentation that I can refer to?

drawal1 commented 1 year ago

I switched to using gpt-3.5-turbo-16k to get around this problem, but it's a paid/closed model. Perhaps someone here can suggest an equivalent open-source/free model.

sreenivasmrpivot commented 1 year ago

I guess the Giraffe model has a longer context and could get around it. So if I understand correctly, DSPy cannot help with this problem directly; the only way is to choose a model with a longer context.

okhat commented 1 year ago

Using a long-context model is the easiest thing.

But DSPy is a general framework. You can implement at least 5-6 different ways to deal with long context in your own logic: think chunking with a map/reduce style, etc.

If you can provide more details, I can suggest 1-2 approaches.
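
For illustration, here is a minimal sketch of the chunking + map/reduce idea as a DSPy module. The signatures, field names, and chunking strategy are assumptions made for this sketch, not anything prescribed by DSPy or this thread:

```python
import dspy

# Illustrative signatures; the task (summarization) and field names are assumptions.
class SummarizeChunk(dspy.Signature):
    """Summarize one chunk of a long document."""
    chunk = dspy.InputField()
    summary = dspy.OutputField()

class CombineSummaries(dspy.Signature):
    """Combine per-chunk summaries into a single answer."""
    summaries = dspy.InputField()
    answer = dspy.OutputField()

class MapReduceOverChunks(dspy.Module):
    def __init__(self, chunk_size=2000):
        super().__init__()
        self.chunk_size = chunk_size
        self.map_step = dspy.Predict(SummarizeChunk)
        self.reduce_step = dspy.Predict(CombineSummaries)

    def forward(self, document):
        # Map: handle each chunk in its own call so no single prompt
        # exceeds the model's context window.
        chunks = [document[i:i + self.chunk_size]
                  for i in range(0, len(document), self.chunk_size)]
        summaries = [self.map_step(chunk=c).summary for c in chunks]
        # Reduce: combine the short per-chunk outputs in one final call.
        return self.reduce_step(summaries="\n\n".join(summaries))
```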

sreenivasmrpivot commented 1 year ago

@okhat I am trying to implement the Gorilla model, which uses APIs from HuggingFace, TensorFlow Hub, and PyTorch Hub. My goal is to generate synthetic data using a fully open-source model and avoid GPT-4 for commercial-viability reasons. So I want to use Llama 2, provide in-context self-instruct prompts, and get some output. However, when I try to do that directly with text prompting, I exceed the 4096 tokens allowed by Llama and end up getting this error:

Exception has occurred: APIError Invalid response object from API: '{"detail":{"object":"error","message":"This model\'s maximum context length is 4096 tokens. However, you requested 6276 tokens (2180 in the messages, 4096 in the completion). Please reduce the length of the messages or completion.","type":"invalid_request_error","param":null,"code":null}}' (HTTP response code was 400)

I am using vLLM. I believe you work with Rick Battle to some extent; I am trying to get this implemented and contribute to Rick's team.

Any suggestions are much appreciated.

okhat commented 1 year ago

Thanks @sreenivasmrpivot. Yes, we collaborate with Rick very frequently!

However, you requested 6276 tokens (2180 in the messages, 4096 in the completion)

Judging from this error, your input isn't actually that long: the prompt is just 2180 tokens. Do you really need 4096 output tokens?

Maybe just set the output to 256 tokens? Or 512?

sreenivasmrpivot commented 1 year ago

@okhat I have attached my actual input prompt here. Do you still think I can get around the problem by limiting the output to 256 or 512 tokens? If so, where do I set the output length in the code?

sample1.txt

The output from the model is expected to have 10 "API-Inst pair" examples, which is pretty long.

If I use Llama 2 13B, which has a max token limit of 4096, is there any way to get this expected output using the combination of DSPy and Llama 2 13B?

If it is not possible, I am considering using https://huggingface.co/NousResearch/Yarn-Llama-2-13b-128k instead of Llama 2 13B.

sreenivasmrpivot commented 1 year ago

@okhat do you have any suggestions or updates for this ^^^?

drawal1 commented 1 year ago

@sreenivasmrpivot you can increase max_tokens as follows: llm = dspy.OpenAI(model='gpt-3.5-turbo-16k', max_tokens=8000)

Off the top of my head, could you generate one API-Inst pair at a time and pass in the instructions of the previously generated API-Inst pairs, asking the model not to generate an API-Inst pair similar to the ones already generated?
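
As a rough, hypothetical sketch of that one-at-a-time loop in DSPy (the signature, field names, and the way previous instructions are tracked are all assumptions for illustration):

```python
import dspy

# Hypothetical signature; field names are not from this thread or from DSPy itself.
class GenerateApiInstPair(dspy.Signature):
    """Generate one new API-Inst pair that differs from those already listed."""
    api_docs = dspy.InputField(desc="documentation for the target API")
    existing_instructions = dspy.InputField(desc="instructions generated so far")
    pair = dspy.OutputField(desc="one new API-Inst pair")

generate_pair = dspy.Predict(GenerateApiInstPair)

pairs, seen_instructions = [], []
for _ in range(10):
    # Each call stays small: only the prior instructions (not the full pairs)
    # are fed back, and only one pair is requested per completion.
    result = generate_pair(api_docs=api_docs,  # api_docs assumed to be defined
                           existing_instructions="\n".join(seen_instructions))
    pairs.append(result.pair)
    # Crude de-duplication key: the first line of the generated pair.
    seen_instructions.append(result.pair.splitlines()[0] if result.pair else "")
```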

sreenivasmrpivot commented 1 year ago

@drawal1 I like the suggestion regarding max_tokens and generating one pair at a time, though I am not sure whether the generation would avoid repetitions unless I try it.

However, since gpt-3.5-turbo-16k has a 16k context length, it might work. Would the above approach work for Llama 2, which has only a 4k context length?

drawal1 commented 1 year ago

4k tokens is roughly 3,000 words, so the Llama 2 4k context might work. You won't know until you try.

okhat commented 1 year ago

max_tokens refers to the maximum output tokens, @sreenivasmrpivot.

Setting it to 4000 for Llama only makes sense if your input is empty, which it isn't.

Just set it to 512, or consider restructuring the output to generate one item at a time, as @drawal1 suggests.
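
For reference, a sketch of capping the completion length when vLLM is serving Llama 2 behind its OpenAI-compatible endpoint. The model name, URL, and the exact client arguments (api_base, api_key) are assumptions about this particular setup:

```python
import dspy

# Assumed setup: vLLM serving Llama 2 13B Chat on localhost via the OpenAI-compatible API.
llm = dspy.OpenAI(
    model="meta-llama/Llama-2-13b-chat-hf",
    api_base="http://localhost:8000/v1/",  # assumed vLLM endpoint
    api_key="EMPTY",
    max_tokens=512,  # cap the completion so prompt + completion stays under 4096
)
dspy.settings.configure(lm=llm)
```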

okhat commented 1 year ago

Is this resolved?

ahoho commented 10 months ago

I'm also having an issue with this: if I compile a Module with a teleprompter and then try to run it forward, it often creates prompts that are too long. Is there a way to avoid this?

okhat commented 10 months ago

Hey @ahoho yes happy to help. I may need more details but basically:

You can reduce the teleprompter's parameters (max_bootstrapped_demos and max_labeled_demos) for a start. They default to 4 and 16, respectively. Maybe set them to 1 and 0 to be extreme.
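
A minimal sketch of compiling with those reduced limits (the metric, program, and trainset names are placeholders, and BootstrapFewShotWithRandomSearch is assumed as the teleprompter):

```python
from dspy.teleprompt import BootstrapFewShotWithRandomSearch

# Placeholder metric/program/trainset; only the demo limits are the point here.
teleprompter = BootstrapFewShotWithRandomSearch(
    metric=my_metric,            # assumed to be defined elsewhere
    max_bootstrapped_demos=1,    # default is 4
    max_labeled_demos=0,         # default is 16
)
compiled_program = teleprompter.compile(MyProgram(), trainset=trainset)
```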

ahoho commented 10 months ago

Yes, I think this is the issue: the demonstrations end up creating a prompt that's too long. It's because I'm mirroring a RAG setting for classification, and the context is repeated for each of the bootstrapped demos.

okhat commented 10 months ago

@ahoho Oh wow, I just saw this by accident; not sure why I missed it earlier.

Did my suggestion resolve it? Setting max_bootstrapped_demos=1 and max_labeled_demos=0, assuming you're using BootstrapFewShotWithRandomSearch.

ahoho commented 9 months ago

Sorry, I also missed your response! Yes, that did resolve the problem