mshumer / gpt-llm-trainer


Token generation limit #1

Open xiscoding opened 1 year ago

xiscoding commented 1 year ago

I was getting the RateLimitReached error yesterday (after around the 80th generation; each prompt is around 10,000 tokens). My simple workaround is below, but is there a better way?

import time
import openai

def generate_examples(tokenizer, prompt, number_of_examples):
    # Generate examples, sleeping for 10 seconds whenever the API rate limit is hit
    prev_examples = []
    for i in range(number_of_examples):
        try:
            print(f'Generating example {i}')
            # Count the tokens that will be sent: the base prompt plus all previous examples
            prompt_tokens = tokenizer.tokenize(prompt)
            prev_examples_tokens = [tokenizer.tokenize(example) for example in prev_examples]
            total_tokens = len(prompt_tokens) + sum(len(tokens) for tokens in prev_examples_tokens)
            print(f'Tokens in prompt and previous examples: {total_tokens}')
            example = generate_example(prompt, prev_examples, temperature)
            print(example)
            prev_examples.append(example)
    #         if i % 5 == 0:
    #             time.sleep(10)
        except openai.error.RateLimitError:
            print("RATELIMITREACHED: waiting 10 seconds")
            time.sleep(10)
    return prev_examples

Here is a local version: https://github.com/xiscoding/local_gpt_llm_trainer

fredzannarbor commented 1 year ago

Having the same issue. Matt probably has a way higher token limit than most of us!

nurena24 commented 1 year ago

same issue

tuanha1305 commented 1 year ago

I believe this error is due to OpenAI rate-limiting your requests. You can request a higher rate limit or adjust your retry backoff. To request an increase, submit the form here: https://docs.google.com/forms/d/e/1FAIpQLSc6gSL3zfHFlL6gNIyUcjkEv29jModHGxg5_XGyr-PrE2LaHw/viewform. You can also learn more about model rate limits and error mitigation: https://platform.openai.com/docs/guides/rate-limits/error-mitigation.
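
For reference, a retry-with-exponential-backoff wrapper along the lines of what the error-mitigation guide describes could look roughly like this (a sketch assuming the pre-1.0 openai SDK used by the notebook; chat_with_backoff is just an illustrative helper, not part of the trainer):

import time
import openai

def chat_with_backoff(messages, model="gpt-4", max_retries=6):
    # Retry an OpenAI chat call with exponential backoff when the rate limit is hit.
    # Assumes the pre-1.0 openai SDK, where the error is openai.error.RateLimitError.
    delay = 1
    for attempt in range(max_retries):
        try:
            return openai.ChatCompletion.create(model=model, messages=messages)
        except openai.error.RateLimitError:
            print(f"Rate limited, retrying in {delay}s (attempt {attempt + 1}/{max_retries})")
            time.sleep(delay)
            delay *= 2  # double the wait after each failure
    raise RuntimeError("Still rate limited after retries")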

xiscoding commented 1 year ago

> I believe this error is due to OpenAI rate-limiting your requests. You can request a higher rate limit or adjust your retry backoff. To request an increase, submit the form here: https://docs.google.com/forms/d/e/1FAIpQLSc6gSL3zfHFlL6gNIyUcjkEv29jModHGxg5_XGyr-PrE2LaHw/viewform. You can also learn more about model rate limits and error mitigation: https://platform.openai.com/docs/guides/rate-limits/error-mitigation.

The problem with increasing the rate limit, for me, is the increase in cost. The code I posted above is basically a simple retry backoff, and I haven't had any issues with it. I was hoping for a solution that limits the token count of the output as it approaches the rate limit, but that messes up the outputs.
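
If the goal is to keep cost and prompt size down rather than just retrying, one option is to cap the tokens sent per request by dropping the oldest previous examples before each call. A rough sketch (the 6000-token budget and the trim_examples helper are illustrative assumptions, not part of the trainer):

def trim_examples(tokenizer, prompt, prev_examples, max_tokens=6000):
    # Drop the oldest previous examples until prompt + examples fit within the token budget.
    budget = max_tokens - len(tokenizer.tokenize(prompt))
    trimmed = list(prev_examples)
    while trimmed and sum(len(tokenizer.tokenize(e)) for e in trimmed) > budget:
        trimmed.pop(0)  # discard the oldest example first
    return trimmed

# Then call generate_example(prompt, trim_examples(tokenizer, prompt, prev_examples), temperature)
# instead of passing the full history.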

ishaan-jaff commented 10 months ago

You can try using the LiteLLM Router if you have multiple deployments of the same model; this will allow you to increase your effective rate limit. Docs: https://docs.litellm.ai/docs/routing

import asyncio
import os

from litellm import Router

model_list = [{ # list of model deployments 
    "model_name": "gpt-3.5-turbo", # model alias 
    "litellm_params": { # params for litellm completion/embedding call 
        "model": "azure/chatgpt-v-2", # actual model name
        "api_key": os.getenv("AZURE_API_KEY"),
        "api_version": os.getenv("AZURE_API_VERSION"),
        "api_base": os.getenv("AZURE_API_BASE")
    }
}, {
    "model_name": "gpt-3.5-turbo", 
    "litellm_params": { # params for litellm completion/embedding call 
        "model": "azure/chatgpt-functioncalling", 
        "api_key": os.getenv("AZURE_API_KEY"),
        "api_version": os.getenv("AZURE_API_VERSION"),
        "api_base": os.getenv("AZURE_API_BASE")
    }
}, {
    "model_name": "gpt-3.5-turbo", 
    "litellm_params": { # params for litellm completion/embedding call 
        "model": "gpt-3.5-turbo", 
        "api_key": os.getenv("OPENAI_API_KEY"),
    }
}]

router = Router(model_list=model_list)

async def main():
    # openai.ChatCompletion.create replacement
    response = await router.acompletion(model="gpt-3.5-turbo",
                    messages=[{"role": "user", "content": "Hey, how's it going?"}])
    print(response)

asyncio.run(main())