microsoft / sample-app-aoai-chatGPT

Sample code for a simple web chat experience through Azure OpenAI, including Azure OpenAI On Your Data.
MIT License

BUG: Getting quota errors in webApp but not in the playground #131

Open ealasgarov opened 1 year ago

ealasgarov commented 1 year ago

Hello! I'm getting this error in the web app every time upon the 2nd message (the 1st one is fine):

Error
Requests to the Creates a completion for the chat message Operation under Azure OpenAI API version 2023-03-15-preview have exceeded token rate limit of your current OpenAI S0 pricing tier. Please retry after 7 seconds. Please go here: https://aka.ms/oai/quotaincrease if you would like to further increase the default rate limit.

But via the Azure OpenAI Studio playground everything works fine for the same deployment/model -- I can ask 10 questions in a row with no issues. Why is that happening? And why "completion" and not "prompt"?

I've tried different models, e.g. 3.5-turbo and gpt-4, new and old versions, and it's the same everywhere. The model currently in use is gpt-3.5-turbo (0301) in West Europe. My token limit is set to 7K.

P.S. I'm actually running the image in a Kubernetes cluster, not in a Web App, but I guess that shouldn't make any difference in this case.
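
For anyone comparing the two paths: if the web app resends the full conversation history (system prompt plus all prior turns) on every request, token usage grows with each message, which could explain the second message tripping a 7K TPM limit while single playground-style questions stay under it. A rough sketch with tiktoken to check this (the message contents are made-up placeholders):

```python
# Rough per-request token estimate, assuming the web app resends the full
# conversation history (system prompt + all prior turns) on every request.
# pip install tiktoken
import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

def estimate_tokens(messages):
    # Ignores the few tokens of per-message chat overhead; close enough here.
    return sum(len(enc.encode(m["content"])) for m in messages)

history = [{"role": "system", "content": "You are an AI assistant that helps people find information."}]

for turn in range(1, 4):
    history.append({"role": "user", "content": "placeholder question " * 100})
    print(f"request {turn}: ~{estimate_tokens(history)} prompt tokens")
    history.append({"role": "assistant", "content": "placeholder answer " * 200})
```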

ealasgarov commented 1 year ago

I've tried playing around with the API version, but no luck:

When in app.py I set the API version to 2023-06-01-preview instead of the originally set 2023-03-15-preview, it's even worse: it fails with the same error straight away on the first message.

With the older 2022-12-01, it says "Resource not found".

I've now left it at the stable release "2023-05-15" -- same behavior as with 2023-03-15-preview.

https://learn.microsoft.com/en-us/azure/ai-services/openai/whats-new#may-2023
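
For reference, with the legacy openai 0.x SDK (the one used here, judging by the traceback below), the API version is just a module-level global, so swapping versions is a one-string change. A minimal sketch, with placeholder endpoint, key, and deployment values:

```python
# Minimal Azure OpenAI setup for the legacy openai 0.x SDK; endpoint, key,
# and deployment name below are placeholders.
import openai

openai.api_type = "azure"
openai.api_base = "https://<your-resource>.openai.azure.com/"
openai.api_version = "2023-05-15"  # stable release; preview versions end in "-preview"
openai.api_key = "<your-key>"

response = openai.ChatCompletion.create(
    engine="<your-deployment>",  # Azure takes the deployment name, not the model name
    messages=[{"role": "user", "content": "Hello"}],
)
print(response["choices"][0]["message"]["content"])
```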

ealasgarov commented 1 year ago

Ahh, maybe it's still something specific to my Kubernetes deployment... this is what I see in the pod logs:

GET /assets/Send-d0601aaa.svg => generated 0 bytes in 0 msecs (HTTP/1.1 304) 4 headers in 185 bytes (0 switches on core 0)
2023-08-03T15:20:54.437250234Z [pid: 1|app: 0|req: 15/15] 172.22.129.13 () {70 vars in 5804 bytes} [Thu Aug  3 15:20:52 2023] POST /conversation => generated 468 bytes in 1582 msecs (HTTP/1.1 200) 2 headers in 72 bytes (1 switches on core 0)
2023-08-03T15:21:00.899604966Z ERROR:root:Exception in /conversation
2023-08-03T15:21:00.899645167Z Traceback (most recent call last):
2023-08-03T15:21:00.899648867Z   File "app.py", line 252, in conversation
2023-08-03T15:21:00.899651467Z     return conversation_without_data(request)
2023-08-03T15:21:00.899653967Z            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2023-08-03T15:21:00.899656668Z   File "app.py", line 214, in conversation_without_data
2023-08-03T15:21:00.899659368Z     response = openai.ChatCompletion.create(
2023-08-03T15:21:00.899661868Z                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2023-08-03T15:21:00.899665768Z   File "/usr/local/lib/python3.11/site-packages/openai/api_resources/chat_completion.py", line 25, in create
2023-08-03T15:21:00.899668468Z     return super().create(*args, **kwargs)
2023-08-03T15:21:00.899670968Z            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2023-08-03T15:21:00.899674568Z   File "/usr/local/lib/python3.11/site-packages/openai/api_resources/abstract/engine_api_resource.py", line 153, in create
2023-08-03T15:21:00.899677168Z     response, _, api_key = requestor.request(
2023-08-03T15:21:00.899680968Z                            ^^^^^^^^^^^^^^^^^^
2023-08-03T15:21:00.899685568Z   File "/usr/local/lib/python3.11/site-packages/openai/api_requestor.py", line 230, in request
2023-08-03T15:21:00.899689069Z     resp, got_stream = self._interpret_response(result, stream)
2023-08-03T15:21:00.899692269Z                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2023-08-03T15:21:00.899695869Z   File "/usr/local/lib/python3.11/site-packages/openai/api_requestor.py", line 624, in _interpret_response
2023-08-03T15:21:00.899699269Z     self._interpret_response_line(
2023-08-03T15:21:00.899702269Z   File "/usr/local/lib/python3.11/site-packages/openai/api_requestor.py", line 687, in _interpret_response_line
    raise self.handle_error_response(
2023-08-03T15:21:00.899708469Z openai.error.RateLimitError: Requests to the Creates a completion for the chat message Operation under Azure OpenAI API version 2023-05-15 have exceeded token rate limit of your current OpenAI S0 pricing tier. Please retry after 53 seconds. Please go here: https://aka.ms/oai/quotaincrease if you would like to further increase the default rate limit.
2023-08-03T15:21:00.899786772Z [pid: 1|app: 0|req: 16/16] 172.22.129.13 () {70 vars in 5805 bytes} [Thu Aug  3 15:21:00 2023] POST /conversation => generated 335 bytes in 14 msecs (HTTP/1.1 500) 2 headers in 91 bytes (1 switches on core 0)

Can't figure out where it went wrong...
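
As a stopgap until the quota is raised, the RateLimitError in the traceback can be caught and retried. A sketch against the openai 0.x SDK; the backoff schedule and attempt count are arbitrary choices, not anything from the sample app:

```python
# Stopgap retry around the rate-limited call; backoff values and attempt
# count are arbitrary and should be tuned to your TPM quota.
import time

import openai


def chat_with_retry(messages, engine, max_attempts=5):
    for attempt in range(max_attempts):
        try:
            return openai.ChatCompletion.create(engine=engine, messages=messages)
        except openai.error.RateLimitError:
            if attempt == max_attempts - 1:
                raise
            # The error text suggests a wait ("retry after N seconds");
            # exponential backoff is a crude stand-in for parsing it.
            time.sleep(2 ** attempt)
```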

pamelafox commented 1 year ago

What TPM do you currently have for your deployment? Each question takes an average of 1,000 tokens, so it is easy to exceed the rate limits if your deployments have low TPM.
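
To make that concrete, some back-of-the-envelope arithmetic for a 7K TPM quota, using the ~1,000-token average above and assuming each turn resends all prior turns (the growth pattern is illustrative):

```python
# Back-of-the-envelope TPM arithmetic: each turn resends all prior turns,
# so per-request cost grows linearly and the minute budget drains fast.
TPM_QUOTA = 7_000
PER_TURN = 1_000  # average tokens per question+answer, per the estimate above

used = 0
for turn in range(1, 5):
    request_tokens = PER_TURN * turn  # new turn plus all resent history
    used += request_tokens
    status = "over" if used > TPM_QUOTA else "under"
    print(f"turn {turn}: ~{request_tokens} tokens this request, "
          f"~{used} this minute ({status} the 7K quota)")
```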

ealasgarov commented 1 year ago

Thanks Pamela, I'm just not sure why I don't get the same result in the OpenAI Studio in that case... I'll try setting the limit higher, then.

JSv4 commented 1 year ago

Having the same issue with a totally different stack, FWIW. Using a llama-index and Python stack in a k8s deployment, and I'm getting this message for roughly one in four to one in ten messages.