openai / openai-python

The official Python library for the OpenAI API
https://pypi.org/project/openai/
Apache License 2.0

Constant timeouts after multiple calls with async #769

Closed Inkorak closed 9 months ago

Inkorak commented 10 months ago

Confirm this is an issue with the Python library and not an underlying OpenAI API

Describe the bug

Constant timeouts after multiple asynchronous calls. This was discovered when using the LlamaIndex framework: when it makes calls through this library via the async openai-python client, requests begin timing out consistently. The same calls work fine without async, or with async on an older version such as 0.28.

To Reproduce

Several calls in a row, for example to embeddings, wrapped with async.

Code snippets

No response

OS

ubuntu

Python version

Python 3.11.4

Library version

v1.2.0 and newer

RobertCraigie commented 10 months ago

Hi @Inkorak, I can't reproduce the issue you're seeing. Can you share a code snippet?

This snippet passes for me:

import anyio
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def main() -> None:
    for _ in range(10):
        await client.embeddings.create(input="Hello world!", model="text-embedding-ada-002")

anyio.run(main)

ashwinsr commented 10 months ago

I can confirm this issue is affecting us as well. We recently upgraded from 0.28 to 1.2.3, and 12 hours later the timeouts began.

RobertCraigie commented 10 months ago

@ashwinsr can you share any more details?

Is this only happening when the client has been in use for a prolonged period of time?

ashwinsr commented 10 months ago

I'm trying really hard to build a minimal failing example, but I haven't gotten one yet. Basically, we have a FastAPI server that uses the async OpenAI client with streaming responses. After running for a while, the vast majority of calls to await client.chat.completions.create time out.

We are currently on 1.2.3.

Any suggestions on what we can do to troubleshoot this / help you fix? This is a P0 for us right now.

RobertCraigie commented 10 months ago

Are you seeing connection pool timeouts or is it a request timeout?

ashwinsr commented 10 months ago

We are seeing pool timeouts and some request timeouts. Give me a second and I'll pull some more specific logs for you.


RobertCraigie commented 10 months ago

Okay, there was a bug reported recently with streaming responses not being closed correctly. But I did manage to reproduce that and push a fix so I'm surprised you're still seeing connection pool timeouts: https://github.com/openai/openai-python/issues/763

Do you have a lot of concurrent requests happening at once?

ashwinsr commented 10 months ago

@RobertCraigie We are seeing

  1. Some httpcore.PoolTimeout errors
  2. Some regular timeouts (possibly FastAPI timing out while waiting on OpenAI; apologies, something just logged a generic timeout error, and we will improve our logging here)

Thoughts?

ashwinsr commented 10 months ago

Not that many concurrent requests (think <20 at a time)

ashwinsr commented 10 months ago

Here's one traceback:

File "/usr/local/lib/python3.10/site-packages/openai/_base_client.py", line 1299, in _request
    response = await self._client.send(request, auth=self.custom_auth, stream=stream)
  File "/usr/local/lib/python3.10/site-packages/sentry_sdk/integrations/httpx.py", line 137, in send
    rv = await real_send(self, request, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/httpx/_client.py", line 1620, in send
    response = await self._send_handling_auth(
  File "/usr/local/lib/python3.10/site-packages/httpx/_client.py", line 1648, in _send_handling_auth
    response = await self._send_handling_redirects(
  File "/usr/local/lib/python3.10/site-packages/httpx/_client.py", line 1685, in _send_handling_redirects
    response = await self._send_single_request(request)
  File "/usr/local/lib/python3.10/site-packages/httpx/_client.py", line 1722, in _send_single_request
    response = await transport.handle_async_request(request)
  File "/usr/local/lib/python3.10/site-packages/httpx/_transports/default.py", line 352, in handle_async_request
    with map_httpcore_exceptions():
  File "/usr/local/lib/python3.10/contextlib.py", line 153, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/usr/local/lib/python3.10/site-packages/httpx/_transports/default.py", line 77, in map_httpcore_exceptions
    raise mapped_exc(message) from exc

RobertCraigie commented 10 months ago

Okay thanks, do you have debug logging enabled?

If you could share debug logs for openai, httpx & httpcore it would be incredibly helpful.
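For anyone else who needs to capture these logs, the standard library logging module can surface debug output from all three packages; a minimal sketch (the logger names follow the usual convention of matching the package names, and the format string is just an example):

```python
import logging

# Route DEBUG-level output from the client and its HTTP stack to stderr.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)
for name in ("openai", "httpx", "httpcore"):
    logging.getLogger(name).setLevel(logging.DEBUG)
```

The openai package also documents an OPENAI_LOG=debug environment variable as an alternative way to enable its own logger.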

ashwinsr commented 10 months ago

Also some regular timeout errors:

File "open_ai.py", line 111, in get_function_chat_completion
    response = await client.chat.completions.create(
  File "/usr/local/lib/python3.10/site-packages/openai/resources/chat/completions.py", line 1191, in create
    return await self._post(
  File "/usr/local/lib/python3.10/site-packages/openai/_base_client.py", line 1480, in post
    return await self.request(cast_to, opts, stream=stream, stream_cls=stream_cls)
  File "/usr/local/lib/python3.10/site-packages/openai/_base_client.py", line 1275, in request
    return await self._request(
  File "/usr/local/lib/python3.10/site-packages/openai/_base_client.py", line 1331, in _request
    return await self._retry_request(options, cast_to, retries, stream=stream, stream_cls=stream_cls)
  File "/usr/local/lib/python3.10/site-packages/openai/_base_client.py", line 1362, in _retry_request
    return await self._request(
  File "/usr/local/lib/python3.10/site-packages/openai/_base_client.py", line 1331, in _request
    return await self._retry_request(options, cast_to, retries, stream=stream, stream_cls=stream_cls)
  File "/usr/local/lib/python3.10/site-packages/openai/_base_client.py", line 1362, in _retry_request
    return await self._request(
  File "/usr/local/lib/python3.10/site-packages/openai/_base_client.py", line 1332, in _request
    raise APITimeoutError(request=request) from err
openai.APITimeoutError: Request timed out.

ashwinsr commented 10 months ago

@RobertCraigie unfortunately we don't have debug logging enabled already, and turning it on now might not help much because we're likely going to have to downgrade to the old API version until we can get this figured out (we can't just let our production traffic fail...)

RobertCraigie commented 10 months ago

@ashwinsr okay no worries, I would suggest trying to explicitly close stream responses (see issue linked earlier for an example) if you can before downgrading. I'll try to figure out what's happening.
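For reference, the explicit-close workaround for streamed chat completions looks roughly like this. This is a sketch, not an official pattern: it assumes the v1 SDK's AsyncStream exposes an awaitable close() method, and the model name is just a placeholder.

```python
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def stream_completion(messages: list[dict]) -> str:
    stream = await client.chat.completions.create(
        model="gpt-3.5-turbo",  # placeholder model
        messages=messages,
        stream=True,
    )
    chunks: list[str] = []
    try:
        async for chunk in stream:
            delta = chunk.choices[0].delta.content
            if delta:
                chunks.append(delta)
    finally:
        # Explicitly release the connection back to the pool, even if the
        # consumer exits early or an exception is raised mid-stream.
        await stream.close()
    return "".join(chunks)
```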

ashwinsr commented 10 months ago

Got it, I'll do that. What would the code snippet be to close the connection for your embedding example at the top of the page?

RobertCraigie commented 10 months ago

Unfortunately you'd likely have to update your code to use raw responses, https://github.com/openai/openai-python?tab=readme-ov-file#accessing-raw-response-data-eg-headers. I would be very surprised if standard requests are a cause of this issue and it would help narrow this down if you left them as-is for now but I totally understand if you'd rather explicitly close responses there as well.

Also just to be clear, you definitely shouldn't have to explicitly close responses, I just suggested it as a temporary workaround so you don't have to downgrade.
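The raw-response variant mentioned above looks roughly like this; a sketch of the with_raw_response API from the linked README section, with create_embedding being a hypothetical wrapper name of my own:

```python
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def create_embedding(text: str):
    # .with_raw_response exposes the underlying HTTP response alongside
    # the parsed result, which gives you a handle to manage explicitly.
    response = await client.embeddings.with_raw_response.create(
        input=text,
        model="text-embedding-ada-002",
    )
    return response.parse()  # the usual typed CreateEmbeddingResponse
```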

ashwinsr commented 10 months ago

Alright @RobertCraigie we turned on debug logging, and left as is. I'll update here with the next set of failed logs to see if we can find the root cause.

Liu-Da commented 10 months ago

same situation

wistanch commented 10 months ago

Running into the same issue, had to swap the library out for direct calls to OpenAI API with aiohttp.

RobertCraigie commented 10 months ago

Thank you all for the additional confirmations. Can you share any more details about your setup?

If anyone can share a reproduction that would also be incredibly helpful.

RobertCraigie commented 10 months ago

Also any examples of code using the openai package would be helpful.

RobertCraigie commented 10 months ago

Additionally, we did recently fix a bug related to this so please ensure you're on the latest version! v1.2.3

RobertCraigie commented 10 months ago

Update: we've been able to reproduce the httpx.ReadTimeout issue but I have not been able to reproduce the pool timeout issue.

I have been able to reproduce the httpx.ReadTimeout issue while making raw requests using httpx directly so this may not be an issue with the SDK itself.

This issue may be related: https://github.com/encode/httpx/issues/1171

The underlying error I get is this:

Traceback (most recent call last):
  File "/Users/robert/stainless/stainless/dist/openai-python/.venv/lib/python3.9/site-packages/anyio/streams/tls.py", line 131, in _call_sslobject_method
    result = func(*args)
  File "/Users/robert/.rye/py/cpython@3.9.18/install/lib/python3.9/ssl.py", line 889, in read
    v = self._sslobj.read(len)
ssl.SSLWantReadError: The operation did not complete (read) (_ssl.c:2633)

RobertCraigie commented 10 months ago

I've pushed a fix for the httpx.ReadTimeout issue I managed to reproduce, this will be included in the next release: https://github.com/openai/openai-python/pull/804

RobertCraigie commented 10 months ago

A fix has been released in v1.2.4! Please let us know if that fixes the issue for you.

rattrayalex commented 10 months ago

Just to clarify, we'll now auto-retry in these cases, but that retry will only happen after the request times out (which by default is 10 minutes, because some chat completions can take quite a long time).

You can configure the timeout to be lower if it makes sense for your application, and we've filed https://github.com/openai/openai-python/issues/809 to track ideas for smarter timeout windows.

However, there remains an underlying issue in httpx or one of its sub-dependencies related to hangs after SSLWantReadError, which we've filed here: https://github.com/encode/httpx/discussions/2941

We're working to debug the underlying issue in the httpx, anyio, and/or ssl libs.

zhengligs commented 10 months ago

yeah, i still get hangs when using the async APIs and calling

await client.beta.threads.messages.create

the second time in the same process.

The synchronous APIs don't have this problem.

bmax commented 10 months ago

also still getting this error: traceback for reference:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/anyio/streams/tls.py", line 131, in _call_sslobject_method
    result = func(*args)
  File "/usr/local/lib/python3.10/ssl.py", line 917, in read
    v = self._sslobj.read(len)
ssl.SSLWantReadError: The operation did not complete (read) (_ssl.c:2578)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/httpcore/_backends/anyio.py", line 34, in read
    return await self._stream.receive(max_bytes=max_bytes)
  File "/usr/local/lib/python3.10/site-packages/anyio/streams/tls.py", line 196, in receive
    data = await self._call_sslobject_method(self._ssl_object.read, max_bytes)
  File "/usr/local/lib/python3.10/site-packages/anyio/streams/tls.py", line 138, in _call_sslobject_method
    data = await self.transport_stream.receive()
  File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 1203, in receive
    await self._protocol.read_event.wait()
  File "/usr/local/lib/python3.10/asyncio/locks.py", line 214, in wait
    await fut
  File "/usr/local/lib/python3.10/asyncio/futures.py", line 285, in __await__
    yield self  # This tells Task to wait for completion.
  File "/usr/local/lib/python3.10/asyncio/tasks.py", line 304, in __wakeup
    future.result()
  File "/usr/local/lib/python3.10/asyncio/futures.py", line 196, in result
    raise exc
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/httpcore/_exceptions.py", line 10, in map_exceptions
    yield
  File "/usr/local/lib/python3.10/site-packages/httpcore/_backends/anyio.py", line 32, in read
    with anyio.fail_after(timeout):
  File "/usr/local/lib/python3.10/site-packages/anyio/_core/_tasks.py"

rattrayalex commented 10 months ago

Which version is that error still occurring on?


bmax commented 10 months ago

@rattrayalex openai==1.2.4

benrapport commented 10 months ago

I am also still seeing this error as well.

OpenAI==1.2.4, Python==3.10.12, httpx==0.24.1, httpcore==0.17.3

RobertCraigie commented 10 months ago

Can you share any debug logs? These exceptions should be retried and I haven't seen it fail twice in a row yet.

https://github.com/openai/openai-python?tab=readme-ov-file#logging

Oscmage commented 10 months ago

I did the same migration from 0.28 to 1.2.4

I ended up getting timeout issues with the following code:

client = openai.AsyncOpenAI(
    api_key=api_key,
)
await client.chat.completions.create(
    model="<MODEL NAME HERE>",
    messages=messages,
    timeout=10,
)

When I instead moved the timeout to be set on the client, it all started working again:

client = openai.AsyncOpenAI(
    api_key=api_key,
    timeout=10,
)
await client.chat.completions.create(
    model="<MODEL NAME HERE>",
    messages=messages,
)

I'll see if I can get the debug logs for the old version.

RobertCraigie commented 10 months ago

Thanks, from what I've seen this issue occurs entirely randomly. So the reason your timeout change "fixed" the issue is likely just because you haven't been unlucky enough to run into it again yet.

Oscmage commented 10 months ago

Yup, looks like you are right.

gench commented 10 months ago

I have the same issue! I just updated the openai package from 0.28 to 1.2.4 and unfortunately it started hanging on async calls, but this behaviour is totally random. Sometimes it completes everything perfectly; there is no relation between the number of async calls and the probability of hanging. It may complete 100 async calls perfectly yet fail on 2 calls.

client = AsyncOpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
await client.chat.completions.create(
    messages=messages,
    **model_parameters
)

benrapport commented 10 months ago

Is there any workaround folks have found suitable for the time being?

gench commented 10 months ago

The only workaround I could find is to revert back to 0.28 and use the previous way of calling:

openai.ChatCompletion.create(
    messages=messages,
    **model_parameters
)

rattrayalex commented 10 months ago

Thanks everyone for your patience, we're working hard on this. It's a very tricky bug in httpx, our underlying http client (or perhaps below that in the anyio and/or ssl packages).

We're not aware of a workaround for this right now, other than downgrading or making that request with an alternative library like requests or aiohttp (though the bug can still appear in those libraries, just at a reduced frequency).

RobertCraigie commented 10 months ago

Would anyone that is still seeing ReadTimeout exceptions raised when using v1.2.4 or greater be able to share a full stack trace?

As of that version, the library should be retrying these requests after timeout (default is 10min) and the requests typically succeed after retrying.

Setting as tight a timeout as your application can handle is the best workaround we know of at this time, e.g.,

from openai import AsyncOpenAI

client = AsyncOpenAI(timeout=30)  # 30s timeout by default across all API calls

await client.chat.completions.create(messages=[…], timeout=30)  # alternatively, configure per request

makaralaszlo commented 10 months ago

The problem still exists in 1.3.2. After the content security policies are triggered multiple times, requests time out and the whole application freezes.

If we restart the application using the openai library, it will work again smoothly (until it is broken with the security policies).

kmmbvnr commented 9 months ago

May I suggest pinning this issue at the top of the GitHub issues list? I spent a significant amount of time dealing with this problem before realizing that others are facing the same issue.

makaralaszlo commented 9 months ago

We managed to find the problem. If you use multiple threads (for example, 2) that run async coroutines, and both use the same AsyncOpenAI object, it gets stuck.

Interestingly, with two threads A and B: we create the OpenAI object on thread A and then call its methods simultaneously from both threads. If thread B starts before thread A, thread B gets stuck on the connection pool lock in httpcore's connection module. But if thread A starts first, then thread B also finishes without getting stuck.

The solution in our case was to create a separate AsyncOpenAI object on each of threads A and B.
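The per-thread pattern described above can be sketched without the openai package at all. Here make_client is a hypothetical stand-in for openai.AsyncOpenAI(...); the point is simply that each thread runs its own event loop and constructs its own client inside that loop, rather than sharing one client across threads:

```python
import asyncio
import threading

results = []

def make_client() -> object:
    return object()  # stand-in for: openai.AsyncOpenAI(api_key=...)

def worker(name: str) -> None:
    async def run() -> None:
        client = make_client()  # created inside this thread's own loop
        # ... await client.chat.completions.create(...) would go here ...
        results.append((name, client))

    asyncio.run(run())  # each thread owns a separate event loop

threads = [threading.Thread(target=worker, args=(f"thread-{i}",)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

After the join, each thread has appended its own distinct client instance, so no connection pool lock is ever shared across loops.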

rattrayalex commented 9 months ago

Thank you for the suggestion @kmmbvnr , we've done that!

Thank you for the report @makaralaszlo – is there any chance you could share a repro script or repo?

@agronholm, the maintainer of anyio, a core package that underlies httpx, spent a fair bit of time looking into this and concluded that despite user reports, the problem does not seem isolated to this client library. It can happen with either httpx (which this version uses) or aiohttp (which v0.28.0 uses), and is a result of misbehaving servers or networking layers.

His findings are here: https://github.com/encode/httpx/discussions/2941#discussioncomment-7608766

OpenAI intends to work on the server-side issue causing this problem, but no progress is expected for at least the next week. We know this is a material problem and we really wish there was more we could do right now – we've spent many hours debugging.

For visibility, I'll re-share that the best workaround we know of today is to reduce the timeout to the minimal tolerable timeout:

from openai import AsyncOpenAI

client = AsyncOpenAI(timeout=30)  # 30s timeout by default across all API calls

await client.chat.completions.create(messages=[…], timeout=30)  # alternatively, configure per request

If you're making a number of requests simultaneously, you may also want to increase your connection pool limits as the defaults are a little tight and can exacerbate this issue.
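Raising the pool limits means handing the SDK a custom httpx client; a sketch, assuming the documented http_client constructor parameter and httpx's Limits API (the numbers here are illustrative, not recommendations):

```python
import httpx
from openai import AsyncOpenAI

# Widen the connection pool beyond the httpx defaults so bursts of
# concurrent requests are less likely to queue behind each other.
client = AsyncOpenAI(
    timeout=30.0,  # tight timeout, per the workaround above
    http_client=httpx.AsyncClient(
        limits=httpx.Limits(
            max_connections=200,           # total concurrent connections
            max_keepalive_connections=50,  # idle connections kept warm
        ),
    ),
)
```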

makaralaszlo commented 9 months ago

The following example codes show how the error occurs and how to solve it: https://github.com/karpator/openai_threading_async_error

agronholm commented 9 months ago

Why are you involving threads there if you're already using async?

makaralaszlo commented 9 months ago

This is from an example in a large monolithic application, where async functions are used for the I/O tasks, and CPU-heavy workloads that the async tasks don't need to wait on are sent out to a new thread. The same thing occurs if we use asyncio's run_in_executor() function.

Using OpenAI 0.28.1, it worked perfectly before.

agronholm commented 9 months ago

The biggest problem with stucking_example.py is that it's creating multiple event loops. I cannot fathom how it ever worked before, but that must've been by chance. I suggest you use the synchronous API instead.

makaralaszlo commented 9 months ago

Yeah, you are right, @agronholm: the main problem in our case is not a library problem, it just happened to work before. But I think it's possible that others in this thread made the same mistake, so it would be worth checking whether the same conditions apply.

In the example, only one event loop is created per thread. The functions are executed on separate threads using the AsyncThreadingHelper class. If both threads try to access the same instance of OpenAIAdapter concurrently, it can result in race conditions. That was the problem in our case, and it can be solved by using a separate OpenAIAdapter instance for each thread, or by creating a thread-safe mechanism to synchronize access to the shared resource, as you mentioned.

@Inkorak, if you are referring to this library https://github.com/run-llama/llama_index, it might have the same problem, since it uses async and threading at the same time. (I didn't dive deep into the code, but it's worth checking.)