launchdarkly / python-server-sdk

LaunchDarkly Server-side SDK for Python
https://docs.launchdarkly.com/sdk/server-side/python
Other
38 stars 44 forks source link

StreamingUpdateProcessor reconnect seems to hang when hitting stream_read_timeout #209

Closed pballandras closed 1 year ago

pballandras commented 1 year ago

Is this a support request? šŸ™…

Describe the bug During the outage of May 18th 2023, we saw a number of calls simply hanging. The code using the SDK would simply stop and nothing seemed to be going on. I dug into the code briefly, but couldn't figure out entirely what caused our calls to simply hang. We call in our code client.variation(...) (where client: LDClient) and that works well during non-outage periods. Of course, during outages, stuff goes wrong, but I'm not sure if it's this call that hangs.

If you look at the logs down below, you'll see that the logs (that end there, it's not cut) end at exactly 5min from the time the StreamingUpdateProcessor initialized. In the code, the timeout is set at 5min as well.

So, to describe the bug, it would seem that there might exist an infinite loop around the logic for the connection (or reconnection) to stream.launchdarkly.com.

Edit: Here's where I was in the code:

To reproduce Ok so guys

  1. create an outage at LD :trollface:

No but for real, since the circumstances are kind of extreme, I'm not sure how to reproduce it. My guess is that by timing out the heartbeat on the StreamingUpdateProcessor you can reproduce this, but then again, I can't be sure... Sorry šŸ™

Expected behavior No hanging. Throwing an error would be absolutely fine in that case.

Logs

2023-05-18 11:35:56 INFO     Starting event processor
2023-05-18 11:35:56 INFO     Starting StreamingUpdateProcessor connecting to uri: https://stream.launchdarkly.com/all
2023-05-18 11:35:56 INFO     Waiting up to 5 seconds for LaunchDarkly client to initialize...
2023-05-18 11:35:56 INFO     StreamingUpdateProcessor initialized ok.
2023-05-18 11:35:56 INFO     Started LaunchDarkly Client: OK
[... time goes fast ...]
2023-05-18 11:40:56 WARNING  Unexpected error on stream connection: HTTPSConnectionPool(host='stream.launchdarkly.com', port=443): Read timed out., will retry
2023-05-18 11:40:56 INFO     Will reconnect after delay of 0.704595s
2023-05-18 11:51:11 WARNING  Error posting diagnostic event (will retry): HTTPSConnectionPool(host='events.launchdarkly.com', port=443): Read timed out. (read timeout=15)

SDK version 6.13.3

I thought we were on a more recent version, but looking at the changelog, I didn't find any explicit fix for this, but I may be wrong.

Language version, developer tools Python 3.9

OS/platform Ubuntu 18.04

Additional context If I omitted anything, don't hesitate to ask me, I've been on the other side of support cases so I know users tend to forget the details.

Cheers!

keelerm84 commented 1 year ago

Thank you for writing up this detailed issue, and of course my apologies for the issues you were experiencing.

I will try to look at this as soon as I can and let you know once we have a resolution.

Filed internally as sc-204447

keelerm84 commented 1 year ago

Good morning. I have been looking into this, but so far I haven't been able to reproduce the problem.

To confirm a few things:

pballandras commented 1 year ago

Good morning! It is a hard case I presented you with indeed.

Do you have any special routing rules setup in your environment? Any proxies or firewalls that might affect these connections?

I would think not since the rest of the time we are fine. This happened only twice too and very close to the outage. This said I'll do a bit of research to be sure.

Was there unusually high load on the system during these times?

No, it's actually a pretty light workload.

Are you using a remote data store (e.g. redis) or are you using the default in memory store?

We use the default client, like this:

def _get_launch_darkly_client() -> LDClient:
    sdk_key = os.getenv(LAUNCHDARKLY_SDK_KEY_SECRET_ENV_VARIABLE) or _get_sdk_key()
    ldclient.set_sdk_key(sdk_key)
    return ldclient.get()

You mention in the ticket looking at specific lines, but those don't seem to match up with the version you stated you are using. Just want to confirm this is a problem in 6.13.3 right?

Yes that's my bad, I was convinced (then thoroughly disappointment that we were not) that we were on at least 7.x. So I looked on the 8x and or 7.x only to realize we were on 6.13.3.

Are there other processes within your application which might have locked the GIL and prevented other threads from continuing?

At that point, nothing was in the way, the only thing running was the flag fetch.


Keep in mind this might be somewhat of a hard edge case. I'm going to check out more in details our side too in the meantime.

pballandras commented 1 year ago

I'll close this as I don't think either of us want to spend more time on this. The problem has not come back and happened only during extreme situations, it's simply not worth both our time to investigate more. I'll see you guys again if there's ever another problem of the sort!

Cheers!