Split.IO Client Interferes With Other Threads

mro-rhansen2 commented 2 years ago

Hello,

We've noticed a couple of instances where the split.io client seems to cause issues with other threads running within a single Python 3.8 process. The client (version 9.1.0) is running in in-memory mode and we do not use Django.

The most pressing is a production issue. After a few days of uptime, the Split.IO client enters into what appears to be some sort of endless loop. It looks to be reauthenticating and then deciding what method it should use to fetch changes from the server (streaming/polling). It does so every 50-60min endlessly until the pod is restarted. There is also a kafka consumer client that is running in follow-mode in tandem to the split client within the same process. The kafka client is unable to fetch any additional messages from the cluster while Split.IO is in this state.

I'd love to be able to provide a concrete set of steps to reproduce this behavior. Unfortunately, it takes days to present itself so it isn't something that we have time to devote attention towards. If y'all are unable to assist, then we'll likely need to drop split from our stack entirely because we don't have the bandwidth to be chasing ghosts.

The other issue is related to the first. We've noticed that Split.IO completely blocks the asyncio event loop. Is there some guidance that y'all might have for dealing with this? We know that we can work with asyncio to effectively manage threads using the event loop, but I am personally hesitant to move in that direction given the observed behavior above. I imagine that asyncio would also end up getting blocked as well.

Thanks for any assistance that you can provide. Let us know if there's any additional information that you can use from us to help troubleshoot the problem.

mredolatti commented 2 years ago

Hello @mro-rhansen2, Thank you very much for reaching out. I'm sorry to hear you're having trouble with our SDK. I will do my best to answer some of your questions and see how we can help you.

First of all, the authentication flow is done once per hour because we use 1-hour valid JWTs for our streaming connections. This is a security-driven design decision. The process however is relatively simple and should not have an impact on your application. Basically the steps are as follows:

The current SSE connection is closed (the current SSE thread ends)
A request is made to our authorization service
If a valid JWT is returned a new connection to our streaming service is started on a new thread. Otherwise, periodic tasks are started to hit our services every N seconds and fetch flag updates.

Could you provide more information on how exactly your other threads are affected?

Regarding AsyncIO, unfortunately our SDK doesn't support channeling it's network traffic through it right now, but I'm curious to understand how our SDK running in it's own threads affects your I/O loop, which i'm assuming runs on it's own thread (whether it's the app's main one or not). Would it be possible for you to provide some code samples to reproduce this? Or is this the scenario that only happen after a couple of days?

If you can't share a code to reproduce this issue, would you mind sharing how you setup our SDK and how you setup the event loop? Feel free to contact our official support to share those.

Thank you very much for you patience. Regards, Martin.

mro-rhansen2 commented 2 years ago

The Kafka consumer fetch itself becomes blocked and stops producing new messages. We're using the consumer in follower mode by treating it as an iterable. That loop is the only thing that keeps our process alive so the pod would have simply restarted if there was some issue with the connection itself. Instead, the process stays alive but the consumer appears blocked. The only thing that is still alive is the Split.IO SDK based on the log messages I provided earlier.

We were able to run this for weeks at a time without seeing this behavior prior to including Split.IO in that particular workflow. We're simply assuming that the streaming that Split.IO is doing behind the scenes is the culprit given that it is the only thread that is still doing anything. However; that assumption is currently being reinforced by the fact that the process is no longer blocking after dropping the Split.IO client.

Python supports threads, but threads can block each other by preventing the GIL from being released. This can happen if the running thread is caught up in a long running process that keeps the CPython interpreter loop from running and attempting to release the lock at the predefined 5ms intervals. I have not crawled through the Split.IO stack to see if that is happening but the below article explains the behavior in detail:

https://pythonspeed.com/articles/python-gil/

I actually noticed the AsyncIO edge-case by happenstance. I had developed a command within a CLI project that utilized an AsyncIO event loop. This CLI project in particular had a module that instantiated a singleton instance of the Split.IO client. The command worked fine under test because that module wasn't being loaded. However; as soon as I ran the command against the built CLI, I could see that my command would run until the event loop was started. After that, the only thing that would run at all was the Split.IO client. The resolution was simply to lazy load that Split.IO client so that it wouldn't load until a command that actually needed it was executed - problem solved.

Our client setup is rather simple. You can see it below:

def get_split_client():
    factory = get_factory(settings.SPLIT_API_KEY)
    try:
        factory.block_until_ready(10)  # wait up to 10 seconds
    except TimeoutException:
        # Now the user can choose whether to abort the whole execution, or just keep going
        # without a ready client, which if configured properly, should become ready at some point.
        pass

    return factory.client()

P.S: I was using uvloop instead of the default AsyncIO event loop implementation, which could have been the culprit in that particular use case.

mredolatti commented 2 years ago

Hi @mro-rhansen2 thank you very much for your response.

Actually we have no CPU intensive tasks running on any thread. There are a couple of hash functions that are offloaded to a C lib because they're faster than a pure-python interpreted solution, but they only run on a get_treatment call, not in BG threads. I Agree that the GIL has some edge cases, but because of the way threading works in python (which is usually by doing context switches as soon as an I/O operation (or a sleep) is invoked in the current thread), i would have expected the sdk to perform decently with an event loop running on another dedicated thread.

We will need to run some POCs, replicate the issue and find the best way to approach a solution.

In the meantime, I would recommend:

disabling streaming. This can be easily done when instiating the factory by calling factory = get_factory(settings.SPLIT_API_KEY), config={'streamingEnabled': False})
going through our official support if you haven't already, to make sure that someone gets assigned to look into this issue.

Please let us know if disabling streaming helps.

Once again, we apologize for the issues this is causing. Regards,

Martin.

Tim020 commented 2 years ago

Hi @mro-rhansen2,

Have you tried disabling streaming to see whether this helps?

Thanks,

Tim

mro-rhansen2 commented 2 years ago

Hello again! This bit us again in our current iteration so I've finally got time to muck around with this again. Thanks for the tip on disabling streaming. I didn't notice that in the docs at the time but honestly I was moving fast and furious to try and rectify the production issue.

I will try that and provide feedback once I have any. Unfortunately it is entirely indeterminate insofar as when the problem will present itself so all we can do is make a change and wait a week to see if we had any issues during that timeframe. I'll escalate with the official support team if disabling streaming doesn't get us past our current hurdle.

mro-rhansen2 commented 2 years ago

Just wanted to drop by to confirm that disabling streaming seems to have resolved the problem. It has been nearly two weeks at this point and we have not noticed the Kafka thread becoming unresponsive during that time.

mredolatti commented 2 years ago

Thanks for letting us know. Hopefully at some point we'll be able to work on a proper asyncio integration.

Thanks, Martin

mro-rhansen2 commented 2 years ago

Thanks @mredolatti, but the project that I deactivated streaming on in this case doesn't actually employ asyncio. The conflict seems to be between the kafka-python polling thread and the splitio_client[cpphash] 9.1.0 streaming thread, although the split client does also seem to clash with the uvloop implementation of the asyncio event loop abstraction.

Quick question on deactivating streaming: does that only impact gestures and impressions or are there other side effects that we should be aware of?

mredolatti commented 2 years ago

oh that's weird, can you share what version of the kafka library you were using, and if you have a snippet (no matter how generic) on how you're using those it'd be great as well.

regarding the streaming feature of our sdks, it's used to have the SDK listen on an SSE connection for changes made to the targeting rules, so that it's not periodically polling our APIs just to see if there's anything new. By disabling it, the sdk will fallback to polling, and changes in features may take a couple of seconds more be reflected on an SDK instance. But it has no effect whatsoever on impressions or manually generated events by calling the .track() method on our client

mro-rhansen2 commented 2 years ago

Here yah go, this is a rough abstraction of what things look like from our end. I'm not sure sure if this runs in this exact state because I scratched it together just now but it should serve to demonstrate our setup.

Keep in mind that it would take a good while before we actually ran into an issue where the Kafka polling thread would become unresponsive. If you're going to test it, then you'll need the simulation to run for an extended period of time in order to reproduce the issue. The time it takes to get to that point seems to vary based on how many messages have been processed with the shortest time to failure correlating with higher throughput but that is really just an observation. I have not captured metrics to prove that is the case.

https://gist.github.com/mro-rhansen2/b46493414289b3a2d833fe4f93c88cb3

mro-rhansen2 commented 2 years ago

DOH! Forgot to mention that this is a Python 3.8 project. The code runs within a container hosted on a Kubernetes cluster on the AWS cloud. The base image for our container is python:3.8-slim-buster.

Let me know if I can provide any additional information. I greatly appreciate that you're looking into this though.

mredolatti commented 2 years ago

Thank you! Unfortunately i can't work on this at the moment, but i'll try to have some conversations within the team so that it gets prioritized and looked at in the near future. all the context that may help us reproduce the issue is more than welcome.

Thank you Very much! Martin.

chillaq commented 4 months ago

Hi @mro-rhansen2, the latest version of Python SDK 10.0.1 support asyncio, please upgrade.

Thanks Bilal

splitio / python-client

Split.IO Client Interferes With Other Threads #259