launchdarkly / erlang-server-sdk

LaunchDarkly Server-Side SDK for Erlang/Elixir

Improve SDK concurrency and fault tolerance #132

Open beligante opened 2 days ago

beligante commented 2 days ago

Is your feature request related to a problem? Please describe.

We currently run the Erlang SDK at scale in our services. Recently, though, our team has started to hit what we believe is an issue caused by the growing number of feature flag evaluations we perform at runtime. We have noticed small crashes in the SDK when some of its processes are too busy, either working through the messages in their mailbox or communicating with the LD APIs.

Most of the problems start with this crash: [image]

What we observe here (I'm using the module names for easier reference) is that ldclient_event_server attempts to talk synchronously to ldclient_event_process_server while that process is busy. Our guess is that it is sending a batch of events to the LD server while a large backlog of messages sits in its mailbox. The call times out, which crashes the ldclient_event_server process. So what is the main problem here?

ldclient_event_server is a singleton process, which means that if it crashes, every other process attempting to send messages to it will also crash, because the process no longer exists.

This is the root cause of the problem we're trying to solve here. We would like the SDK team to handle these situations more gracefully, and beyond that we would like to make a few more suggestions.
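To illustrate what "more gracefully" could mean on the calling side, here is a minimal sketch. It assumes the callers reach the server through `gen_server:call/3`; the module and function names (`safe_call_sketch`, `safe_call/3`) are hypothetical and not part of the SDK. A busy or missing server then produces an error tuple instead of an exit in the caller.

```erlang
-module(safe_call_sketch).
-export([safe_call/3]).

%% Hypothetical helper, not part of the SDK: wraps gen_server:call/3 so that
%% a busy (timeout) or dead (noproc) server yields an error tuple instead of
%% crashing the calling process.
-spec safe_call(atom() | pid(), term(), timeout()) ->
    {ok, term()} | {error, term()}.
safe_call(Server, Request, Timeout) ->
    try gen_server:call(Server, Request, Timeout) of
        Reply -> {ok, Reply}
    catch
        %% The server did not answer within Timeout milliseconds.
        exit:{timeout, _} -> {error, timeout};
        %% No process is registered under that name, or the pid is dead.
        exit:{noproc, _} -> {error, noproc};
        %% Any other exit raised by the call, e.g. the server died mid-call.
        exit:{Reason, _} -> {error, Reason}
    end.
```

For example, `safe_call_sketch:safe_call(some_unregistered_name, ping, 100)` returns `{error, noproc}` rather than exiting. One caveat: on releases before OTP 24, a reply from a server that was merely slow can still arrive in the caller's mailbox after the timeout, so a long-lived caller would need to discard such stray messages.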

Describe the solution you'd like

Describe alternatives you've considered

Our current workaround in production is to disable event synchronization with the server, which is not ideal because we would like the evaluation graphs to be populated.

kinyoklion commented 2 days ago

Hello @beligante,

Thank you for the feedback. Regarding get_last_server_time, I think we can make this concurrent without issue, and I will make a task to do that. (Filed internally as SDK-677)

ldclient_event_process_server could potentially become a pool at some point, but it is unlikely that ldclient_event_server would. Part of the purpose of the structure is to manage the volume of events transmitted. ldclient_event_server only manages a queue with a maximum capacity, and the batches are flushed asynchronously to this process. It could block less with more flushing processes, but we can remove the blocking behavior altogether by removing the relationship with get_last_server_time. If we make these other changes and there are still scaling issues, we would then consider having more processes for flushing.
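As a rough illustration of a concurrent read for the last server time, one option is to keep the value in an ETS table so readers never send a message to a process. This is only a sketch under assumptions: the module name, table name, and the default of 0 are hypothetical, and it is not a description of how the SDK implements or will implement get_last_server_time.

```erlang
-module(server_time_sketch).
-export([init/0, put_last_server_time/1, get_last_server_time/0]).

%% Hypothetical table name chosen for this sketch.
-define(TABLE, server_time_sketch).

%% Create a public named table. The calling process becomes the owner, so in
%% a real system this would run in a long-lived supervised process.
init() ->
    ets:new(?TABLE, [named_table, public, set, {read_concurrency, true}]),
    ok.

%% Called by whichever process learns the server time from a response.
put_last_server_time(Millis) when is_integer(Millis) ->
    true = ets:insert(?TABLE, {last_server_time, Millis}),
    ok.

%% Readers look the value up directly in ETS, with no message exchange.
%% Returning 0 when nothing has been stored is an assumption of this sketch.
get_last_server_time() ->
    case ets:lookup(?TABLE, last_server_time) of
        [{last_server_time, Millis}] -> Millis;
        [] -> 0
    end.
```

With this shape, a busy flushing process no longer delays readers, because the read path does not pass through any mailbox.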

We will also look into handling event disablement without requiring a message exchange. (Filed internally as SDK-678)
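One possible shape for checking event disablement without a message exchange is a read from `persistent_term`, sketched below. The module, the key layout, and the `SendFun` callback are hypothetical names for illustration only, and the default of "enabled" when nothing is stored is an assumption.

```erlang
-module(events_enabled_sketch).
-export([set_events_enabled/2, maybe_send_event/3]).

%% Store the setting once per client instance (Tag), e.g. at startup.
%% Updating a persistent term is comparatively expensive, so this suits
%% values that rarely change.
set_events_enabled(Tag, Enabled) when is_boolean(Enabled) ->
    persistent_term:put({?MODULE, Tag}, Enabled).

%% Check the setting in the calling process. When events are disabled,
%% return immediately without contacting any server process.
maybe_send_event(Tag, Event, SendFun) when is_function(SendFun, 1) ->
    case persistent_term:get({?MODULE, Tag}, true) of
        true -> SendFun(Event);
        false -> ok
    end.
```

With events disabled, `maybe_send_event/3` returns `ok` without calling `SendFun`, so a busy or crashed event process cannot affect the evaluating process.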

Thank you, Ryan