launchdarkly / ruby-server-sdk

LaunchDarkly Server-side SDK for Ruby
https://docs.launchdarkly.com/sdk/server-side/ruby

Streaming PATCH updates are not reflected in client store #76

Closed: bjreath closed this issue 7 years ago

bjreath commented 7 years ago

I'm running into the following scenario while upgrading to ldclient-rb 2.0.3 from ldclient-rb 0.7.0. The gem is used within a Rails 4.2.7.1 application on Puma.

The feature flags work fine in our development environment when the application first loads, and the correct values appear to be returned initially. However, if a user goes to the LaunchDarkly web interface and toggles a flag, the update is not reflected in the Rails application until the Rails server is killed and restarted. This led me to believe we might have an issue handling the PATCH server-sent events passed back from LaunchDarkly when a flag is changed.

In config/initializers/launch_darkly.rb:

config = LaunchDarkly::Config.new(connect_timeout: 2, read_timeout: 2, offline: false)
Rails.configuration.features_client = LaunchDarkly::LDClient.new(ld_api_key, config)

We then use this global in our code to check for enabled features. As is the default, the LaunchDarkly::StreamProcessor is configured as the update processor within the client. I put some logging into this class to see if the PATCH SSEs are being received and handled, and it does appear that they are.

message = JSON.parse(message.data, symbolize_names: true)
@config.logger.info("[LDClient] Before #{@store.get(message[:path][1..-1])}")
# message[:path] is e.g. "/flag-key"; [1..-1] strips the leading "/" to get the key
@store.upsert(message[:path][1..-1], message[:data])
@config.logger.info("[LDClient] After #{@store.get(message[:path][1..-1])}")

This correctly reflects a new version number for the flag and the proper on value in the message.

However, any time the global LaunchDarkly::LDClient#variation method is accessed on Rails.configuration.features_client for the flag, the variation code returns the initial version that was present when the LDClient was initialized on this line, which causes us to hit the condition here that returns the default flag value. It seems the update to the store is not being picked up by the client (even though accessing the store directly from the StreamProcessor logging statements above seems to indicate that the store was updated properly).

I'm continuing to dig in to see if I can identify the issue, but any pointers would be appreciated.

justinucd commented 7 years ago

Hi Brian,

Do you mind creating this as a support ticket? You can email your message to support@launchdarkly.com and it will automatically create a ticket. I will start investigating this right away though.

Best, Justin

bjreath commented 7 years ago

@justinucd Done. Let me know if I can help with any other info, or if you want to hop on a ScreenHero or Skype call.

bjreath commented 7 years ago

This was an issue with how we were establishing connections in a forked process environment (e.g. when using Spring or Puma). We are now reestablishing connections in Puma's on_worker_boot hook and Spring's after_fork hook and all is well.
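A minimal sketch of that approach, for anyone hitting the same thing; the environment variable name and timeout values are illustrative, not the original application's code:

# config/puma.rb
on_worker_boot do
  # Each worker creates its own client (and streaming connection) after the fork.
  config = LaunchDarkly::Config.new(connect_timeout: 2, read_timeout: 2)
  Rails.configuration.features_client =
    LaunchDarkly::LDClient.new(ENV["LD_SDK_KEY"], config)
end

# config/spring.rb
if defined?(Spring)
  Spring.after_fork do
    config = LaunchDarkly::Config.new(connect_timeout: 2, read_timeout: 2)
    Rails.configuration.features_client =
      LaunchDarkly::LDClient.new(ENV["LD_SDK_KEY"], config)
  end
end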

bjreath commented 7 years ago

@justinucd So handling things the way I mentioned above works fine to get the streaming updates to work, but now I'm getting this issue whenever I do a rails console and then exit out of it:

[screenshot: console output showing the hang on exit, 2016-11-10 11:32 am]

Here's the code we're using to establish our connection:

[screenshot: connection setup code, 2016-11-10 11:31 am]

This calls a simple helper function that returns the configured client, which all seems to be working fine.

The connection is being established, but for some reason it seems like Celluloid can't shut down its actors properly and my console hangs for a bit before exiting.
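One hedged workaround for the console hang (not from the original thread): explicitly shut the client down before the process exits. LDClient#close is known to exist in later SDK versions, so the respond_to? guard keeps the snippet safe on older releases.

at_exit do
  client = Rails.configuration.features_client
  # Close the client so its background workers stop before the process exits.
  client.close if client && client.respond_to?(:close)
end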

mattbradley commented 7 years ago

@bjreath

We are now reestablishing connections in Puma's on_worker_boot hook and Spring's after_fork hook and all is well.

How are you reestablishing the LD connection? Are you just setting your global to a new instance of LaunchDarkly::LDClient?

I've tried that (creating an LDClient in an initializer and then again in Puma's on_worker_boot), but it seems that the original LDClient sticks around receiving streaming events. This is evident from the duplicated log messages when the client receives PATCH updates:

[LDClient] Stream received patch message
[LDClient] Stream received patch message

I can solve the problem by removing the initializer, but then the client won't be available in rake tasks or rails console unless explicitly instantiated. So, I was wondering if you managed to solve this problem?
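A hedged sketch of the usual Puma pattern for this kind of duplication (analogous to disconnecting ActiveRecord before forking); whether it fully resolves the duplicate stream here is untested, and LDClient#close is only known to exist in later SDK versions:

# config/puma.rb
before_fork do
  # Shut down the client created by the initializer in the master process.
  client = Rails.configuration.features_client
  client.close if client && client.respond_to?(:close)
end

on_worker_boot do
  # Give each worker its own client and streaming connection.
  Rails.configuration.features_client =
    LaunchDarkly::LDClient.new(ENV["LD_SDK_KEY"], LaunchDarkly::Config.new)
end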

Manfred commented 7 years ago

I can verify that this problem exists in other environments; in our case we've seen it with Pow.cx (Rack on Node.js), Passenger under Apache, and Resque. So I'm guessing the problem exists in forking environments generally. Eventually I solved it by installing after-fork hooks for the various frameworks (sketched below).
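Hedged sketches of those hooks; FeatureFlags.reconnect! is a hypothetical helper standing in for whatever code rebuilds the LDClient for the current process:

# Passenger (smart spawning)
if defined?(PhusionPassenger)
  PhusionPassenger.on_event(:starting_worker_process) do |forked|
    FeatureFlags.reconnect! if forked
  end
end

# Resque workers
Resque.after_fork = proc { |_job| FeatureFlags.reconnect! }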


Manfred commented 7 years ago

The downside of using an after-fork hook to recreate the client is that it adds a lot of startup time to the forked child on its first variation query.

I tried solving this by keeping the feature store around so I could take the initialization hit once in the parent process. I tried passing it into the Config parameters when the client is re-created (see the sketch below). This doesn't work because StreamProcessor always starts uninitialized regardless of the state of the feature store.
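A sketch of that attempt, assuming the option is named feature_store as in later SDK versions; the constant and environment variable names are illustrative:

# Built once in the parent process so it already contains flag data.
SHARED_STORE = LaunchDarkly::InMemoryFeatureStore.new

Spring.after_fork do
  # Hand the pre-populated store to the client created after the fork.
  config = LaunchDarkly::Config.new(feature_store: SHARED_STORE)
  Rails.configuration.features_client =
    LaunchDarkly::LDClient.new(ENV["LD_SDK_KEY"], config)
end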

dlau commented 7 years ago

@Manfred -- is the startup cost of initializing in the forked child causing a slowdown in your application's critical path, or is it incurred only at startup?

I've been investigating related Celluloid issues around cleanly shutting down all running actors. Could someone running into these problems run their code with the following set globally:

require "celluloid/current"
$CELLULOID_DEBUG = true

Manfred commented 7 years ago

is the startup cost of initializing in the forked child causing a slowdown in your application's critical path, or is it incurred only at startup?

Currently we initialize the client on the first query, so it depends. The problem with the application is that queries are done at the model level, so they can be triggered in a web process, in background tasks (Sidekiq and legacy Resque jobs), or in one-off scripts which are triggered either by hand or through cron. We didn't want to force a query on process boot because some processes don't have to query at all.

The problem in our case has nothing to do with actors not shutting down properly as far as I can tell. Sharing IO handles between forks is simply not possible, so the current architecture of the gem will never work very well in forked environments. You will have to create an LDClient instance for every forked process.

I think the best solution for this problem depends on the size of the application. For smaller applications handling just a few requests per minute, it makes more sense to have an easy setup that works similarly to the current code. For larger deployments it makes much more sense to have a single process in the architecture take care of synchronization to a local cache, and have all the web, background, and script processes read from that cache.

dlau commented 7 years ago

@Manfred

In your larger deployments, would something like a Redis-backed feature store solve your needs? Think of it as a Ruby corollary to the java-client's RedisFeatureStore.

Further, have you taken a look at Redis storage and daemon mode in ld-relay?
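For reference, a hedged sketch of that setup using the APIs later versions of the Ruby SDK eventually shipped (they did not exist when this thread was written): ld-relay keeps the flag data in Redis, and each Ruby process reads from that shared store in daemon mode instead of opening its own stream.

store = LaunchDarkly::Integrations::Redis.new_feature_store(
  redis_url: "redis://localhost:6379/0",
  prefix: "launchdarkly"
)
# use_ldd: true tells the client to read flags from the store that ld-relay
# maintains instead of connecting to the streaming endpoint itself.
config = LaunchDarkly::Config.new(feature_store: store, use_ldd: true)
client = LaunchDarkly::LDClient.new(ENV["LD_SDK_KEY"], config)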

Manfred commented 7 years ago

@dlau No, I haven't looked at other solutions because I was mostly working on upgrading the gem in the project. We haven't prioritized working on a more optimal integration with LD.

Using a Redis-backed feature query interface with a separate update process was exactly what I was thinking about.

dlau commented 7 years ago

Great, I will bring up ruby client redis support with the team.

Regarding the previous issue with initializing LDClient under Spring and Puma, the documentation can be found here: http://docs.launchdarkly.com/docs/ruby-sdk-reference#section-initializing-ldclient-using-spring-puma

Closing this thread, as the remaining issues discussed here are no longer related to the original topic.