launchdarkly / node-server-sdk

LaunchDarkly Server-side SDK for Node

VariationDetails sync method #177

Closed elirazco-ma closed 1 year ago

elirazco-ma commented 4 years ago

Performance overhead calling async methods

From what I've seen in the implementation of VariationDetails, there aren't any async calls. Are there any plans to expose a sync version of this method, to avoid redundant async calls? The alternative at hand is doing something hacky to extend the interface, or forking the code.

eli-darkly commented 4 years ago

What you're missing is that both variation and variationDetail have to read at least one feature flag from the data store (and possibly other data too, if the flag turns out to have prerequisite flags or to refer to user segments). And although the default data store is an in-memory cache that could be read synchronously, that is not the only implementation. If you're using Consul, DynamoDB, or Redis as a persistent store, then reads are necessarily async. Since the SDK has to be able to handle all of those cases, the data store interface has to be async, and therefore so do the variation methods.
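To make the constraint concrete, here is a minimal sketch (with hypothetical class and function names, not the SDK's actual internals) of why the store interface, and everything built on it, must be async: the in-memory store could answer synchronously, but a networked store cannot, and both must satisfy the same interface.

```javascript
// Hypothetical, simplified feature stores -- not the SDK's real code.
class InMemoryFeatureStore {
  constructor() {
    this.flags = new Map([['my-flag', { key: 'my-flag', on: true }]]);
  }
  get(key, callback) {
    // The lookup itself is synchronous, but the interface is async
    // because other store implementations have to be.
    process.nextTick(() => callback(this.flags.get(key) || null));
  }
}

class NetworkedFeatureStore {
  get(key, callback) {
    // Stand-in for a Redis/DynamoDB/Consul read: real I/O, so the
    // callback can only ever fire on a later tick.
    setTimeout(() => callback({ key, on: false }), 10);
  }
}

// Evaluation code must be written against the async interface,
// regardless of which store is actually configured.
function evaluateFlag(store, key, callback) {
  store.get(key, (flag) => callback(flag ? flag.on : false));
}
```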

eli-darkly commented 4 years ago

It's true that there's overhead to this, and many people use only in-memory caching and not persistent stores. But in order to provide a faster synchronous path for that use case, we would have to basically write the whole flag evaluation code path twice (it's not possible to completely separate the "read the flag from the store" part from the "evaluate the flag" part, because it's not possible to know ahead of time all of the related flags and/or segments that might have to be read).

elirazco-ma commented 4 years ago

Thanks for your detailed response. Do you have any other suggestions on how to overcome this limitation? We are using the in-memory approach and we would really love to avoid the async overhead.

yaniv-moonactive commented 4 years ago

Hi Eli,

I do understand why you designed the interfaces to be async; it is also best practice for Node packages in general. However, we are running a system at high scale and are concerned about performance impacts: some features in our system are evaluated on every request, and the current implementation creates wasteful promises that serve no purpose.

In our use case we accept the limitation that we will not be able to use any async implementations (such as Redis), since those are not a feasible solution for our requirements anyway. I'm sure other projects share our use case, so I think this option should be available.

Regarding your comment about having to duplicate the functions, that part I do not understand. We looked into the SDK and (assuming the local memory implementation) it looks like the entire internal implementation is sync, so the only change would be to add a "sync" version which would then call the same internal functions. We would have made our own wrapper if we could, but those internal functions are not exposed, so we would have to fork the code to expose them (something I would like to avoid).

Simply exposing those internal functions would also work for us.

eli-darkly commented 4 years ago

We looked into the SDK and (assuming the local memory implementation) it looks like the entire internal implementation is sync, so the only change would be to add a "sync" version which would then call the same internal functions

That's incorrect; I can only assume that you're looking in the wrong place.

If you look in evaluate_flag.js, you'll see that nearly every step in the call chain is async. We really did not do that just for fun; it's that way because:

  1. any flag can have prerequisite flags, which must then be read from the data store;
  2. those prerequisites can have prerequisites of their own, which are handled recursively in the same way; and
  3. any clause in a flag rule could use the segmentMatch operator, in which case a user segment must be read from the data store.

And to clarify what I said earlier, that "it's not possible to know ahead of time all of the related flags and/or segments that might have to be read": it is possible to recursively determine the maximum set of things that we might have to read, but since reading from the data store can be expensive (if it's a persistent store), we do not want to read anything we might not actually use. For instance, say that we're evaluating a flag that has 5 prerequisite flags. If the first prerequisite fails (which we can't know until we evaluate it), then we're not going to evaluate the other 4, and so we should not read them. Similarly, if a flag has 5 rules and each rule references a user segment, and the first rule is a match, then the other 4 won't be needed and we shouldn't read those segments.
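The lazy-reading behavior described above can be sketched like this (simplified flag shapes and a fake store, not the SDK's real data model): each prerequisite is fetched only after the previous one has passed, so a failure short-circuits all remaining reads.

```javascript
// Sketch only: flags are plain objects with `on` and `prerequisites`.
async function checkPrerequisites(store, flag, reads) {
  for (const prereqKey of flag.prerequisites || []) {
    const prereq = await store.get(prereqKey); // read lazily, one at a time
    reads.push(prereqKey);
    if (!prereq || !prereq.on) {
      return false; // first failure: the remaining prerequisites are never read
    }
    // Prerequisites can have prerequisites of their own, handled recursively.
    if (!(await checkPrerequisites(store, prereq, reads))) {
      return false;
    }
  }
  return true;
}

// A fake async store for demonstration.
const demoStore = {
  flags: {
    top: { on: true, prerequisites: ['a', 'b', 'c'] },
    a: { on: false },
    b: { on: true },
    c: { on: true },
  },
  async get(key) {
    return this.flags[key];
  },
};
```

Evaluating `top` against `demoStore` reads only `a`: it fails, so `b` and `c` are never fetched, which is exactly the saving that an eager read-everything-up-front approach would lose.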

Now, if we know ahead of time that you are using an in-memory store and you definitely won't ever use a persistent one, then yes it would be safe to read all those things at the start... but then, as I said, we would need to maintain one version of the evaluation code path that does async lazy loading and another that doesn't.

danielmoonactive commented 4 years ago

Hello Eli,

I was wondering if it's possible to expose variationDetail / variation functions that don't create a promise behind the scenes, but instead accept two functions in place of the "resolve" and "reject" parameters, to be called appropriately. That way we can avoid creating a new Promise with every call to variation (or variationDetail), which will improve the performance of our application, and rely mostly on callbacks, which is fine for us.

Thanks!

eli-darkly commented 4 years ago

@danielmoonactive: The main difference between creating a promise and calling callbacks directly is that the promise will defer its callbacks using the equivalent of nextTick, and you're right that that is slightly slower than just calling the callback right away. And the SDK does use direct callbacks internally to maximize performance.

However, in general we have deliberately chosen not to do that at any point where the SDK code interacts with the caller. The issue is this. If I call a function that has async semantics, and usually the callback will be deferred via the event loop (because, for instance, some I/O happens), but sometimes it will be called right away (because it turns out that no I/O is necessary), then there is an ambiguity about execution order and this can lead to subtle problems that are very hard to track down. You can see what I mean in this example:

    someAsyncOperationThatTakesACallback(() => {
        weFinishedTheOperation();
    });
    weStartedTheOperation();

In this example, weStartedTheOperation represents something that we don't want to do until we know that the operation has been started. If we know that the callback will always be deferred by at least one tick, then this is safe. But if the function might sometimes call its callback directly with no deferral, then weFinishedTheOperation will be called before weStartedTheOperation and there's no way for the caller to know ahead of time which way it will be. If the caller and the callee are part of the same codebase (i.e. calls within our internal SDK code) then it's reasonable to say "we're confident that we're dealing with it OK because there are only 3 places in our code where that function is called, and we're not relying on execution order in any of those places, so a direct callback is OK"... but when it's a public API and we have no way to know how the caller's code is structured, that's not a good behavior.
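The fix for that ambiguity is to guarantee at least one tick of deferral. A minimal sketch (hypothetical function names) of both behaviors:

```javascript
// A function with inconsistent timing: sometimes it calls back
// synchronously (e.g. a cache hit), sometimes on a later tick.
function sometimesSync(cached, callback) {
  if (cached) {
    callback(); // fires before the caller's next statement!
  } else {
    setImmediate(callback);
  }
}

// The safe version defers the callback unconditionally, so the caller
// always sees consistent ordering regardless of the cache state.
function alwaysDeferred(cached, callback) {
  sometimesSync(cached, () => process.nextTick(callback));
}

const order = [];
alwaysDeferred(true, () => order.push('finished'));
order.push('started');
// 'started' is always recorded before 'finished', even on a cache hit.
```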

If you look at how the client methods are currently defined (including variation), you'll see that they do in fact have the option of passing a callback instead of returning a promise (using the standard Node pattern where a single callback receives two parameters, (err, result), rather than two callbacks as you suggested). However, under the covers they are using the function wrapPromiseCallback, which ensures that the callbacks are always deferred, for the reason I've just described.
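A hedged sketch of what a helper like wrapPromiseCallback needs to do (simplified; the SDK's actual implementation may differ): support both call styles, and route the optional callback through the promise so it is always deferred by at least one tick.

```javascript
// Simplified sketch, not the SDK's exact code.
function wrapPromiseCallback(promise, callback) {
  const wrapped = promise.then(
    (value) => {
      if (callback) callback(null, value);
      return value;
    },
    (error) => {
      if (callback) callback(error, null);
      return Promise.reject(error);
    }
  );
  if (callback) {
    // Callback style: don't also hand back a promise the caller would
    // have to handle; swallow rejections on the internal chain.
    wrapped.catch(() => {});
    return undefined;
  }
  return wrapped; // promise style
}

let result;
wrapPromiseCallback(Promise.resolve(42), (err, value) => {
  result = value;
});
// `result` is still undefined at this point: even though the promise was
// already resolved, the callback runs on a later tick.
```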

ariofrio commented 4 years ago

+1 for a sync version of variation. I am new to LaunchDarkly and this is the main reason why I've been hesitant to implement it, even after getting client buy-in. Implementing a feature flag deep in the code requires me to change the semantics of several functions that call it to return Promises, bubbling up all the way to the top level or near it. The semantic overhead is non-trivial, since:

  1. It's not straightforward to maintain the same behavior after promisification. For example, replacing list.map(() => {...}) with await Promise.all(list.map(async () => {...})) doesn't do the same thing. So now I've got to take a careful look at whether the mapping function in a tricky, network-based piece of code can run in parallel or not, when I was just trying to add a feature to a completely unrelated piece of code.
  2. It makes it harder to reason about performance, since functions that are not IO-bound now appear to be because they return a Promise.
  3. I have to change all calls to every function in the chain that is changed to use Promises, including tests. Then I have to change them all back when the feature flag is cleaned up.
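The first point in the list above can be shown concretely: a synchronous map runs each call to completion in order, while Promise.all(list.map(async ...)) starts every call before any finishes, so side effects interleave.

```javascript
// Demonstration of the changed execution order after promisification.
async function traceParallelMap() {
  const events = [];
  const work = async (i) => {
    events.push(`start ${i}`);
    await Promise.resolve(); // stand-in for an awaited flag evaluation
    events.push(`end ${i}`);
  };
  await Promise.all([0, 1].map(work));
  return events;
}
// A plain synchronous list.map would yield start 0, end 0, start 1, end 1;
// here both calls start before either finishes.
```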

In summary, being forced to use async variation significantly increases the cost of introducing a feature flag in Node.js codebases. I recognize that reducing this cost would add maintenance costs for you. I understand, of course, that in the end a decision will have to take both into account. 👍

And just to clarify, when using the in-memory store, variation always returns a Promise that is resolved in the very next tick, right? So it's not IO-bound ever? I tried to ask the support team about this, but they didn't seem to understand the question.

eli-darkly commented 4 years ago

And just to clarify, when using the in-memory store, variation always returns a Promise that is resolved in the very next tick, right? So it's not IO-bound ever?

There's no IO involved, but it may not be the very next tick, because there can be additional async operations required during flag evaluation. It's not just about the initial operation of getting the flag out of the data store.

This is because any flag can contain 1. prerequisite references which require another flag to be evaluated, and/or 2. rules that contain clauses that use the segment match operator, requiring a user segment to be evaluated. Therefore, if the flag has any prerequisites and/or rules, there is an async code path for those things. Prerequisites will definitely involve a data store query (so it will go at least to the next tick); rules only do if they reference a user segment. If there are neither of those things, we are usually able to get through the evaluation flow using direct callbacks instead of deferring anything else onto the event loop, although if a flag has a large number of rules, it might need to use setImmediate to avoid stack overflows (see utils/asyncUtils.js).
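The setImmediate trick mentioned at the end can be sketched as follows (illustrative only, not the SDK's actual asyncUtils code): run direct callbacks for speed, but bounce through the event loop every so many steps so a very long chain cannot overflow the stack.

```javascript
// Iterate `count` steps using direct recursion, unwinding the stack via
// setImmediate every 1000 steps to avoid a RangeError on long chains.
function iterateSafely(count, onStep, onDone, depth = 0) {
  if (count === 0) {
    onDone();
    return;
  }
  onStep();
  if (depth >= 1000) {
    setImmediate(() => iterateSafely(count - 1, onStep, onDone, 0));
  } else {
    iterateSafely(count - 1, onStep, onDone, depth + 1);
  }
}
```

With purely direct recursion, tens of thousands of steps would blow the stack; here the common case (a short chain) stays synchronous and fast, and only long chains pay for the extra ticks.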

I recognize that reducing this cost would add maintenance costs for you

It's not really (or not just) about maintenance or the amount of development time to change things; it's that it's not clear how this ought to work if we were to change it. Would there be two different code paths for evaluation (with, as I mentioned earlier, a lot of code duplication, so yes that is a maintenance issue), one of which would only work if you're using the in-memory store? Or would we try to make it work even if you're using a persistent store, by ensuring that everything is always cached in memory as well?

The latter would require bigger architectural changes, but again it's not clear how it ought to work. The SDK has not necessarily seen any given flag before it went into the data store (you can run it in a configuration that uses only the database and does not get live updates from LaunchDarkly) so if we completely get rid of the possibility that I/O could happen during an evaluation, then we are creating a real chance that the SDK would report that a flag does not exist when it really does.

crmitchelmore commented 3 years ago

New to LD. The first use case I went to try would inject the flag evaluation code in to third party code which is sync (graphql validation rules). Has there been any developments in this space?

eli-darkly commented 3 years ago

@crmitchelmore We'll update this issue as soon as there's anything to say. We've left the issue open because we haven't ruled it out, but we haven't committed to doing it and unfortunately it's not possible to comment on a possible time frame.

samatcolumn commented 1 year ago

Bumping this issue, as it's really blocking us from using LD everywhere on our backend. Now that @eli-darkly is no longer working at LD (according to their GitHub profile), is there someone else here who is owning this? Maybe @kinyoklion?

kinyoklion commented 1 year ago

Hello @samatcolumn,

We currently have no plans on supporting a synchronous API. Providing a subset of the client API that doesn't work with all features remains problematic (big segments and persistent stores cannot use a synchronous implementation).

Thank you, Ryan

samatcolumn commented 1 year ago

@kinyoklion appreciate the fast response! Just curious (if you have the time to humor me):

For some additional context on why it's so problematic for us to have an async function control feature flags, I think this classic blog post is the best explanation: https://journal.stuffwithstuff.com/2015/02/01/what-color-is-your-function/

kinyoklion commented 1 year ago

Hey @samatcolumn,

LaunchDarkly has two paradigms for SDKs. We have client-side SDKs, and server-side SDKs. One could better think of these as single-user (client) and multi-user (server).

In a single-user environment a server, remote from the SDK, does evaluations. These evaluations are then sent to the client during initialization or identify (or loaded from local persistence). Then a variation call just synchronously accesses the value of the flag. (The async parts are handled in the LaunchDarkly back-end)
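The single-user model can be sketched like this (hypothetical class, not the actual client-side SDK): because the values arrive precomputed, a variation call is just a synchronous map lookup.

```javascript
// Sketch of the client-side (single-user) model: evaluations happen on
// LaunchDarkly's servers, and the SDK holds only the resulting values.
class SingleUserClientSketch {
  constructor(precomputedFlags) {
    // e.g. the payload delivered during initialization or identify
    this.flags = precomputedFlags;
  }
  variation(key, defaultValue) {
    // No store reads, no prerequisites to chase: a plain lookup.
    return key in this.flags ? this.flags[key] : defaultValue;
  }
}
```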

In a multi-user environment the SDK gets a rule set on init (identify just produces events), then it does evaluations of a context against those rules.

There are two complexities here, which are big segment support and persistent store support. A server SDK can be initialized against a persistent store instead, and then accesses that store whenever a cached value expires or does not exist. So, when you evaluate a flag, either it or one of its prerequisites may not be in memory and needs to be read from the store.

The second complexity is big segments. A big segment is a segment that can contain millions of users and is not sent to the server SDK. Instead, when a big segment is evaluated, the SDK checks for that user's membership in a database.

Ahead of evaluation, it isn't known whether a specific context will encounter a prerequisite that isn't in memory, or whether it will need to evaluate a big segment.

Thank you, Ryan

samatcolumn commented 1 year ago

@kinyoklion thanks for taking the time to explain all of that!